Metro-Records / la-metro-councilmatic

:metro: An instance of councilmatic for LA Metro
MIT License

Sync/Scraping Schedule Documentation #451

Open shrayshray opened 5 years ago

shrayshray commented 5 years ago

Omar and I are putting together a reference for staff about what to expect from the syncing/scraping schedule. For example, if someone updates a published report in Legistar on a Tuesday afternoon and wants to know when the change will be reflected on the site, we can give them an accurate time frame (and know when to contact DataMade if the expected updates are not reflected).

Below is the latest information that Omar and I understand to be correct, plus a few questions. Could you please review it and confirm or correct our assumptions?

Sync/Scraping Schedule

Full Scrapes – All events and bills are scraped.
Windowed Scrapes – Partial scrapes: updates to bills previously scraped.

A. Meetings/Events

  1. Nightly:
     a. Full scrape at midnight
  2. Sunday – Friday, before 2pm PST:
     a. Windowed scrapes every 15 min starting on the hour (0,15,30,45)
  3. Friday, 2pm-10:50pm PST:
     a. Full scrape every hour on the hour
     b. Windowed scrapes at 30 and 45 after the hour (30,45)
  4. Friday, after 11 pm, through Saturday:
     a. Windowed scrapes every 15 min starting on the hour (0,15,30,45)

B. Reports/Bills

  1. Nightly:
     a. Full scrape at midnight
  2. Sunday – Friday, before 2pm PST:
     a. Windowed scrapes every 15 min starting 5 after the hour (5,20,35,50)
  3. Friday, 2pm-10:50pm PST:
     a. Full scrape every hour at 5 after the hour
     b. Windowed scrapes at 35 and 50 after the hour (35,50)
  4. Friday, after 11 pm, through Saturday:
     a. Windowed scrapes every 15 min starting 5 after the hour (5,20,35,50)
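For staff who want to check a specific time against the lists in A and B, the schedule as we currently understand it can be written as a small lookup. This is only a sketch of the assumptions above, not the actual scraper configuration: the function name `scrape_due` and the `OFFSETS` table are made up for illustration, times are Pacific wall-clock times, and it assumes the midnight full scrape takes precedence over a windowed scrape in the same slot.

```python
from datetime import datetime
from typing import Optional

# Minute offsets from the lists above: events run on the quarter hour,
# bills run five minutes later. (Hypothetical table, for illustration only.)
OFFSETS = {"events": 0, "bills": 5}


def scrape_due(kind: str, when: datetime) -> Optional[str]:
    """Return "full", "windowed", or None for a Pacific wall-clock time."""
    # Item 1: nightly full scrape at midnight (assumed to win over windowed).
    if when.hour == 0 and when.minute == 0:
        return "full"

    shifted = when.minute - OFFSETS[kind]
    if shifted < 0 or shifted % 15 != 0:
        return None  # not one of the four scheduled minutes in this hour
    slot = shifted // 15  # 0, 1, 2, 3 -> first through fourth run of the hour

    # Item 3: Friday 2pm-10:50pm (weekday() == 4 is Friday).
    if when.weekday() == 4 and 14 <= when.hour <= 22:
        if slot == 0:
            return "full"
        if slot in (2, 3):
            return "windowed"
        return None  # the second slot is not listed for this window

    # Items 2 and 4: every other scheduled slot is a windowed scrape.
    return "windowed"
```

For example, `scrape_due("bills", datetime(2020, 7, 7, 14, 20))` (a Tuesday) gives `"windowed"`, while `scrape_due("events", datetime(2020, 7, 10, 15, 0))` (a Friday) gives `"full"`.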

C. Live Event Video Links

  1. Is this description still an accurate account of the expected functionality?

Questions:

  1. Full Scrapes: Does this mean that all content in the Legistar system is re-synced regardless of date?
  2. Windowed scrapes: What is the time frame considered for this? Is there an "updated within/since" range?
     A. Is this compared to the "EventLastModifiedUtc" and "MatterLastModifiedUtc" fields respectively?
  3. What is covered by the Windowed scrapes:
     A. Events?
        a. Agenda?
           i. Download?
           ii. Cached view?
        b. Date?
        c. Time?
        d. Status?
        e. Bills?
        f. Packet
     B. Bills?
        a. Report?
           i. Download?
           ii. Cached view?
        b. Attachments
        c. Packet?
        d. Status?
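To make question 2 concrete, this is roughly what we imagine an "updated within/since" check would look like if it does compare against those fields. It is a guess at the behavior, not the scraper's code: the helper `changed_within`, the sample records, and the one-hour window are all hypothetical, and only the field name `EventLastModifiedUtc` comes from Legistar.

```python
from datetime import datetime, timedelta, timezone


def changed_within(records, field, window, now):
    """Keep records whose last-modified timestamp falls inside the window."""
    cutoff = now - window
    kept = []
    for record in records:
        # Assumes the field holds a UTC timestamp without an offset suffix.
        modified = datetime.fromisoformat(record[field]).replace(tzinfo=timezone.utc)
        if modified >= cutoff:
            kept.append(record)
    return kept


# Hypothetical sample data for illustration.
events = [
    {"EventId": 1, "EventLastModifiedUtc": "2020-07-10T20:30:00"},
    {"EventId": 2, "EventLastModifiedUtc": "2020-07-09T08:00:00"},
]
now = datetime(2020, 7, 10, 21, 0, tzinfo=timezone.utc)
recent = changed_within(events, "EventLastModifiedUtc", timedelta(hours=1), now)
```

Under this reading, only the first event would be picked up, and any change that does not touch the last-modified field would be missed until the next full scrape.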
jmithani commented 5 years ago

Here is a document explaining the current scraping/syncing schedule: https://docs.google.com/document/d/1h_PSQiO9qK-UaRxIJa5qObX6YQq18ErzRk1N2lMbciM/edit?usp=sharing

shrayshray commented 5 years ago

Awesome, thank you!!! Will review with Omar to see if we need any further clarification, but on initial review, this looks very thorough - thank you!

shrayshray commented 4 years ago

Per @hancush, the Google Doc linked above is out of date. Until it's updated, here is the info Hannah provided about the schedule, so we have it for reference.

Nightly, Saturday through Thursday
  - 8:05p PDT / 7:05p PST: Regular speed full person, event, and bill scrapes

Saturday through midday Friday
  - Every 15 minutes: Windowed bill and event scrapes of changes in the past 72 minutes*

Support window

PDT
  - 12a to 1:50p: Regular windowed scrapes
  - 2p to 10:50p: Fast full scrapes at the top of every hour; windowed bill and event scrapes of changes in the past day* twice an hour
  - 11p onward: Regular windowed scrapes

PST
  - 12a to 12:50p: Regular windowed scrapes
  - 1p to 9:50p: Fast full scrapes at the top of every hour; windowed bill and event scrapes of changes in the past day* twice an hour
  - 10p onward: Regular windowed scrapes

*Note that many changes to bills and events do not update the last updated flag, including toggling an agenda from private to public. This is why we run full scrapes so aggressively when we know changes like this are likely.
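The PDT and PST times above differ by exactly one hour, which is what you would see if the jobs were scheduled at fixed UTC times. The sketch below checks that arithmetic. The UTC times (03:05 for the nightly scrape, 21:00 to 05:50 for the fast full scrape window) are inferred from the list, not read from the scheduler configuration, and the helper `local_clock` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: Pacific Daylight Time is UTC-7, Pacific Standard Time is UTC-8.
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")


def local_clock(utc_hour, utc_minute, tz):
    """Convert a UTC wall-clock time to HH:MM in the given fixed-offset zone."""
    # The date is arbitrary; only the time of day matters here.
    moment = datetime(2020, 1, 1, utc_hour, utc_minute, tzinfo=timezone.utc)
    return moment.astimezone(tz).strftime("%H:%M")


# Inferred UTC times (assumptions) mapped back to the local times listed above.
nightly = (local_clock(3, 5, PDT), local_clock(3, 5, PST))
window_start = (local_clock(21, 0, PDT), local_clock(21, 0, PST))
window_end = (local_clock(5, 50, PDT), local_clock(5, 50, PST))
```

Here `nightly` comes out as 20:05 PDT and 19:05 PST, and the window runs from 14:00 to 22:50 PDT or 13:00 to 21:50 PST, matching the list. For staff, the practical point is that the local times shift by an hour when daylight saving time starts or ends.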

You can see that the nightly scrape runs in the middle-ish of the support window. Agendas were posted at 6 p.m. Friday, during the full scrape run. The full scrape prevents other scrapes from running, so the fast full scrapes that should have captured the updated events did not run. The full scrape also took nearly 7 hours to complete, so by the time it was done, the support window had ended and we had reverted to windowed scrapes every 15 minutes.

Because we run fast full scrapes at the top of every hour during the support window, the regular speed full scrape is redundant. So, I'd like to turn off the regular speed full scrape on Fridays, to prevent this from happening again.

hancush commented 4 years ago

Thank you for updating this issue, @shrayshray! Connects https://github.com/datamade/scrapers-us-municipal/issues/38.

neilarellano commented 10 months ago

Hi Team,

Can this be reviewed and updated based on the changes and the recent migration to Heroku?

Thanks!

antidipyramid commented 2 months ago

Part of this pull in the scrapers repo.

antidipyramid commented 3 weeks ago

Documentation is located here.