datamade / scrapers-us-municipal

Scrapers for US municipal governments.
MIT License

Potential way to speed scraping for LA Metro #23

Open fgregg opened 5 years ago

fgregg commented 5 years ago

Right now we aggressively scrape all events and all bills on Friday afternoons and evenings. We do this because changes to events and bills do not always modify the fields a windowed search relies on, so a windowed search alone would miss some updated information.

The bill scrape takes 22 minutes, which means that the maximum latency between a bill being updated by LAMetro and appearing on the councilmatic site is 22 minutes + polling frequency of import_data + time for import_data to run.
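
As a rough illustration of that bound, here is a minimal sketch of the arithmetic. Only the 22-minute scrape time comes from the measurement above; the polling interval and import run time used in the example are placeholder values, not measured ones.

```python
def worst_case_latency_minutes(scrape_minutes, polling_interval_minutes, import_run_minutes):
    """Upper bound on the time from an LA Metro update to its appearance on the site."""
    return scrape_minutes + polling_interval_minutes + import_run_minutes


# 22 is the measured bill scrape time; 15 and 5 are hypothetical placeholders.
print(worst_case_latency_minutes(22, 15, 5))  # prints 42
```

Shrinking the scrape term is the part the proposal below targets; the other two terms are unchanged by it.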

Since what LA Metro really cares about on Friday is that the agendas are accurate, we could take a somewhat different strategy that should decrease that latency.

  1. Go back to windowed search for updated bills
  2. Capture the unresolved bills from event scrapes. Direct a bill scraper to only try to scrape those unresolved bills.
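
A minimal sketch of the capture step in item 2, assuming the event scrape can hand us each agenda's bills along with a flag saying whether the bill resolved. The dictionary shape used here (`agenda_bills`, `identifier`, `resolved`) is hypothetical and is not what the scraper currently emits.

```python
def unresolved_bill_identifiers(events):
    """Return identifiers of agenda bills that did not resolve, in first-seen order.

    `events` is assumed to be an iterable of dicts shaped like
    {"agenda_bills": [{"identifier": str, "resolved": bool}, ...]}.
    This shape is hypothetical.
    """
    seen = {}  # a dict keeps insertion order and drops duplicates
    for event in events:
        for bill in event["agenda_bills"]:
            if not bill["resolved"]:
                seen.setdefault(bill["identifier"], None)
    return list(seen)
```

The resulting list would then be the only input to the targeted bill scrape, so its run time scales with the number of unresolved agenda items rather than with the full bill set.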

fgregg commented 5 years ago

For your consideration, @reginafcompton. Not time sensitive.

reginafcompton commented 5 years ago

I like the second proposal. A couple details that we need to think about:

  1. how the scraper can ingest the bill identifiers - right now, it uses matter_ids: https://github.com/opencivicdata/scrapers-us-municipal/blob/master/lametro/bills.py#L97
  2. what the scraper should do if it cannot find a bill. Raising a unique error seems like it would put us back at issue #24. Maybe just log it and skip it.
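
To make both points concrete, here is a rough sketch of what "log it and skip it" could look like. It assumes we have some way to map a bill identifier to a matter_id; `lookup_matter_id` and `scrape_matter` are hypothetical callables passed in for illustration and are not functions in the existing scraper.

```python
import logging

logger = logging.getLogger(__name__)


def scrape_bills(identifiers, lookup_matter_id, scrape_matter):
    """Yield one scraped bill per identifier that can be found.

    lookup_matter_id(identifier) is assumed to return a matter_id or None.
    scrape_matter(matter_id) is assumed to return the scraped bill.
    Both callables are hypothetical.
    """
    for identifier in identifiers:
        matter_id = lookup_matter_id(identifier)
        if matter_id is None:
            # Log and move on rather than raising, so one missing bill
            # does not abort the rest of the scrape.
            logger.warning("Could not find bill %s; skipping", identifier)
            continue
        yield scrape_matter(matter_id)
```

Passing the lookup in as a parameter leaves the first question open: it could be an API search by identifier or a table built during the event scrape.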

I know there's more to consider, but just noting some challenges that immediately come to mind.