datamade / scrapers-us-municipal

Scrapers for US municipal governments.
MIT License

Investigate more efficient Friday-night scraping approach #38

hancush closed 4 years ago

hancush commented 5 years ago

We run both full and windowed scrapes on Fridays. However, we prevent multiple scrapes from running at once, so in theory a full scrape could block a windowed scrape and keep recent changes from appearing for quite a while. Let's look into a Friday-night approach that balances efficiency with completeness.
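To make the blocking concrete, here is a minimal sketch of one way a "one scrape at a time" rule can be enforced, using a non-blocking file lock. This is an illustration only, not necessarily how the scrapers are actually scheduled: the lock path and the `scrape_lock` helper are hypothetical, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file location, chosen only for this sketch.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "lametro_scrape.lock")


@contextmanager
def scrape_lock(path=LOCK_PATH):
    """Yield True if this caller got the exclusive lock, False otherwise."""
    handle = open(path, "w")
    acquired = False
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        if acquired:
            fcntl.flock(handle, fcntl.LOCK_UN)
        handle.close()


# A long full scrape holds the lock; a windowed scrape that starts
# in the meantime cannot acquire it and is skipped.
with scrape_lock() as full_scrape_running:
    with scrape_lock() as windowed_scrape_running:
        print(full_scrape_running, windowed_scrape_running)
```

Under this scheme, every windowed scrape that starts while the full scrape holds the lock is skipped, which is the failure mode described above.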

hancush commented 4 years ago

On Fridays, we already run full event and bill scrapes at the top of every hour, which makes most of the regular full scrape redundant. I propose we nix the regular full scrape on Fridays and run only a person scrape instead. This should remove the blocker!
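A rough sketch of that proposal, assuming the regular run normally covers people, bills, and events (the scraper names that appear in the import logs). The function name and the non-Friday list are assumptions for illustration, not the actual scheduling code.

```python
from datetime import datetime

FRIDAY = 4  # datetime.weekday() counts Monday as 0


def regular_scrape_targets(now):
    """Return the scrapers the regular (non-hourly) run should cover."""
    if now.weekday() == FRIDAY:
        # Hourly full event and bill scrapes already run on Fridays,
        # so the regular run only needs people.
        return ["people"]
    # Assumed contents of the regular full scrape on other days.
    return ["people", "bills", "events"]


print(regular_scrape_targets(datetime(2020, 4, 10)))  # a Friday
print(regular_scrape_targets(datetime(2020, 4, 13)))  # a Monday
```

Keeping the decision in one small function makes the Friday exception easy to test without touching the scheduler itself.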

hancush commented 4 years ago

The full bill scrape took almost seven hours last night!!!

lametro (scrape)
  bills: {'window': '0'}
bills scrape:
  duration:  6:45:34.103592
  objects:
    bill: 3083
    vote_event: 1489
jurisdiction scrape:
  duration:  0:00:00.158219
  objects:
    jurisdiction: 1
    organization: 3
    post: 18
04/11/2020 03:16:21 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Planning and Development (Department)"}
04/11/2020 03:16:26 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Operations (Department)"}
04/11/2020 03:16:28 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Program Management (Department)"}
04/11/2020 03:16:28 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Maria Luk"}
04/11/2020 03:16:36 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "-"}
04/11/2020 03:16:41 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Fe Dalida"}
04/11/2020 03:17:43 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Chris Reyes"}
04/11/2020 03:18:59 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "James Butts"}
04/11/2020 03:18:59 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Jacquelyn Dupont-Walker"}
04/11/2020 03:18:59 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Ara Najarian"}
04/11/2020 03:19:19 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Martha Welborne"}
04/11/2020 03:19:40 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "OCEO (Department)"}
lametro (import)
  people: {}
  events: {}
  bills: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
  people: {}
  events: {}
  bills: {}
import:
  bill: 0 new 0 updated 3083 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop
  post: 0 new 0 updated 18 noop
  vote_event: 0 new 0 updated 1489 noop

hancush commented 4 years ago

In other words, the slow full scrape blocked other scrapes for almost the entire support window. 😓
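For anyone who wants to compare run times across logs, here is a small sketch that converts the `duration` string from the bill scrape log above into hours. The `parse_duration` helper is hypothetical, and it assumes the `H:MM:SS.ffffff` form shown in the log; a duration longer than a day may be printed differently and would not parse.

```python
from datetime import timedelta


def parse_duration(text):
    """Parse an 'H:MM:SS.ffffff' duration string into a timedelta."""
    hours, minutes, seconds = text.strip().split(":")
    return timedelta(
        hours=int(hours), minutes=int(minutes), seconds=float(seconds)
    )


bill_scrape = parse_duration("6:45:34.103592")
print(round(bill_scrape.total_seconds() / 3600, 2))  # 6.76
```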

Manually ran a full event scrape to post agendas this morning.

lametro (scrape)
  events: {}
events scrape:
  duration:  0:06:34.156762
  objects:
    event: 391
jurisdiction scrape:
  duration:  0:00:01.411950
  objects:
    jurisdiction: 1
    organization: 3
    post: 18
lametro (import)
  people: {}
  events: {}
  bills: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
  people: {}
  events: {}
  bills: {}
import:
  event: 1 new 6 updated 384 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop
  post: 0 new 0 updated 18 noop

hancush commented 4 years ago

We addressed this.