code4sac / sacramento-campaign-finance

Dataset and dashboard of money in local politics
https://sacramento-campaign-cash.netlify.app/

GitHub Action for web scraper is broken #12

Closed by natebass 11 months ago

natebass commented 1 year ago

The automated download of new data with GitHub Actions is failing while running scripts/index.js. The error is caused by Puppeteer exceeding its navigation timeout of 1 minute (60000 ms).

Output log

Run node scripts/index.js
##[debug]/usr/bin/bash -e /home/runner/work/_temp/999e91d3-59a1-49b2-a485-79af21dc99f1.sh
Starting at Sat Aug 05 2023 22:29:12 GMT+0000 (Coordinated Universal Time)
Ok, running for City Council (SAC - 2023)
Downloading City Council (SAC - 2023)...
Ok, running for Board of Supervisors (SCO - 2023)
file:///home/runner/work/sacramento-campaign-finance/sacramento-campaign-finance/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:24
                    this.reject(new TimeoutError(opts.message));
                                ^

TimeoutError: Navigation timeout of 60000 ms exceeded
    at Timeout.<anonymous> (file:///home/runner/work/sacramento-campaign-finance/sacramento-campaign-finance/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:24:33)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)
Error: Process completed with exit code 1.

jeremiak commented 1 year ago

Hmm, for some reason it looks like it recovered by itself in this run from yesterday?

natebass commented 1 year ago

Interesting, we can keep an eye on it.

I don't think it's related to the Puppeteer headless change, because that change was reverted and the deprecation warning shows up in both the succeeding and the failing runs.

Possible solutions:

Successful log:

Run node scripts/index.js
Starting at Sat Aug 12 2023 01:04:07 GMT+0000 (Coordinated Universal Time)
Ok, running for City Council (SAC - 2023)
Downloading City Council (SAC - 2023)...

  Puppeteer old Headless deprecation warning:
    In the near feature `headless: true` will default to the new Headless mode
    for Chrome instead of the old Headless implementation. For more
    information, please see https://developer.chrome.com/articles/new-headless/.
    Consider opting in early by passing `headless: "new"` to `puppeteer.launch()`
    If you encounter any bugs, please report them to https://github.com/puppeteer/puppeteer/issues/new/choose.

Downloaded City Council (SAC - 2023)
Extracting City Council (SAC - 2023)...
Extracted City Council (SAC - 2023)
Transforming City Council (SAC - 2023)...
Transformed City Council (SAC - 2023)
Done with City Council (SAC - 2023)
Ok, running for Board of Supervisors (SCO - 2023)
Downloading Board of Supervisors (SCO - 2023)...

  Puppeteer old Headless deprecation warning:
    In the near feature `headless: true` will default to the new Headless mode
    for Chrome instead of the old Headless implementation. For more
    information, please see https://developer.chrome.com/articles/new-headless/.
    Consider opting in early by passing `headless: "new"` to `puppeteer.launch()`
    If you encounter any bugs, please report them to https://github.com/puppeteer/puppeteer/issues/new/choose.

Downloaded Board of Supervisors (SCO - 2023)
Extracting Board of Supervisors (SCO - 2023)...
Extracted Board of Supervisors (SCO - 2023)
Transforming Board of Supervisors (SCO - 2023)...
Transformed Board of Supervisors (SCO - 2023)
Done with Board of Supervisors (SCO - 2023)
Loading JSON files into one database
Finished at Sat Aug 12 2023 01:05:08 GMT+0000 (Coordinated Universal Time), took about 2 minutes

Failing log:

Run node scripts/index.js
Starting at Fri Aug 11 2023 01:05:03 GMT+0000 (Coordinated Universal Time)
Ok, running for City Council (SAC - 2023)

Downloading City Council (SAC - 2023)...
  Puppeteer old Headless deprecation warning:
    In the near feature `headless: true` will default to the new Headless mode
    for Chrome instead of the old Headless implementation. For more
    information, please see https://developer.chrome.com/articles/new-headless/.
    Consider opting in early by passing `headless: "new"` to `puppeteer.launch()`
    If you encounter any bugs, please report them to https://github.com/puppeteer/puppeteer/issues/new/choose.

Ok, running for Board of Supervisors (SCO - 2023)
file:///home/runner/work/sacramento-campaign-finance/sacramento-campaign-finance/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:24
                    this.reject(new TimeoutError(opts.message));
                                ^

TimeoutError: Navigation timeout of 120000 ms exceeded
    at Timeout.<anonymous> (file:///home/runner/work/sacramento-campaign-finance/sacramento-campaign-finance/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:24:33)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)
Error: Process completed with exit code 1.

jeremiak commented 1 year ago

Seems like it's broken again, as in this recent GitHub Actions run. But it's weird because it works totally fine locally on my Mac when I run node scripts/index.js.

jeremiak commented 1 year ago

The step that breaks seems to be the download step which involves a headless browser, Puppeteer specifically. But it turns out that Netfile has an API, including a route to download a CSV version of the data. Maybe we don't need an automated browser at all?

Here's the route for the current year in Sacramento (SAC):

https://netfile.com/Connect2/api/public/campaign/export/cal201/transaction/year/csv?Aid=SAC&Year=2023&format=csv

And Sac County (SCO):

https://netfile.com/Connect2/api/public/campaign/export/cal201/transaction/year/csv?Aid=SCO&Year=2023&format=csv
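For illustration, here is a rough sketch of what skipping the browser could look like. The helper names `buildExportUrl` and `downloadExport` are made up for this example and are not in the repo. It assumes Node 18+ (global `fetch`) and that the route needs no authentication. Whether the response body is plain CSV or an archive isn't confirmed here, so the sketch returns raw bytes.

```javascript
// Base route copied from the two URLs above.
const EXPORT_ROUTE =
  "https://netfile.com/Connect2/api/public/campaign/export/cal201/transaction/year/csv";

// Hypothetical helper: builds the export URL for an agency ID ("SAC", "SCO") and a year.
function buildExportUrl(aid, year) {
  const url = new URL(EXPORT_ROUTE);
  url.searchParams.set("Aid", aid);
  url.searchParams.set("Year", String(year));
  url.searchParams.set("format", "csv");
  return url.toString();
}

// Hypothetical helper: downloads the export and returns the raw body as a Buffer.
// Assumption: Node 18+ and no authentication required by the route.
async function downloadExport(aid, year, timeoutMs = 60_000) {
  const response = await fetch(buildExportUrl(aid, year), {
    // Abort the request if it takes longer than timeoutMs.
    signal: AbortSignal.timeout(timeoutMs),
  });
  if (!response.ok) {
    throw new Error(`Export for ${aid} ${year} failed with HTTP ${response.status}`);
  }
  return Buffer.from(await response.arrayBuffer());
}
```

Something like `await downloadExport("SCO", 2023)` would then stand in for the Puppeteer download step, with the existing extract and transform steps left as they are.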

jeremiak commented 1 year ago

And it's working again.

jeremiak commented 1 year ago

This thing is starting to look like a set of Christmas lights :/

Screen Shot 2023-09-25 at 7 39 05 PM

natebass commented 11 months ago

This is fixed by setting Puppeteer's timeout to infinite in https://github.com/code4sac/sacramento-campaign-finance/commit/74a86df254fe873258ce364ec6586d702cddb5a6. We now have successful runs, although they sometimes take 17 minutes for some reason.
Screenshot 2023-11-08 152248
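
For anyone landing here later, a minimal sketch of one way to turn off Puppeteer's timeouts. This is not a copy of the linked commit, and the helper name `disableTimeouts` is made up for this example. It relies on `page.setDefaultNavigationTimeout` and `page.setDefaultTimeout`, where passing `0` disables the limit.

```javascript
// Hypothetical helper: takes a Puppeteer Page and turns off its timeouts.
// Illustration only; the actual change in the commit may look different.
function disableTimeouts(page) {
  // 0 disables the navigation timeout (page.goto, page.waitForNavigation, ...).
  page.setDefaultNavigationTimeout(0);
  // 0 also disables the default timeout for the other waitFor* calls.
  page.setDefaultTimeout(0);
  return page;
}
```

It would be called right after `const page = await browser.newPage()`. With no timeout in Puppeteer, a hung download would only stop when the job itself is cancelled, so a `timeout-minutes` value on the workflow step may be worth considering.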