Closed hancush closed 4 years ago
Potentially related to https://github.com/opencivicdata/scrapers-us-municipal/issues/328.
I unzipped the most recent log archive:
ubuntu@ip-10-0-0-80:~$ gunzip /tmp/lametro.log.2 -k -c > lametro.log.2
The last Metro scrape we have logs for ran on 7/18:
ubuntu@ip-10-0-0-80:~$ tail -n 100 lametro.log.2
07/18/2020 18:51:32 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6858
07/18/2020 18:51:32 INFO pupa: save bill 2020-0416 in 2019 as bill_9b266a2c-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:32 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6863
07/18/2020 18:51:33 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6863
07/18/2020 18:51:33 INFO pupa: save bill 2020-0421 in 2019 as bill_9bc74096-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:33 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6616
07/18/2020 18:51:34 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6616
07/18/2020 18:51:34 INFO pupa: save bill 2020-0174 in 2019 as bill_9c572b84-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:34 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6877
07/18/2020 18:51:35 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6877
07/18/2020 18:51:35 INFO pupa: save bill 2020-0435 in 2019 as bill_9cf548f0-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:35 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6878
07/18/2020 18:51:36 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6878
07/18/2020 18:51:36 INFO pupa: save bill 2020-0436 in 2019 as bill_9d88a744-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:36 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6893
07/18/2020 18:51:37 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6893
07/18/2020 18:51:37 INFO pupa: save bill 2020-0451 in 2019 as bill_9e26f4f8-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:37 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6906
07/18/2020 18:51:38 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6906
07/18/2020 18:51:38 INFO pupa: save bill 2020-0464 in 2020 as bill_9ebe73fa-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:38 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6721
07/18/2020 18:51:39 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6721
07/18/2020 18:51:39 INFO pupa: save bill 2020-0279 in 2019 as bill_9f55e4ec-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:39 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6913
07/18/2020 18:51:40 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6913
07/18/2020 18:51:40 INFO pupa: save bill 2020-0471 in 2020 as bill_9ff07c00-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:40 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6924
07/18/2020 18:51:41 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6924
07/18/2020 18:51:41 INFO pupa: save bill 2020-0482 in 2020 as bill_a089326a-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:41 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6918
07/18/2020 18:51:42 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6918
07/18/2020 18:51:42 INFO pupa: save bill 2020-0476 in 2020 as bill_a129eeb2-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:42 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6925
07/18/2020 18:51:43 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6925
07/18/2020 18:51:43 INFO pupa: save bill 2020-0483 in 2020 as bill_a1bd175a-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:43 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6898
07/18/2020 18:51:44 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6898
07/18/2020 18:51:44 INFO pupa: save bill 2020-0456 in 2019 as bill_a25853f0-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:44 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6927
07/18/2020 18:51:45 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6927
07/18/2020 18:51:45 INFO pupa: save bill 2020-0485 in 2020 as bill_a2f4b24a-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:45 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6928
07/18/2020 18:51:46 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6928
lametro (scrape)
bills: {'window': '0.05'}
Not checking sessions...
lametro (scrape)
bills: {'window': '0.05'}
bills scrape:
duration: 0:01:29.129319
objects:
bill: 71
jurisdiction scrape:
duration: 0:00:01.172966
objects:
jurisdiction: 1
organization: 3
post: 18
lametro (import)
bills: {}
events: {}
people: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
bills: {}
events: {}
people: {}
import:
bill: 0 new 0 updated 71 noop
jurisdiction: 0 new 0 updated 1 noop
organization: 0 new 0 updated 3 noop
post: 0 new 0 updated 18 noop
lametro (import)
people: {}
bills: {}
events: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
people: {}
bills: {}
events: {}
import:
bill: 0 new 0 updated 71 noop
jurisdiction: 0 new 0 updated 1 noop
organization: 0 new 0 updated 3 noop
post: 0 new 0 updated 18 noop
I'm not certain, but when I was spelunking re: https://github.com/opencivicdata/scrapers-us-municipal/issues/328, I checked the max updated date on bills in the Councilmatic database, and it may have been 7/18. Are the scrapes not actually running? We break the Metro scrapes into scrape and import steps, so I would at least expect to see logs for the scrapes, even if the imports fail. Very curious...
We last deployed the scrapers at 9 a.m. 7/17, which is 2 p.m. UTC time. Per the logs, scrapes continued to run through Saturday, 7/18, before "stopping" (or not being logged). So I'm not sure that this was caused by a change to the deployed code.
FWIW, the last scrape coincides with the Saturday windowed scraping schedule:
It's as if scheduling for the new week never kicked in?
This would seem to be the scrape where a huge number of bills were updated:
07/18/2020 00:55:23 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6898
07/18/2020 00:55:23 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6898
07/18/2020 00:55:23 INFO pupa: save bill 2020-0456 in 2019 as bill_450cf4da-c8bb-11ea-b551-122a3d729da3.json
07/18/2020 00:55:23 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6928
07/18/2020 00:55:23 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6928
07/18/2020 00:55:23 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6927
07/18/2020 00:55:23 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6927
07/18/2020 00:55:23 INFO pupa: save bill 2020-0485 in 2020 as bill_45139eca-c8bb-11ea-b551-122a3d729da3.json
lametro (scrape)
bills: {'window': '0'}
Not checking sessions...
lametro (scrape)
bills: {'window': '0'}
bills scrape:
duration: 0:50:18.000070
objects:
bill: 2577
jurisdiction scrape:
duration: 0:00:00.134372
objects:
jurisdiction: 1
organization: 3
post: 18
lametro (import)
events: {}
bills: {}
people: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
events: {}
bills: {}
people: {}
import:
----------> bill: 0 new 1906 updated 671 noop <----------
jurisdiction: 0 new 0 updated 1 noop
organization: 0 new 0 updated 3 noop
post: 0 new 0 updated 18 noop
lametro (import)
people: {}
bills: {}
events: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
people: {}
bills: {}
events: {}
import:
bill: 0 new 1906 updated 671 noop
jurisdiction: 0 new 0 updated 1 noop
organization: 0 new 0 updated 3 noop
post: 0 new 0 updated 18 noop
Confirming that both of the public bills Shelly provided as examples had inaccessible gateway links during the implicated scrape:
ubuntu@ip-10-0-0-80:~$ grep 07/18/2020 lametro.log.3 | grep 6217
07/18/2020 00:51:52 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6217
07/18/2020 00:51:52 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6217
ubuntu@ip-10-0-0-80:~$ grep 07/18/2020 lametro.log.3 | grep 6573
07/18/2020 00:53:28 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6573
07/18/2020 00:53:28 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6573
Cc https://github.com/opencivicdata/scrapers-us-municipal/issues/328
Hm, it seems like there are a bunch of Metro scrapes just chilling on the OCD server:
ubuntu@ip-10-0-0-80:~$ ps --sort=start_time aux | grep [l]ametro
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
datamade 9472 0.0 0.0 4448 104 ? Ss Jul18 0:00 /bin/sh -c /usr/bin/flock -n /tmp/metroevents.lock -c $APPDIR/scripts/lametro/fast-full-event-scrape.sh >> /tmp/lametro.log
datamade 9474 0.0 0.0 5904 84 ? S Jul18 0:00 /usr/bin/flock -n /tmp/metroevents.lock -c /home/datamade/scrapers-us-municipal/scripts/lametro/fast-full-event-scrape.sh
datamade 9475 0.0 0.0 4448 104 ? S Jul18 0:00 /bin/sh -c /home/datamade/scrapers-us-municipal/scripts/lametro/fast-full-event-scrape.sh
datamade 9476 0.0 0.0 4448 100 ? S Jul18 0:00 /bin/sh /home/datamade/scrapers-us-municipal/scripts/lametro/fast-full-event-scrape.sh
datamade 9480 0.0 7.3 235912 151584 ? S Jul18 0:44 /home/datamade/.virtualenvs/opencivicdata/bin/python3.5 /home/datamade/.virtualenvs/opencivicdata/bin/pupa update --datadir=/cache/events/_data/ lametro --scrape events --rpm=0
datamade 30949 0.0 0.0 4448 108 ? Ss Jul19 0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 30951 0.0 0.0 5904 92 ? S Jul19 0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 30955 0.0 0.0 5904 68 ? S Jul19 0:00 /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 29715 0.0 0.0 4448 108 ? Ss Jul20 0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 29721 0.0 0.0 5904 64 ? S Jul20 0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 27298 0.0 0.0 4448 104 ? Ss Jul21 0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 27301 0.0 0.0 5904 72 ? S Jul21 0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 18780 0.0 0.0 4448 268 ? Ss Jul22 0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 18781 0.0 0.0 5904 68 ? S Jul22 0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 24007 0.0 0.0 4448 268 ? Ss 00:05 0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 24012 0.0 0.0 5904 68 ? S 00:05 0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
This would explain why there are no logs – old file locks are preventing the new scrapes from running. How did this happen? And is it safe to terminate them?
We use flock
to manage locking files. Usually, we do this with the -n
flag so commands fail, rather than block, if they can't acquire a lock.
However! It looks like the full scrape does not have this option:
So, that might explain the build up of full scrapes. But why are the windowed scrapes hanging out???
More on flock
: https://linux.die.net/man/1/flock
Oh, there's just the one fast full event scrape causing the bottleneck... So why is that hung????
Decided that it wasn't worth the time investment to investigate the bottleneck since we are moving away from this pattern for Metro. Instead, I killed each of the blocking processes with sudo kill -9
, then (superstitiously) cleared the locks with flock -u /tmp/${LOCKNAME}.lock echo 'dropped
(-u
is unlock).
Hark! The next regularly scheduled event scrape ran and logged as expected.
I'm trying to debug some strange behavior in the Metro scraper. I'm noticing that the main log (
/tmp/lametro.log
) is empty, and the log before (/tmp/lametro.log.1
) only has logs from the scrape I ran manually, and nothing else.I searched the syslog to confirm scrapes were running, and it seems like they are, and that they're configured to append logs to the expected log.
So why don't we have any logs????
Importantly, both the Chicago and NYC scrapers seem to be updating their respective logs as expected!