datamade / scrapers-us-municipal

Scrapers for US municipal governments.
MIT License
10 stars 8 forks source link

Where are the Metro logs? #55

Closed hancush closed 4 years ago

hancush commented 4 years ago

I'm trying to debug some strange behavior in the Metro scraper. I'm noticing that the main log (/tmp/lametro.log) is empty, and the log before (/tmp/lametro.log.1) only has logs from the scrape I ran manually, and nothing else.

ubuntu@ip-10-0-0-80:~$ wc -l /tmp/lametro.log
0 /tmp/lametro.log

I searched the syslog to confirm scrapes were running, and it seems like they are, and that they're configured to append logs to the expected log.

ubuntu@ip-10-0-0-80:~$ grep CRON /var/log/syslog | grep lametro
...
Jul 22 13:00:01 ip-10-0-0-80 CRON[29955]: (datamade) CMD (/usr/bin/flock -n /tmp/metroevents.lock -c "WINDOW=0.05 $APPDIR/scripts/lametro/windowed-event-scrape.sh" >> /tmp/lametro.log)
Jul 22 13:05:01 ip-10-0-0-80 CRON[32387]: (datamade) CMD (/usr/bin/flock -n /tmp/metrobills.lock -c "WINDOW=0.05 $APPDIR/scripts/lametro/windowed-bill-scrape.sh" >> /tmp/lametro.log)
Jul 22 13:15:01 ip-10-0-0-80 CRON[8136]: (datamade) CMD (/usr/bin/flock -n /tmp/metroevents.lock -c "WINDOW=0.05 $APPDIR/scripts/lametro/windowed-event-scrape.sh" >> /tmp/lametro.log)
Jul 22 13:20:01 ip-10-0-0-80 CRON[10888]: (datamade) CMD (/usr/bin/flock -n /tmp/metrobills.lock -c "WINDOW=0.05 $APPDIR/scripts/lametro/windowed-bill-scrape.sh" >> /tmp/lametro.log)

So why don't we have any logs????

Importantly, both the Chicago and NYC scrapers seem to be updating their respective logs as expected!

ubuntu@ip-10-0-0-80:~$ tail -n 50 /tmp/chicago.log
07/22/2020 08:25:57 INFO pupa: virtual hearing
07/22/2020 08:25:57 INFO scrapelib: GET - http://webapi.legistar.com/v1/chicago/events/5693/eventitems
07/22/2020 08:25:57 INFO scrapelib: GET - http://webapi.legistar.com/v1/chicago/events/5693/eventitems
07/22/2020 08:25:58 INFO pupa: save event 2020-07-30 10:00:00-05:00 Committee on Enviromental Protection and Energy as event_e06d6264-cc1e-11ea-9243-122a3d729da3.json
07/22/2020 08:25:58 INFO scrapelib: GET - https://chicago.legistar.com/MeetingDetail.aspx?LEGID=5694&GID=106&G=503964DF-290E-455B-86CC-7B03A231E984
/home/datamade/.virtualenvs/opencivicdata/lib/python3.5/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
07/22/2020 08:25:59 INFO pupa: virtual meeting
07/22/2020 08:25:59 INFO scrapelib: GET - http://webapi.legistar.com/v1/chicago/events/5694/eventitems
07/22/2020 08:25:59 INFO scrapelib: GET - http://webapi.legistar.com/v1/chicago/events/5694/eventitems
07/22/2020 08:26:00 INFO pupa: save event 2020-07-29 10:00:00-05:00 Committee on Public Safety as event_e15ce50a-cc1e-11ea-9243-122a3d729da3.json
07/22/2020 08:26:00 INFO scrapelib: GET - https://chicago.legistar.com/MeetingDetail.aspx?LEGID=5698&GID=106&G=503964DF-290E-455B-86CC-7B03A231E984
/home/datamade/.virtualenvs/opencivicdata/lib/python3.5/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
07/22/2020 08:26:00 INFO pupa: virtual meeting
07/22/2020 08:26:00 INFO scrapelib: GET - http://webapi.legistar.com/v1/chicago/events/5698/eventitems
07/22/2020 08:26:01 INFO scrapelib: GET - http://webapi.legistar.com/v1/chicago/events/5698/eventitems
07/22/2020 08:26:02 INFO pupa: save event 2020-09-02 10:00:00-05:00 Committee on Budget and Government Operations as event_e280c85c-cc1e-11ea-9243-122a3d729da3.json
chicago (scrape, import)
  events: {'window': '0.05'}
Not checking sessions...
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
chicago (scrape, import)
  events: {'window': '0.05'}
events scrape:
  duration:  0:00:37.681782
  objects:
    event: 13
jurisdiction scrape:
  duration:  0:00:00.290168
  objects:
    jurisdiction: 1
    membership: 6
    organization: 3
    person: 6
    post: 52
import:
  event: 0 new 0 updated 13 noop
  jurisdiction: 0 new 0 updated 1 noop
  membership: 0 new 0 updated 6 noop
  organization: 0 new 0 updated 3 noop
  person: 0 new 0 updated 6 noop
  post: 0 new 0 updated 52 noop
ubuntu@ip-10-0-0-80:~$ tail -n 50 /tmp/nyc.log
/home/datamade/.virtualenvs/opencivicdata/lib/python3.5/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
07/22/2020 08:30:37 INFO scrapelib: POST - https://legistar.council.nyc.gov/Calendar.aspx/
/home/datamade/.virtualenvs/opencivicdata/lib/python3.5/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
07/22/2020 08:30:37 INFO scrapelib: GET - https://legistar.council.nyc.gov/View.ashx?M=IC&ID=797385&GUID=4A687CD7-993D-436D-A16B-F35EBD0B128A
/home/datamade/.virtualenvs/opencivicdata/lib/python3.5/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
07/22/2020 08:30:38 INFO scrapelib: GET - https://webapi.legistar.com/v1/nyc/events/18038/eventitems
07/22/2020 08:30:38 INFO scrapelib: GET - https://webapi.legistar.com/v1/nyc/events/18038/eventitems
07/22/2020 08:30:44 INFO pupa: save event 2020-07-28 13:30:00-04:00 City Council as event_87c8bf04-cc1f-11ea-bfc6-122a3d729da3.json
07/22/2020 08:30:44 WARNING legistar: API event could not be found in web interface: https://webapi.legistar.com/v1/nyc/events/18040
07/22/2020 08:30:44 WARNING legistar: API event could not be found in web interface: https://webapi.legistar.com/v1/nyc/events/18041
07/22/2020 08:30:44 INFO scrapelib: GET - https://legistar.council.nyc.gov/View.ashx?M=IC&ID=798357&GUID=3393142C-96BB-4845-AC8E-6657EBC093B0
/home/datamade/.virtualenvs/opencivicdata/lib/python3.5/site-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
07/22/2020 08:30:44 INFO scrapelib: GET - https://webapi.legistar.com/v1/nyc/events/18039/eventitems
07/22/2020 08:30:45 INFO scrapelib: GET - https://webapi.legistar.com/v1/nyc/events/18039/eventitems
07/22/2020 08:30:51 INFO pupa: save event 2020-07-23 14:30:00-04:00 Democratic Conference of the Council of the City of New York as event_8bb5bf54-cc1f-11ea-bfc6-122a3d729da3.json
07/22/2020 08:30:51 WARNING legistar: API event could not be found in web interface: https://webapi.legistar.com/v1/nyc/events/18043
07/22/2020 08:30:51 WARNING legistar: API event could not be found in web interface: https://webapi.legistar.com/v1/nyc/events/18044
07/22/2020 08:30:53 ERROR pupa: cannot resolve pseudo id to Bill: ~{"identifier": "T2020-6410"}
nyc (scrape, import)
  events: {'window': '0.05'}
Not checking sessions...
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
nyc (scrape, import)
  events: {'window': '0.05'}
events scrape:
  duration:  0:00:26.450395
  objects:
    event: 2
jurisdiction scrape:
  duration:  0:00:00.210824
  objects:
    jurisdiction: 1
    organization: 4
    post: 51
import:
  event: 0 new 0 updated 2 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 4 noop
  post: 0 new 0 updated 51 noop
hancush commented 4 years ago

Potentially related to https://github.com/opencivicdata/scrapers-us-municipal/issues/328.

hancush commented 4 years ago

I unzipped the most recent log archive:

ubuntu@ip-10-0-0-80:~$ gunzip /tmp/lametro.log.2 -k -c > lametro.log.2

The last Metro scrape we have logs for ran on 7/18:

ubuntu@ip-10-0-0-80:~$ tail -n 100 lametro.log.2
07/18/2020 18:51:32 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6858
07/18/2020 18:51:32 INFO pupa: save bill 2020-0416 in 2019 as bill_9b266a2c-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:32 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6863
07/18/2020 18:51:33 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6863
07/18/2020 18:51:33 INFO pupa: save bill 2020-0421 in 2019 as bill_9bc74096-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:33 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6616
07/18/2020 18:51:34 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6616
07/18/2020 18:51:34 INFO pupa: save bill 2020-0174 in 2019 as bill_9c572b84-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:34 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6877
07/18/2020 18:51:35 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6877
07/18/2020 18:51:35 INFO pupa: save bill 2020-0435 in 2019 as bill_9cf548f0-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:35 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6878
07/18/2020 18:51:36 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6878
07/18/2020 18:51:36 INFO pupa: save bill 2020-0436 in 2019 as bill_9d88a744-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:36 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6893
07/18/2020 18:51:37 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6893
07/18/2020 18:51:37 INFO pupa: save bill 2020-0451 in 2019 as bill_9e26f4f8-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:37 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6906
07/18/2020 18:51:38 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6906
07/18/2020 18:51:38 INFO pupa: save bill 2020-0464 in 2020 as bill_9ebe73fa-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:38 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6721
07/18/2020 18:51:39 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6721
07/18/2020 18:51:39 INFO pupa: save bill 2020-0279 in 2019 as bill_9f55e4ec-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:39 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6913
07/18/2020 18:51:40 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6913
07/18/2020 18:51:40 INFO pupa: save bill 2020-0471 in 2020 as bill_9ff07c00-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:40 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6924
07/18/2020 18:51:41 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6924
07/18/2020 18:51:41 INFO pupa: save bill 2020-0482 in 2020 as bill_a089326a-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:41 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6918
07/18/2020 18:51:42 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6918
07/18/2020 18:51:42 INFO pupa: save bill 2020-0476 in 2020 as bill_a129eeb2-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:42 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6925
07/18/2020 18:51:43 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6925
07/18/2020 18:51:43 INFO pupa: save bill 2020-0483 in 2020 as bill_a1bd175a-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:43 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6898
07/18/2020 18:51:44 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6898
07/18/2020 18:51:44 INFO pupa: save bill 2020-0456 in 2019 as bill_a25853f0-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:44 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6927
07/18/2020 18:51:45 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6927
07/18/2020 18:51:45 INFO pupa: save bill 2020-0485 in 2020 as bill_a2f4b24a-c951-11ea-82d5-122a3d729da3.json
07/18/2020 18:51:45 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6928
07/18/2020 18:51:46 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6928
lametro (scrape)
  bills: {'window': '0.05'}
Not checking sessions...
lametro (scrape)
  bills: {'window': '0.05'}
bills scrape:
  duration:  0:01:29.129319
  objects:
    bill: 71
jurisdiction scrape:
  duration:  0:00:01.172966
  objects:
    jurisdiction: 1
    organization: 3
    post: 18
lametro (import)
  bills: {}
  events: {}
  people: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
  bills: {}
  events: {}
  people: {}
import:
  bill: 0 new 0 updated 71 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop
  post: 0 new 0 updated 18 noop
lametro (import)
  people: {}
  bills: {}
  events: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
  people: {}
  bills: {}
  events: {}
import:
  bill: 0 new 0 updated 71 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop
  post: 0 new 0 updated 18 noop

I'm not certain, but when I was spelunking re: https://github.com/opencivicdata/scrapers-us-municipal/issues/328, I checked the max updated date on bills in the Councilmatic database, and it may have been 7/18. Are the scrapes not actually running? We break the Metro scrapes into scrape and import steps, so I would at least expect to see logs for the scrapes, even if the imports fail. Very curious...

hancush commented 4 years ago

We last deployed the scrapers at 9 a.m. 7/17, which is 2 p.m. UTC time. Per the logs, scrapes continued to run through Saturday, 7/18, before "stopping" (or not being logged). So I'm not sure that this was caused by a change to the deployed code.

hancush commented 4 years ago

FWIW, the last scrape coincides with the Saturday windowed scraping schedule:

https://github.com/datamade/scrapers-us-municipal/blob/27df297174b169127e808dfb94bd3ce5a650c981/scripts/la-metro-crontask#L67-L74

It's as if scheduling for the new week never kicked in?

hancush commented 4 years ago

This would seem to be the scrape where a huge number of bills were updated:

07/18/2020 00:55:23 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6898
07/18/2020 00:55:23 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6898
07/18/2020 00:55:23 INFO pupa: save bill 2020-0456 in 2019 as bill_450cf4da-c8bb-11ea-b551-122a3d729da3.json
07/18/2020 00:55:23 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6928
07/18/2020 00:55:23 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6928
07/18/2020 00:55:23 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6927
07/18/2020 00:55:23 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6927
07/18/2020 00:55:23 INFO pupa: save bill 2020-0485 in 2020 as bill_45139eca-c8bb-11ea-b551-122a3d729da3.json
lametro (scrape)
  bills: {'window': '0'}
Not checking sessions...
lametro (scrape)
  bills: {'window': '0'}
bills scrape:
  duration:  0:50:18.000070
  objects:
    bill: 2577
jurisdiction scrape:
  duration:  0:00:00.134372
  objects:
    jurisdiction: 1
    organization: 3
    post: 18
lametro (import)
  events: {}
  bills: {}
  people: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
  events: {}
  bills: {}
  people: {}
import:
---------->  bill: 0 new 1906 updated 671 noop <----------
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop
  post: 0 new 0 updated 18 noop
lametro (import)
  people: {}
  bills: {}
  events: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
  people: {}
  bills: {}
  events: {}
import:
  bill: 0 new 1906 updated 671 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop
  post: 0 new 0 updated 18 noop
hancush commented 4 years ago

Confirming that both of the public bills Shelly provided as examples had inaccessible gateway links during the implicated scrape:

ubuntu@ip-10-0-0-80:~$ grep 07/18/2020 lametro.log.3 | grep 6217
07/18/2020 00:51:52 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6217
07/18/2020 00:51:52 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6217
ubuntu@ip-10-0-0-80:~$ grep 07/18/2020 lametro.log.3 | grep 6573
07/18/2020 00:53:28 INFO scrapelib: HEAD - https://metro.legistar.com/gateway.aspx?m=l&id=6573
07/18/2020 00:53:28 WARNING legistar: Bill could not be found in web interface: https://webapi.legistar.com/v1/metro/matters/6573

Cc https://github.com/opencivicdata/scrapers-us-municipal/issues/328

hancush commented 4 years ago

Hm, it seems like there are a bunch of Metro scrapes just chilling on the OCD server:

ubuntu@ip-10-0-0-80:~$ ps --sort=start_time aux | grep [l]ametro
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
datamade  9472  0.0  0.0   4448   104 ?        Ss   Jul18   0:00 /bin/sh -c /usr/bin/flock -n /tmp/metroevents.lock -c $APPDIR/scripts/lametro/fast-full-event-scrape.sh >> /tmp/lametro.log
datamade  9474  0.0  0.0   5904    84 ?        S    Jul18   0:00 /usr/bin/flock -n /tmp/metroevents.lock -c /home/datamade/scrapers-us-municipal/scripts/lametro/fast-full-event-scrape.sh
datamade  9475  0.0  0.0   4448   104 ?        S    Jul18   0:00 /bin/sh -c /home/datamade/scrapers-us-municipal/scripts/lametro/fast-full-event-scrape.sh
datamade  9476  0.0  0.0   4448   100 ?        S    Jul18   0:00 /bin/sh /home/datamade/scrapers-us-municipal/scripts/lametro/fast-full-event-scrape.sh
datamade  9480  0.0  7.3 235912 151584 ?       S    Jul18   0:44 /home/datamade/.virtualenvs/opencivicdata/bin/python3.5 /home/datamade/.virtualenvs/opencivicdata/bin/pupa update --datadir=/cache/events/_data/ lametro --scrape events --rpm=0
datamade 30949  0.0  0.0   4448   108 ?        Ss   Jul19   0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 30951  0.0  0.0   5904    92 ?        S    Jul19   0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 30955  0.0  0.0   5904    68 ?        S    Jul19   0:00 /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 29715  0.0  0.0   4448   108 ?        Ss   Jul20   0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 29721  0.0  0.0   5904    64 ?        S    Jul20   0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 27298  0.0  0.0   4448   104 ?        Ss   Jul21   0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 27301  0.0  0.0   5904    72 ?        S    Jul21   0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 18780  0.0  0.0   4448   268 ?        Ss   Jul22   0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 18781  0.0  0.0   5904    68 ?        S    Jul22   0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh
datamade 24007  0.0  0.0   4448   268 ?        Ss   00:05   0:00 /bin/sh -c /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock $APPDIR/scripts/lametro/full-scrape.sh >> /tmp/lametro.log
datamade 24012  0.0  0.0   5904    68 ?        S    00:05   0:00 /usr/bin/flock /tmp/metrobills.lock /usr/bin/flock /tmp/metroevents.lock /home/datamade/scrapers-us-municipal/scripts/lametro/full-scrape.sh

This would explain why there are no logs – old file locks are preventing the new scrapes from running. How did this happen? And is it safe to terminate them?

hancush commented 4 years ago

We use flock to manage locking files. Usually, we do this with the -n flag so commands fail, rather than block, if they can't acquire a lock.

However! It looks like the full scrape does not have this option:

https://github.com/datamade/scrapers-us-municipal/blob/27df297174b169127e808dfb94bd3ce5a650c981/scripts/la-metro-crontask#L16-L18

So, that might explain the build up of full scrapes. But why are the windowed scrapes hanging out???

More on flock: https://linux.die.net/man/1/flock

hancush commented 4 years ago

Oh, there's just the one fast full event scrape causing the bottleneck... So why is that hung????

hancush commented 4 years ago

Decided that it wasn't worth the time investment to investigate the bottleneck since we are moving away from this pattern for Metro. Instead, I killed each of the blocking processes with sudo kill -9, then (superstitiously) cleared the locks with flock -u /tmp/${LOCKNAME}.lock echo 'dropped (-u is unlock).

Hark! The next regularly scheduled event scrape ran and logged as expected.