datamade / hearings

How do Congressional Hearings Change?
MIT License
1 stars 0 forks source link

Nightly scrape setup #13

Open beamalsky opened 4 years ago

beamalsky commented 4 years ago

A continuation of https://github.com/datamade/hearings/issues/11

From SOW:

1.7. Daily Scraping From the previous phase of work, DataMade has built a data pipeline to build the database of hearings. DataMade will set up this pipeline to run daily so hearings that are added or modified on govinfo.gov or the House of Representatives’ Document Repository will be added to the Center’s database. Care must be taken to not integrate, not override, data entered by Lugar staff.

44 hours | $6,600 USD

beamalsky commented 4 years ago

Added the on_duplicate='ignore' flag to event.add_document (line 279 of events.py) and changed the default start_time:

if start_time is None:
            start_time = datetime.datetime(2019, 11, 1, 0, 0, tzinfo=pytz.utc)

From committee-oversight repo, following README there:

docker-compose down --volumes
docker-compose -f docker-compose.yml -f docker-compose.db-ops.yml run -e PGPASSWORD=postgres --rm dbload-dump
docker-compose -f docker-compose.yml -f docker-compose.db-ops.yml run --rm dbload-fixtures
docker-compose up

from this hearings repo:

workon hearings
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa --debug update us events --fastmode --rpm=0

Results:

us (scrape, import)
  events: {}
events scrape:
  duration:  0:02:09.494687
  objects:
    event: 664
jurisdiction scrape:
  duration:  0:00:00.004053
  objects:
    jurisdiction: 1
    organization: 3
import:
  event: 648 new 12 updated 4 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop

Committees do not seem to be assigned properly—maybe not at all? Investigate.

beamalsky commented 4 years ago

Trying again...

docker-compose down --volumes
docker-compose -f docker-compose.yml -f docker-compose.db-ops.yml run -e PGPASSWORD=postgres --rm dbload-dump
docker-compose -f docker-compose.yml -f docker-compose.db-ops.yml run --rm dbload-fixtures
docker-compose up
workon hearings
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa --debug update us events --fastmode --rpm=0

Results:

opencivicdata.legislative.models.event.Event.MultipleObjectsReturned: get() returned more than one Event -- it returned 2!
> /Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/django/db/models/query.py(403)get()
-> (self.model._meta.object_name, num)
(Pdb) u
> /Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/django/db/models/manager.py(82)manager_method()
-> return getattr(self.get_queryset(), name)(*args, **kwargs)
(Pdb) u
> /Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/events.py(55)get_object()
-> return self.model_class.objects.get(**spec)
(Pdb) print(spec)
{'name': 'TONGASS TRANSFER AND TRANSITION ACT', 'description': '', 'start_date': '1996-07-03', 'end_date': '', 'jurisdiction_id': 'ocd-jurisdiction/country:us/legislature'}
(Pdb)

Looking at the database, we already have 2 entries for this hearing:

  1. http://localhost:8000/hearing/view/ocd-event/1e16a75a-21c5-46fc-949c-23218badba0f/ entered through the web form
  2. http://localhost:8000/hearing/view/ocd-event/b8f0bf60-9050-4616-843a-6fbe04ca2e8f/ with Web and API sources
beamalsky commented 4 years ago

This is good and what we want to see! The importer is looking for an existing hearing to update, finding two, and failing because it's actually ambiguous. We don't want duplicate hearings in the database so these should be identified and fixed.

Next step: write a SQL query to identify these duplicates.

SELECT name, start_date, count(*)
FROM opencivicdata_event
GROUP BY name, start_date
HAVING COUNT(*) > 1;

This gives us 161 events with duplicate name and start_dates. From going through this a bit, some of them look like mistaken duplicates (they have the same document links and different sources, implying that they were entered through the web form after being scraped) and some are different events with the same name and start_date but different doc links, like the vaguely named Judicial Nomination hearings. Seems like these might be multiple parts of the same event, as in the case of the previous links.

In any case, I made a spreadsheet to share with Jamie to see if she's up for going through and reconciling these on the staging site:

docker exec -u postgres committeeoversight-postgres psql -d hearings -c "COPY (SELECT name, start_date, count(*) FROM opencivicdata_event GROUP BY name, start_date HAVING COUNT(*) > 1) TO STDOUT WITH CSV" > duplicate_hearings.csv