Open beamalsky opened 4 years ago
Added the on_duplicate='ignore'
flag to event.add_document
(line 279 of events.py
) and changed the default start_time
:
if start_time is None:
start_time = datetime.datetime(2019, 11, 1, 0, 0, tzinfo=pytz.utc)
From committee-oversight
repo, following README there:
docker-compose down --volumes
docker-compose -f docker-compose.yml -f docker-compose.db-ops.yml run -e PGPASSWORD=postgres --rm dbload-dump
docker-compose -f docker-compose.yml -f docker-compose.db-ops.yml run --rm dbload-fixtures
docker-compose up
from this hearings
repo:
workon hearings
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa --debug update us events --fastmode --rpm=0
Results:
us (scrape, import)
events: {}
events scrape:
duration: 0:02:09.494687
objects:
event: 664
jurisdiction scrape:
duration: 0:00:00.004053
objects:
jurisdiction: 1
organization: 3
import:
event: 648 new 12 updated 4 noop
jurisdiction: 0 new 0 updated 1 noop
organization: 0 new 0 updated 3 noop
Committees do not seem to be assigned properly—maybe not at all? Investigate.
Trying again...
docker-compose down --volumes
docker-compose -f docker-compose.yml -f docker-compose.db-ops.yml run -e PGPASSWORD=postgres --rm dbload-dump
docker-compose -f docker-compose.yml -f docker-compose.db-ops.yml run --rm dbload-fixtures
docker-compose up
workon hearings
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa --debug update us events --fastmode --rpm=0
Results:
opencivicdata.legislative.models.event.Event.MultipleObjectsReturned: get() returned more than one Event -- it returned 2!
> /Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/django/db/models/query.py(403)get()
-> (self.model._meta.object_name, num)
(Pdb) u
> /Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/django/db/models/manager.py(82)manager_method()
-> return getattr(self.get_queryset(), name)(*args, **kwargs)
(Pdb) u
> /Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/events.py(55)get_object()
-> return self.model_class.objects.get(**spec)
(Pdb) print(spec)
{'name': 'TONGASS TRANSFER AND TRANSITION ACT', 'description': '', 'start_date': '1996-07-03', 'end_date': '', 'jurisdiction_id': 'ocd-jurisdiction/country:us/legislature'}
(Pdb)
Looking at the database, we already have 2 entries for this hearing:
This is good and what we want to see! The importer is looking for an existing hearing to update, finding two, and failing because it's actually ambiguous. We don't want duplicate hearings in the database so these should be identified and fixed.
Next step: write a SQL query to identify these duplicates.
SELECT name, start_date, count(*)
FROM opencivicdata_event
GROUP BY name, start_date
HAVING COUNT(*) > 1;
This gives us 161 events with duplicate name
and start_dates
. From going through this a bit, some of them look like mistaken duplicates (they have the same document links and different sources, implying that they were entered through the web form after being scraped) and some are different events with the same name
and start_date
but different doc links, like the vaguely named Judicial Nomination hearings. Seems like these might be multiple parts of the same event, as in the case of the previous links.
In any case, I made a spreadsheet to share with Jamie to see if she's up for going through and reconciling these on the staging site:
docker exec -u postgres committeeoversight-postgres psql -d hearings -c "COPY (SELECT name, start_date, count(*) FROM opencivicdata_event GROUP BY name, start_date HAVING COUNT(*) > 1) TO STDOUT WITH CSV" > duplicate_hearings.csv
A continuation of https://github.com/datamade/hearings/issues/11
From SOW: