Open beamalsky opened 5 years ago
Today I ran docker-compose down --volumes, then followed the README through step 4 to get the site running.
Then I opened a new tab and ran the following:
cd hearings
workon hearings
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa update us --fastmode
It ran for ~10 minutes and returned:
Traceback (most recent call last):
File "/Users/beamalsky/.virtualenvs/hearings/bin/pupa", line 11, in <module>
sys.exit(main())
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/__main__.py", line 68, in main
subcommands[args.subcommand].handle(args, other)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 260, in handle
return self.do_handle(args, other, juris)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 305, in do_handle
report['scrape'] = self.do_scrape(juris, args, scrapers)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 173, in do_scrape
report[scraper_name] = scraper.do_scrape(**scrape_args)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/scrape/base.py", line 111, in do_scrape
for obj in self.scrape(**kwargs) or []:
File "/Users/beamalsky/Desktop/committee-oversight/hearings/us/events.py", line 168, in scrape
self._house_docs(uniq)
File "/Users/beamalsky/Desktop/committee-oversight/hearings/us/events.py", line 219, in _house_docs
self._add_house_docs(event, hearing_xml)
File "/Users/beamalsky/Desktop/committee-oversight/hearings/us/events.py", line 281, in _add_house_docs
media_type=doc_mime_type)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/scrape/event.py", line 130, in add_document
media_type=media_type, on_duplicate=on_duplicate)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/scrape/base.py", line 299, in _add_associated_link
raise ScrapeValueError("Duplicate entry in '%s' - URL: '%s'" % (collection, url))
pupa.exceptions.ScrapeValueError: Duplicate entry in 'documents' - URL: 'https://www.govinfo.gov/content/pkg/CHRG-116hhrg37917/pdf/CHRG-116hhrg37917.pdf'
Looking into the pupa base code that's throwing the error, I tried setting on_duplicate to ignore in line 282 of hearings/us/events.py. (Investigate exactly what this means!)
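On what the on_duplicate setting probably means: the traceback shows pupa raising ScrapeValueError when the same URL is added to an event's documents twice. My reading, not yet verified against the pupa source, is that "ignore" keeps the first entry and silently skips the repeat. The sketch below is a simplified stand-in for that behavior, not pupa's code; add_link and DuplicateLinkError are hypothetical names used only for illustration.

```python
class DuplicateLinkError(ValueError):
    """Stand-in for pupa.exceptions.ScrapeValueError in this sketch."""


def add_link(collection, url, note, on_duplicate="error"):
    """Append a link unless its URL is already present (hypothetical helper).

    on_duplicate="error":  raise on a repeated URL, as in the traceback above.
    on_duplicate="ignore": keep the first entry and skip the repeat.
    """
    if on_duplicate not in ("error", "ignore"):
        raise ValueError("on_duplicate must be 'error' or 'ignore'")
    for existing in collection:
        if existing["url"] == url:
            if on_duplicate == "error":
                raise DuplicateLinkError(f"Duplicate entry - URL: {url!r}")
            return existing  # 'ignore': leave the collection unchanged
    entry = {"url": url, "note": note}
    collection.append(entry)
    return entry


documents = []
url = "https://www.govinfo.gov/content/pkg/CHRG-116hhrg37917/pdf/CHRG-116hhrg37917.pdf"
add_link(documents, url, "Hearing PDF", on_duplicate="ignore")
add_link(documents, url, "Hearing PDF (listed twice)", on_duplicate="ignore")
print(len(documents))
```

If that reading is right, "ignore" hides the symptom but also drops the second entry's note, so it is worth checking whether the source XML really lists the same PDF twice or whether two distinct documents share one URL.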
It got further but returned a bunch of these pupa errors:
11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Commission on Security and Cooperation in Europe"}
11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Domestic and International Monetary Policy, Trade, and Technology", "parent__identifiers__identifier": "HSWM00"}
11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Committee on Energy and Commerce", "parent__identifiers__identifier": "HSWM00"}
11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee Committee on Ways and Means U.s. House of Representatives Joint With", "parent__identifiers__identifier": "HSIF00"}
11:34:46 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Gonverment Management, Information, and Technology", "parent__identifiers__identifier": "HSGO00"}
11:34:46 ERROR pupa: cannot resolve pseudo id to Organization: ~{"identifiers__identifier": "HLGW00"}
11:34:47 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Oversight and Investigations", "parent__identifiers__identifier": "HLIG00"}
11:34:47 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Energy, Natural Resources, and Infrastructure", "parent__identifiers__identifier": "SSFI00"}
11:34:48 ERROR pupa: cannot resolve pseudo id to Organization: ~{"identifiers__identifier": "SSAG00"}
11:34:48 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Readiness Meeting Jointly with Subcommittee on Seapower and Projection Forces", "parent__identifiers__identifier": "HSAS00"}
11:34:49 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Benefits and Subcommittee on Health", "parent__identifiers__identifier": "HSVR00"}
11:34:50 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on the Efficiency and Effectiveness of Federal Programs and the Federal Workforce", "parent__identifiers__identifier": "SSGA00"}
11:34:51 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Intelligence, Emerging Threats and Capabilities", "parent__identifiers__identifier": "HSAS00"}
11:34:52 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on the Constitution, Federalism, and Property Rights", "parent__identifiers__identifier": "SSJU00"}
11:34:52 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Ad Hoc Subcommittee on Disaster Recovery", "parent__identifiers__identifier": "SSGA00"}
11:34:53 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Oversight and Investigations Subcommittee Meeting Jointly with Terrorism and Unconventional Threats and Capabilities Subcommittee", "parent__identifiers__identifier": "HSAS00"}
11:34:55 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Seapower and Expeditionary Forces Subcommittee Meeting Jointly with Air and Land Forces Subcommittee", "parent__identifiers__identifier": "HSAS00"}
11:34:55 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Securities and Insurance and Investment", "parent__identifiers__identifier": "SSBK00"}
11:34:56 ERROR pupa: cannot resolve pseudo
and then this final error:
Traceback (most recent call last):
File "/Users/beamalsky/.virtualenvs/hearings/bin/pupa", line 11, in <module>
sys.exit(main())
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/__main__.py", line 68, in main
subcommands[args.subcommand].handle(args, other)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 260, in handle
return self.do_handle(args, other, juris)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 307, in do_handle
report['import'] = self.do_import(juris, args)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 215, in do_import
report.update(event_importer.import_directory(datadir))
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/base.py", line 196, in import_directory
return self.import_data(json_stream())
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/base.py", line 233, in import_data
obj_id, what = self.import_item(data)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/base.py", line 257, in import_item
obj = self.get_object(data)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/events.py", line 55, in get_object
return self.model_class.objects.get(**spec)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/django/db/models/query.py", line 403, in get
(self.model._meta.object_name, num)
opencivicdata.legislative.models.event.MultipleObjectsReturned: get() returned more than one Event -- it returned 2!
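The MultipleObjectsReturned error means the importer's lookup matched two Event rows that are already in the database. A rough way to find the offending pairs before import is to group the scraped event records by the fields the importer matches on and list any group with more than one member. The sketch below assumes those fields are name and start_date; that is a guess, and the real spec is built in pupa/importers/events.py (get_object), so KEY_FIELDS should be adjusted to match it. find_duplicate_events is a hypothetical helper and the sample records are made up.

```python
from collections import defaultdict

# Assumed lookup fields; confirm against the spec built in
# pupa/importers/events.py before relying on the result.
KEY_FIELDS = ("name", "start_date")


def find_duplicate_events(events, key_fields=KEY_FIELDS):
    """Return {key: [events]} for every key shared by more than one event."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event.get(field) for field in key_fields)
        groups[key].append(event)
    return {key: items for key, items in groups.items() if len(items) > 1}


# Made-up records standing in for scraped event JSON.
sample_events = [
    {"_id": "1", "name": "Oversight Hearing", "start_date": "2019-06-04"},
    {"_id": "2", "name": "Oversight Hearing", "start_date": "2019-06-04"},
    {"_id": "3", "name": "Budget Hearing", "start_date": "2019-06-05"},
]

duplicates = find_duplicate_events(sample_events)
for key, items in duplicates.items():
    print(key, [item["_id"] for item in items])
```

To run this against a real scrape, load each scraped event JSON file with json.load and pass the resulting list of dicts to find_duplicate_events.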
I was curious how the scraper would run on an empty database, so I ran:
docker-compose down --volumes
docker-compose up
# in a new tab
docker-compose run --rm app python manage.py migrate
and then in my hearings tab:
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa update us --fastmode
~12:30 start time, still running. Cut short to do some troubleshooting.
Another try: With the ignore mod discussed above I ran docker-compose down --volumes, then followed the README through step 4 to get the site running.
Then I opened a new tab and ran the following:
cd hearings
workon hearings
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa --debug update us events --fastmode --rpm=0
Started ~4:20
Relevant: #146
From SOW:
Recommended reading from Forest: https://source.opennews.org/articles/sane-data-updates-are-harder-you-think/
This can be broken into smaller issues once we get into it.