datamade / committee-oversight

⚖️ Committee oversight map coding project for the Lugar Center
https://oversight-index.thelugarcenter.org/
MIT License

Set up nightly scraping #122

Open beamalsky opened 5 years ago

beamalsky commented 5 years ago

From SOW:

1.7. Daily Scraping

From the previous phase of work, DataMade has built a data pipeline to build the database of hearings. DataMade will set up this pipeline to run daily so that hearings added or modified on govinfo.gov or the House of Representatives’ Document Repository will be added to the Center’s database. Care must be taken to integrate, not override, data entered by Lugar staff.

44 hours | $6,600 USD

Recommended reading from Forest: https://source.opennews.org/articles/sane-data-updates-are-harder-you-think/

This can be broken into smaller issues once we get into it.

beamalsky commented 5 years ago

Today I ran docker-compose down --volumes, then followed the README through step 4 to get the site running.

Then I opened a new tab and ran the following:

cd hearings
workon hearings
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa update us --fastmode 

It ran for ~10 minutes and returned:

Traceback (most recent call last):
  File "/Users/beamalsky/.virtualenvs/hearings/bin/pupa", line 11, in <module>
    sys.exit(main())
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/__main__.py", line 68, in main
    subcommands[args.subcommand].handle(args, other)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 260, in handle
    return self.do_handle(args, other, juris)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 305, in do_handle
    report['scrape'] = self.do_scrape(juris, args, scrapers)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 173, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/scrape/base.py", line 111, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/Users/beamalsky/Desktop/committee-oversight/hearings/us/events.py", line 168, in scrape
    self._house_docs(uniq)
  File "/Users/beamalsky/Desktop/committee-oversight/hearings/us/events.py", line 219, in _house_docs
    self._add_house_docs(event, hearing_xml)
  File "/Users/beamalsky/Desktop/committee-oversight/hearings/us/events.py", line 281, in _add_house_docs
    media_type=doc_mime_type)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/scrape/event.py", line 130, in add_document
    media_type=media_type, on_duplicate=on_duplicate)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/scrape/base.py", line 299, in _add_associated_link
    raise ScrapeValueError("Duplicate entry in '%s' - URL: '%s'" % (collection, url))
pupa.exceptions.ScrapeValueError: Duplicate entry in 'documents' - URL: 'https://www.govinfo.gov/content/pkg/CHRG-116hhrg37917/pdf/CHRG-116hhrg37917.pdf'

beamalsky commented 5 years ago

Looking into the pupa base code that throws the error, I tried setting on_duplicate to 'ignore' in line 282 of hearings/us/events.py. (To do: investigate exactly what this does!)
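
For reference, a minimal sketch of what the two on_duplicate behaviours appear to amount to, based on the traceback above rather than on pupa's documentation. `add_link` and `DuplicateEntry` are hypothetical stand-ins, not pupa's own names: with "error" a repeated URL raises, and with "ignore" the repeat is dropped and the first entry is kept.

```python
class DuplicateEntry(ValueError):
    """Hypothetical stand-in for pupa.exceptions.ScrapeValueError."""


def add_link(collection, note, url, on_duplicate="error"):
    """Append a document link, handling a repeated URL per `on_duplicate`.

    Returns True if the link was added, False if it was skipped.
    """
    if on_duplicate not in ("error", "ignore"):
        raise ValueError("on_duplicate must be 'error' or 'ignore'")

    if any(item["url"] == url for item in collection):
        if on_duplicate == "error":
            raise DuplicateEntry(
                "Duplicate entry in 'documents' - URL: %r" % url
            )
        return False  # "ignore": keep the first entry, drop the repeat

    collection.append({"note": note, "url": url})
    return True


# The URL from the traceback, added twice to the same (illustrative) list.
docs = []
url = "https://www.govinfo.gov/content/pkg/CHRG-116hhrg37917/pdf/CHRG-116hhrg37917.pdf"
first = add_link(docs, "PDF", url)
second = add_link(docs, "PDF again", url, on_duplicate="ignore")
print(first, second, len(docs))
```

If this reading is right, "ignore" hides the crash but also hides whatever makes the House document list repeat a URL, so the second copy's note and media type are lost.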

It got further but returned a bunch of these pupa errors:

11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Commission on Security and Cooperation in Europe"}
11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Domestic and International Monetary Policy, Trade, and Technology", "parent__identifiers__identifier": "HSWM00"}
11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Committee on Energy and Commerce", "parent__identifiers__identifier": "HSWM00"}
11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee Committee on Ways and Means U.s. House of Representatives Joint With", "parent__identifiers__identifier": "HSIF00"}
11:34:46 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Gonverment Management, Information, and Technology", "parent__identifiers__identifier": "HSGO00"}
11:34:46 ERROR pupa: cannot resolve pseudo id to Organization: ~{"identifiers__identifier": "HLGW00"}
11:34:47 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Oversight and Investigations", "parent__identifiers__identifier": "HLIG00"}
11:34:47 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Energy, Natural Resources, and Infrastructure", "parent__identifiers__identifier": "SSFI00"}
11:34:48 ERROR pupa: cannot resolve pseudo id to Organization: ~{"identifiers__identifier": "SSAG00"}
11:34:48 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Readiness Meeting Jointly with Subcommittee on Seapower and Projection Forces", "parent__identifiers__identifier": "HSAS00"}
11:34:49 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Benefits and Subcommittee on Health", "parent__identifiers__identifier": "HSVR00"}
11:34:50 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on the Efficiency and Effectiveness of Federal Programs and the Federal Workforce", "parent__identifiers__identifier": "SSGA00"}
11:34:51 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Intelligence, Emerging Threats and Capabilities", "parent__identifiers__identifier": "HSAS00"}
11:34:52 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on the Constitution, Federalism, and Property Rights", "parent__identifiers__identifier": "SSJU00"}
11:34:52 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Ad Hoc Subcommittee on Disaster Recovery", "parent__identifiers__identifier": "SSGA00"}
11:34:53 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Oversight and Investigations Subcommittee Meeting Jointly with Terrorism and Unconventional Threats and Capabilities Subcommittee", "parent__identifiers__identifier": "HSAS00"}
11:34:55 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Seapower and Expeditionary Forces Subcommittee Meeting Jointly with Air and Land Forces Subcommittee", "parent__identifiers__identifier": "HSAS00"}
11:34:55 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Securities and Insurance and Investment", "parent__identifiers__identifier": "SSBK00"}
11:34:56 ERROR pupa: cannot resolve pseudo
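
To see which parent committees account for most of these failures, the log can be tallied. This is a rough sketch that assumes each complete line has the shape shown above (a `~` followed by a JSON object); `parse_unresolved` and `count_by_parent` are names made up for illustration, and truncated lines are skipped.

```python
import json
import re
from collections import Counter

# Matches the complete log lines shown above; the JSON spec follows the "~".
PSEUDO_ID_RE = re.compile(r"cannot resolve pseudo id to (\w+): ~(\{.*\})\s*$")


def parse_unresolved(log_lines):
    """Yield (model_name, spec_dict) for each complete 'cannot resolve' line."""
    for line in log_lines:
        match = PSEUDO_ID_RE.search(line)
        if not match:
            continue  # unrelated or truncated line
        try:
            yield match.group(1), json.loads(match.group(2))
        except json.JSONDecodeError:
            continue  # spec was cut off mid-object


def count_by_parent(log_lines):
    """Count unresolved organizations per parent committee identifier."""
    counts = Counter()
    for _model, spec in parse_unresolved(log_lines):
        parent = spec.get("parent__identifiers__identifier", "(no parent)")
        counts[parent] += 1
    return counts


# A few lines copied from the output above, including the truncated last one.
sample = [
    '11:34:45 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Commission on Security and Cooperation in Europe"}',
    '11:34:48 ERROR pupa: cannot resolve pseudo id to Organization: ~{"identifiers__identifier": "SSAG00"}',
    '11:34:48 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Readiness Meeting Jointly with Subcommittee on Seapower and Projection Forces", "parent__identifiers__identifier": "HSAS00"}',
    '11:34:51 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Subcommittee on Intelligence, Emerging Threats and Capabilities", "parent__identifiers__identifier": "HSAS00"}',
    "11:34:56 ERROR pupa: cannot resolve pseudo",
]
counts = count_by_parent(sample)
for parent, n in counts.most_common():
    print(parent, n)
```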

and then this final error:

Traceback (most recent call last):
  File "/Users/beamalsky/.virtualenvs/hearings/bin/pupa", line 11, in <module>
    sys.exit(main())
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/__main__.py", line 68, in main
    subcommands[args.subcommand].handle(args, other)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 260, in handle
    return self.do_handle(args, other, juris)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 307, in do_handle
    report['import'] = self.do_import(juris, args)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/cli/commands/update.py", line 215, in do_import
    report.update(event_importer.import_directory(datadir))
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/base.py", line 196, in import_directory
    return self.import_data(json_stream())
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/base.py", line 233, in import_data
    obj_id, what = self.import_item(data)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/base.py", line 257, in import_item
    obj = self.get_object(data)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/pupa/importers/events.py", line 55, in get_object
    return self.model_class.objects.get(**spec)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/Users/beamalsky/.virtualenvs/hearings/lib/python3.7/site-packages/django/db/models/query.py", line 403, in get
    (self.model._meta.object_name, num)
opencivicdata.legislative.models.event.MultipleObjectsReturned: get() returned more than one Event -- it returned 2!
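
One way to track down the two colliding events is to group the scraped records by the fields used to match them. This is only a sketch: the field names in `MATCH_FIELDS` are an assumption about what the importer's lookup uses, `find_duplicates` is a hypothetical helper, and the sample records are invented for illustration.

```python
from collections import defaultdict

# Assumption: the fields on which two scraped events count as "the same".
MATCH_FIELDS = ("name", "description", "start_date", "end_date")


def find_duplicates(events, fields=MATCH_FIELDS):
    """Return {key: [events]} for every key shared by more than one event."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event.get(field) for field in fields)
        groups[key].append(event)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Invented records: the first two collide on every match field.
events = [
    {"id": 1, "name": "Hearing A", "description": "", "start_date": "2019-05-01", "end_date": ""},
    {"id": 2, "name": "Hearing A", "description": "", "start_date": "2019-05-01", "end_date": ""},
    {"id": 3, "name": "Hearing B", "description": "", "start_date": "2019-05-02", "end_date": ""},
]
duplicates = find_duplicates(events)
for key, rows in duplicates.items():
    print(key, [row["id"] for row in rows])
```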

beamalsky commented 5 years ago

I was curious how the scraper would run on an empty database, so I ran:

docker-compose down --volumes
docker-compose up
# in a new tab
docker-compose run --rm app python manage.py migrate 

and then in my hearings tab:

export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa update us --fastmode 

Started ~12:30. It was still running when I cut it short to do some troubleshooting.

beamalsky commented 5 years ago

Another try: with the on_duplicate ignore change discussed above in place, I ran docker-compose down --volumes, then followed the README through step 4 to get the site running.

Then I opened a new tab and ran the following:

cd hearings
workon hearings
export DATABASE_URL=postgresql://postgres:postgres@localhost:32001/hearings
pupa --debug update us events --fastmode --rpm=0

Started ~4:20
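
Once a full run completes cleanly, the same invocation could be wrapped for a scheduled job. A minimal sketch, assuming a hypothetical wrapper script called by cron or a similar scheduler; `build_command` and `run_update` are made-up names, and only flags already used in this thread appear. It leaves --fastmode off by default on the assumption that a nightly run should fetch fresh pages rather than reuse a cache.

```python
import os
import subprocess


def build_command(scraper="events", fastmode=False, debug=False, rpm=None):
    """Build the pupa command line used above as a list of arguments."""
    cmd = ["pupa"]
    if debug:
        cmd.append("--debug")
    cmd += ["update", "us", scraper]
    if fastmode:
        cmd.append("--fastmode")
    if rpm is not None:
        cmd.append("--rpm=%d" % rpm)
    return cmd


def run_update(database_url, cwd="hearings", **options):
    """Run the scrape with DATABASE_URL set; raises if pupa exits non-zero."""
    env = dict(os.environ, DATABASE_URL=database_url)
    return subprocess.run(build_command(**options), cwd=cwd, env=env, check=True)


# Reproduces the command from this comment without executing it.
command = build_command(fastmode=True, debug=True, rpm=0)
print(" ".join(command))
```

Using check=True means a failed scrape surfaces as an exception, so the scheduler can report it instead of silently leaving the database stale.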

beamalsky commented 4 years ago

Relevant: #146