CivicSpleen / ambry

A comprehensive data package manager
BSD 2-Clause "Simplified" License

Implement reading HTML tables as a source #120

Closed nmb10 closed 8 years ago

nmb10 commented 8 years ago

run:

python load_pre10.py ../../pre-10-bundles/converted/ssa.gov-oasdi_trustee_report/ssa.gov-oasdi_trustee_report.yaml

error:

Starting import ../../pre-10-bundles/converted/ssa.gov-oasdi_trustee_report/ssa.gov-oasdi_trustee_report.yaml...
Loading bundle: ssa.gov-oasdi_trustee_report-0.0.2~d03e002
INFO ssa.gov-oasdi_trustee_report ---- Synchronized ----
Starting ingest...
INFO ssa.gov-oasdi_trustee_report Ingesting: IVB1 from http://www.ssa.gov/oact/tr/2014/lr4b1.html
Traceback (most recent call last):
  File "load_pre10.py", line 252, in <module>
    main()
  File "load_pre10.py", line 220, in main
    _ingest(b)
  File "load_pre10.py", line 177, in _ingest
    b.ingest(force=force, clean_files=clean_files)
  File "/home/nmb10/projects/ambry_project/ambry/bundle/bundle.py", line 873, in ingest
    clean=force, account_accessor=account_accessor)
  File "/home/nmb10/.virtualenvs/ambry/local/lib/python2.7/site-packages/ambry_sources/download.py", line 76, in get_source
    .format(spec.name, file_type))
ambry_sources.sources.exceptions.SourceError: Failed to determine file type for source 'IVB1'; unknown type 'html'

ericbusboom commented 8 years ago

This bundle has a bundle.py file that re-implements source loading, with specific support for reading HTML tables. We'll need to implement this in ambry_sources.

The bundle uses pandas.read_html() to read the table data.
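
For discussion, here is a rough sketch of what yielding rows from an HTML table could look like. It is not the bundle's code and not an existing ambry_sources API: `TableRowParser` and `iter_html_rows` are hypothetical names, and it swaps `pandas.read_html()` for the standard library's `html.parser` so the example has no extra dependencies. It is written for Python 3, handles only simple tables (no `rowspan`/`colspan`, no nesting), and uses made-up sample data.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document.

    Hypothetical helper for illustration; not part of ambry_sources.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, None outside <tr>
        self._cell = None   # text fragments of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> or <th>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def iter_html_rows(html_text):
    """Return an iterator over table rows, header row first (assumed layout)."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return iter(parser.rows)


# Made-up sample table, standing in for a downloaded page.
SAMPLE = """
<table>
  <tr><th>Year</th><th>Rate</th></tr>
  <tr><td>2013</td><td>12.4</td></tr>
  <tr><td>2014</td><td>12.4</td></tr>
</table>
"""

rows = list(iter_html_rows(SAMPLE))
```

A real source would probably still want `pandas.read_html()` or a similar library for messy pages, since this sketch flattens all tables on the page into one row list and ignores spanning cells.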