acheong08 / insider

Financial data scraper for US governmental figures

Crawler and PDF parsing for Congress #1

Open acheong08 opened 1 year ago

acheong08 commented 1 year ago

https://disclosures-clerk.house.gov/PublicDisclosure/FinancialDisclosure

and

https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/<year>/<ID>.pdf
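A minimal sketch of how the `<year>/<ID>` template above could be filled in, in Python. The names `ptr_pdf_url` and `download_ptr_pdf` and the example ID are hypothetical and not taken from the repository; how the IDs are discovered (presumably via the first URL) is not shown here.

```python
import urllib.request

# Base path copied from the URL template in this comment.
BASE_URL = "https://disclosures-clerk.house.gov/public_disc/ptr-pdfs"


def ptr_pdf_url(year: int, doc_id: str) -> str:
    """Build the PDF URL for one report from its year and ID."""
    return f"{BASE_URL}/{year}/{doc_id}.pdf"


def download_ptr_pdf(year: int, doc_id: str, dest_path: str) -> None:
    """Fetch one PDF and write it to dest_path (hypothetical helper)."""
    with urllib.request.urlopen(ptr_pdf_url(year, doc_id)) as response:
        data = response.read()
    with open(dest_path, "wb") as fh:
        fh.write(data)


if __name__ == "__main__":
    # "00000000" is a placeholder ID, not a real document.
    print(ptr_pdf_url(2023, "00000000"))
```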

acheong08 commented 1 year ago

Congress is a mess: it publishes disclosures as PDFs, which are difficult to parse automatically. The wording is also not very specific, so extracting information will require some basic NLP.
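As a rough illustration of the simplest form that extraction step might take, here is a regex-based sketch in Python. It assumes the PDF text has already been extracted by some other tool, and the line layout it matches (a ticker in parentheses, an `MM/DD/YYYY` date, a dollar range) is a guess made for illustration, not the confirmed format of these reports. `parse_line` is a hypothetical name.

```python
import re
from typing import Optional

# Patterns for an ASSUMED line layout; the real report text may differ.
TICKER_RE = re.compile(r"\(([A-Z]{1,5})\)")
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
AMOUNT_RE = re.compile(r"\$([\d,]+)\s*-\s*\$([\d,]+)")


def parse_line(line: str) -> Optional[dict]:
    """Return ticker, date and amount range, or None if any part is missing."""
    ticker = TICKER_RE.search(line)
    date = DATE_RE.search(line)
    amount = AMOUNT_RE.search(line)
    if not (ticker and date and amount):
        return None
    return {
        "ticker": ticker.group(1),
        "date": date.group(1),
        "amount_min": int(amount.group(1).replace(",", "")),
        "amount_max": int(amount.group(2).replace(",", "")),
    }


if __name__ == "__main__":
    # Made-up sample line, not taken from a real filing.
    print(parse_line("Example Corp (EXMP) 01/15/2023 $1,001 - $15,000"))
```

Lines that do not contain all three fields return `None`, so they can be set aside for manual review or a fuzzier pass.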

acheong08 commented 1 year ago

C: financial-pdfs P: ptr-pdfs

acheong08 commented 1 year ago

Only the periodic transaction reports (ptr-pdfs) are relevant.

acheong08 commented 1 year ago

There are some annoying dependency issues. Instructions for fixing them are in the tooling folder.