This PR extracts the remaining fields (5 through 14) from the financial disclosure form and writes them to CSVs.
Closes #20
Notes
Occasionally, value errors occurred when looking up fields that get pushed onto the second page. This was resolved by passing a table_settings param to the extract_table() method so that adjacent tables are joined.
It turns out that the extract method used in parse_pdf only extracts the largest table present on the page, and a table is considered to be a rectangle with two or more cells. If the fields on the last page made up a smaller table than the signature table, or if that table was just one cell, they were ignored. Setting intersection_tolerance visually joins the tables on the last page so that those fields are always extracted, regardless of size or content.
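To illustrate the table_settings change, here is a minimal sketch rather than the code in this PR. The helper name extract_joined_table and the tolerance value of 10 are assumptions made for illustration; the only requirement on page is that it expose an extract_table(table_settings=...) method, as pdfplumber pages do.

```python
# Hypothetical settings: the tolerance of 10 is an assumed value, not the
# one used in this PR. A larger intersection_tolerance lets nearby edges
# count as intersecting, so neighbouring tables are detected as one table.
ASSUMED_TABLE_SETTINGS = {"intersection_tolerance": 10}


def extract_joined_table(page, table_settings=None):
    """Return the largest table on `page` as a list of rows, or [] if none.

    `page` is expected to behave like a pdfplumber page, i.e. to provide
    extract_table(table_settings=...), which returns None when the page
    contains no table.
    """
    if table_settings is None:
        table_settings = ASSUMED_TABLE_SETTINGS
    table = page.extract_table(table_settings=table_settings)
    return table or []
```

With pdfplumber, this could be called as extract_joined_table(pdf.pages[-1]) inside a with pdfplumber.open(path) as pdf: block.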
The signature section was also occasionally treated as part of field 14, so a conditional was added to _is_section() to ensure that the signature is always its own section.
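The kind of conditional described above might look like the following sketch. It is not the actual _is_section() from this PR: the function name is_section, the assumption that section headings start with a number and a period, and the assumption that the signature block starts with the word "signature" are all hypothetical.

```python
import re

# Assumed heading format: a one- or two-digit field number followed by a
# period, e.g. "14. ...". The real form may label its fields differently.
SECTION_NUMBER = re.compile(r"^\d{1,2}\.")


def is_section(text):
    """Return True if `text` looks like the start of a new section."""
    if text is None:
        return False
    stripped = text.strip()
    # Hypothetical signature check: treat the signature block as its own
    # section so it is never appended to the preceding field.
    if stripped.lower().startswith("signature"):
        return True
    return bool(SECTION_NUMBER.match(stripped))
```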
Testing Instructions
Run make data/processed/filing_status.csv
Ensure that the command runs without error and that the generated CSVs contain the expected values.
Optionally, reduce the number of files it scrapes to get results sooner: change the scraper's _filers method to keep a count that increments in the else clause and breaks after 5 or so iterations.
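The temporary limit could be sketched as below. This is a stand-in and not the scraper's real _filers method: the function name limited_filers, the rows argument, and the "filing_url" key used to decide which rows to skip are all hypothetical.

```python
def limited_filers(rows, limit=5):
    """Yield at most `limit` usable rows, then stop early.

    `rows` is any iterable of dicts. The "filing_url" key is an assumed
    stand-in for whatever condition the real method checks.
    """
    count = 0
    for row in rows:
        if not row.get("filing_url"):
            # Skipped rows do not count toward the limit.
            continue
        else:
            count += 1
            yield row
            if count >= limit:
                break
```

Remove the counter again before merging so that the full data set is scraped.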