18F / FAC-Distiller

Federal Audit Clearing House Distiller

Bulk PDF processing #100

Open cantsin opened 4 years ago

cantsin commented 4 years ago

User story

As a user of Distiller, I want PDFs to be pre-processed so that we can see the PDF extraction results and compare them against mismatches or missing data in the audit files.

We have over 250k audits for 2019 alone. Running PDF processing tasks for all of these files on cloud.gov would be problematic, so this issue is about discussing strategies to bulk-process PDFs on a separate machine (or several) and then upload the results to the Distiller database (perhaps by way of S3).
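
A minimal sketch of the upload half of that strategy, assuming the extraction results land as per-file CSVs; the bucket name and key prefix here are hypothetical, not agreed project settings. A processing machine would push its outputs to S3 and a Distiller job would ingest them from there:

```python
# Sketch only: bucket and prefix are placeholders, not actual project settings.
from pathlib import Path

import boto3

BUCKET = "fac-distiller-extraction"   # hypothetical bucket name
PREFIX = "extraction-results/2019/"   # hypothetical key prefix


def upload_results(results_dir: str) -> None:
    """Upload every extraction CSV produced locally to S3 for later ingestion."""
    s3 = boto3.client("s3")
    for csv_path in Path(results_dir).glob("*.csv"):
        key = PREFIX + csv_path.name
        s3.upload_file(str(csv_path), BUCKET, key)
        print(f"uploaded {csv_path} -> s3://{BUCKET}/{key}")


if __name__ == "__main__":
    upload_results("audits")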

Processing a random sample of 12 files takes about 50 seconds on my machine (3.2 GHz i7-6900K) and uses approximately 700 MB of RAM. Extrapolating, 250k files would take about 290 hours (12 days of processing), assuming we don't run out of RAM along the way. We would likely want to divide and conquer (parallelize) this dataset. Historical audit data will undoubtedly require even more processing time.
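
For reference, the extrapolation works out roughly as follows; the worker count in the last two lines is purely illustrative:

```python
# Back-of-the-envelope estimate based on the 12-file sample run below.
sample_files = 12
sample_seconds = 50      # ~50 s wall clock for the sample
total_files = 250_000    # 2019 audits alone

seconds_per_file = sample_seconds / sample_files      # ~4.2 s per file
total_hours = total_files * seconds_per_file / 3600   # ~289 hours, ~12 days
print(f"single process: {total_hours:.0f} hours")

workers = 16                                          # hypothetical worker count
print(f"{workers} workers: {total_hours / workers:.0f} hours")
```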

Proposed strategy:

Acceptance criteria

Outstanding questions

Implementation notes

Single-process run to test analysis times:

time (for file in audits/*.pdf
do
  python -m distiller.extraction.analyze $file --errors --csv $file.csv
done)
51.06s user 12.75s system 130% cpu 48.938 total
ls -l audits/*.pdf | wc -l
12
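
A sketch of how the same per-file command could be fanned out across cores on a dedicated machine; the directory and worker count are illustrative, and this simply shells out to the analyze module shown above in parallel:

```python
# Sketch only: the audits/ directory and worker count are illustrative.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def analyze(pdf: Path) -> int:
    """Run the extraction CLI on one PDF, writing <pdf>.csv alongside it."""
    cmd = [
        "python", "-m", "distiller.extraction.analyze",
        str(pdf), "--errors", "--csv", f"{pdf}.csv",
    ]
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    pdfs = sorted(Path("audits").glob("*.pdf"))
    # One subprocess per PDF, fanned out across a pool of workers.
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(zip(pdfs, pool.map(analyze, pdfs)))
    failures = [p for p, rc in results if rc != 0]
    print(f"{len(results) - len(failures)} succeeded, {len(failures)} failed")
```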
bpdesigns commented 4 years ago

@danielnaab To break this down into manageable chunks, I propose the following phases (a rough selection sketch follows the list):

  1. processing the 2019 PDFs from NSF, ED, and DOT audits with 3 or fewer findings
  2. then all PDFs from NSF, ED, and DOT audits with findings
  3. then audits from ALL agencies with 3 or fewer findings
  4. then ALL PDFs with findings
  5. then ALL PDFs
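
For concreteness, a hypothetical sketch of that phased selection, assuming the audit metadata is available with agency, year, and findings-count attributes (all names here are illustrative stand-ins, not actual Distiller fields):

```python
# Illustrative only: the `audits` rows and their field names are hypothetical
# stand-ins for whatever metadata Distiller exposes, not actual model fields.
TARGET_AGENCIES = {"NSF", "ED", "DOT"}


def phase_1(audits):
    """2019 PDFs from NSF, ED, and DOT audits with 3 or fewer findings."""
    return [a for a in audits
            if a["year"] == 2019
            and a["agency"] in TARGET_AGENCIES
            and a["findings_count"] <= 3]


def phase_2(audits):
    """All NSF, ED, and DOT audit PDFs with findings."""
    return [a for a in audits
            if a["agency"] in TARGET_AGENCIES and a["findings_count"] > 0]

# Phases 3-5 widen the same filters: all agencies with <= 3 findings,
# then all PDFs with findings, then every PDF.
```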