User story
As a user of Distiller, I want PDFs to be pre-processed so that we can see the PDF extraction results and compare them against mismatches or missing data in the audit files.
We have over 250k audits for 2019 alone. It would be problematic to run PDF processing tasks for all of these files on cloud.gov, so this issue is about discussing strategies to bulk process PDFs on a separate machine (or several) and then upload the results to the Distiller database (perhaps by way of S3).
Processing a random sample of 12 files takes about 50 seconds on my machine (3.2 GHz i7-6900K CPU) and uses approximately 700 MB of RAM. Extrapolating, 250k files would take about 290 hours (12 days of processing) -- assuming we don't run out of RAM along the way. We would likely want to divide and conquer (parallelize) this dataset. Even more processing time will undoubtedly be needed for the historical audit data.
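For reference, the back-of-envelope math behind that estimate (assuming the 12-file sample is representative and processing stays single-threaded): 50 s / 12 files ≈ 4.2 s per file, so 250,000 files × 4.2 s ≈ 1.04 million seconds ≈ 290 hours ≈ 12 days.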
Proposed strategy
Split the dataset into N pieces (where N is the number of CPUs/machines available)
Have each CPU or machine run the analysis module script on its piece of the dataset and save the results to S3 (a sketch follows this list)
Run cloud.gov tasks to load these S3 results into the database (see the second sketch below)
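A minimal sketch of the processing step, reusing the `distiller.extraction.analyze` invocation from the implementation notes below. The worker count and the S3 bucket name (`distiller-extraction-results`) are placeholders:

```sh
# Sketch only: analyze every PDF, up to 8 at a time, writing one CSV per file,
# then copy the CSVs to S3. The bucket name is a placeholder.
find audits -name '*.pdf' -print0 |
  xargs -0 -P 8 -I {} python -m distiller.extraction.analyze {} --errors --csv {}.csv

aws s3 sync audits/ s3://distiller-extraction-results/2019/ \
  --exclude '*' --include '*.csv'
```

At ~4.2 s per file, 8 workers on one machine would bring the 2019 set down to roughly 36 hours; splitting the file list across several machines scales the same way.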
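For the load step, a one-off cloud.gov task could import the results once they are in S3. This is only a sketch: the app name and the `load_extraction_results` management command are hypothetical, not something that exists in Distiller today.

```sh
# Sketch only: run a one-off task against the deployed app to pull the CSVs
# from S3 into the database. App name and management command are placeholders.
cf run-task distiller \
  --command "python manage.py load_extraction_results --bucket distiller-extraction-results" \
  --name load-pdf-results
```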
Acceptance criteria
[ ] We have a way to bulk process PDF files
[ ] Documentation for the above
Outstanding questions
How large is the historical dataset?
What resources do we have available?
Can we allocate EC2 machines on AWS?
If we were more certain about the validity of FAC audit data, we could skip audits with no findings
Implementation notes
Single-process run to test analysis times:
# Analyze each sample PDF and write per-file CSV results
time (for file in audits/*.pdf
do
  python -m distiller.extraction.analyze "$file" --errors --csv "$file.csv"
done)
51.06s user 12.75s system 130% cpu 48.938 total