Open farooqzakhilwal opened 2 years ago
I now have CSVs for each PDF from both the Tabula and Textract methods. I'm still working on filtering/munging and evaluating the CSVs so we can pick the best tool.
Both @grugnog and I are busy with other priorities, but after the next sprint we should be freed up a little.
I did an initial test extracting tables from PDFs using both Tabula (via the CLI) and AWS Textract. Both seem like viable options, but I need to write some simple Python loops to run each tool across the library of VPAT PDFs and save the output to CSV, then do some filtering/munging of the CSVs and count the number of rows. We can then compare that count to the expected number of rows in a VPAT as an initial assessment of quality. We can dig in further by adding heuristics that split the rows into the appropriate chapters. If that looks good, we would pick the tool that works best and wrap it in a REST API that accepts a POSTed file or URL and responds with the parsed OpenACR data, which we can load into the UI.
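As a rough sketch of the row-count comparison described above (not the actual script), something like the following could work. It assumes each tool's CSVs are already saved in their own directory. The names `count_data_rows` and `score_tool` and the `expected_rows` value are hypothetical placeholders, since the real expected count depends on the VPAT edition.

```python
import csv
from pathlib import Path


def count_data_rows(csv_path):
    """Count rows that contain at least one non-blank cell.

    Note: a header row, if present, is counted like any other row.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return sum(
            1 for row in csv.reader(f) if any(cell.strip() for cell in row)
        )


def score_tool(csv_dir, expected_rows):
    """Map each CSV filename in csv_dir to (row_count, row_count / expected_rows).

    expected_rows is an assumed target supplied by the caller; a ratio near
    1.0 suggests the extraction captured roughly the right number of rows.
    """
    results = {}
    for path in sorted(Path(csv_dir).glob("*.csv")):
        n = count_data_rows(path)
        results[path.name] = (n, n / expected_rows)
    return results
```

Running `score_tool` once per tool's output directory (for example, one directory for Tabula and one for Textract) would give a per-PDF ratio to compare side by side. Row count alone says nothing about whether the cell contents are correct, so this is only a first-pass filter.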