GSA / openacr

OpenACR is a digital-native Accessibility Conformance Report (ACR). Initial development is based on Section 508 requirements. The main goal is to make it possible to compare the accessibility claims of digital products and services. Structured, self-validated, machine-readable documentation makes this possible.
https://gsa.github.io/openacr/

Evaluate PDF to OpenACR format conversion #210

Open farooqzakhilwal opened 2 years ago

grugnog commented 2 years ago

I did an initial test extracting tables from PDFs using both Tabula (via the CLI) and AWS Textract. Both seem like viable options, but I need to write some simple Python loops to run across the library of VPAT PDFs and save the output to CSV, then do some filtering/munging of the CSVs and count the number of rows. We can then compare that to the expected number of rows in a VPAT as an initial assessment of quality, and dig in further by adding heuristics that split the rows into the appropriate chapters. If that looks good, we would pick the tool that works best and wrap it in a REST API that accepts a POSTed file or URL and responds with the parsed OpenACR data, which we can load into the UI.
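
For illustration, the row-count check and the chapter heuristic described above could be sketched roughly as follows. This is only a sketch under assumptions, not the code used for the test: the `CRITERION_ID` pattern, the chapter labels (`wcag`, `chapter_3`, and so on), and the function names are all hypothetical, and the expected row count is left as a caller-supplied parameter because it depends on the VPAT edition.

```python
import csv
import io
import re

# Assumption: a criterion row starts with a dotted ID such as "1.1.1" or "302.1".
CRITERION_ID = re.compile(r"^\s*(\d+(?:\.\d+)+)")


def chapter_for(criterion_id):
    """Guess a chapter label from the first number of a criterion ID."""
    first = criterion_id.split(".")[0]
    if len(first) == 1 and first in "1234":
        return "wcag"  # WCAG success criteria are numbered 1.x.x to 4.x.x
    if len(first) == 3 and first[0] in "3456":
        return "chapter_" + first[0]  # e.g. "302.1" -> Section 508 Chapter 3
    return None  # not recognised; treated as noise


def count_rows_by_chapter(csv_text):
    """Count rows that look like criteria, grouped by guessed chapter."""
    counts = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Use the first non-empty cell; skips blank rows and padding columns.
        first_cell = next((cell for cell in row if cell.strip()), "")
        match = CRITERION_ID.match(first_cell)
        if not match:
            continue  # header, page footer, or other non-criterion row
        chapter = chapter_for(match.group(1))
        if chapter is not None:
            counts[chapter] = counts.get(chapter, 0) + 1
    return counts


def row_count_ratio(counts, expected_rows):
    """Fraction of the expected rows that were found (expected_rows is supplied by the caller)."""
    return sum(counts.values()) / expected_rows
```

A ratio well below 1 would suggest the extractor missed table rows, while a ratio above 1 would suggest rows were split or duplicated. Either way it is only a first-pass signal, not a measure of whether the cell contents are correct.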

grugnog commented 2 years ago

I now have CSVs for each PDF from both the Tabula and Textract methods. I'm still working on filtering/munging and evaluating the CSVs so we can pick the best tool.
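
One possible way to turn per-PDF row counts into a tool choice is sketched below. It assumes a single expected row count for every PDF and uses mean absolute deviation as the score. The function name `pick_best_tool` and the shape of `row_counts` are hypothetical, and no real results are implied.

```python
from statistics import mean


def pick_best_tool(row_counts, expected_rows):
    """Return (best_tool, errors) using mean absolute deviation from expected_rows.

    row_counts maps a tool name to a list of per-PDF row counts.
    Assumption: every PDF is expected to yield the same number of rows.
    """
    errors = {
        tool: mean(abs(found - expected_rows) for found in counts)
        for tool, counts in row_counts.items()
    }
    # The tool with the smallest average deviation is treated as the best.
    best = min(errors, key=errors.get)
    return best, errors
```

Row counts alone cannot show whether the extracted text is accurate, so a score like this would only narrow the choice before a manual spot check.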

dmundra commented 2 years ago

Both @grugnog and I are busy with other priorities, but after the next sprint we should be freed up a little bit.