18F / pclob

The Privacy and Civil Liberties Oversight Board (PCLOB) website.
https://www.pclob.gov

PDF -> text conversion #8

Open · hbillings opened this issue 7 years ago

hbillings commented 7 years ago

Most of the content in the PDFs can be extracted, and then we can remove those PDFs altogether. A small number of pages need to retain their PDFs (reports, mostly). See if there's an easy way to get this text out.

toolness commented 7 years ago

@hbillings do you want me to explore automated ways of doing this, or is PCLOB ok with doing this manually?

hbillings commented 7 years ago

@toolness I copied and pasted one of the reports, and that worked fine -- I just had to put the formatting back in. You think there might be an automated way to do that?

toolness commented 7 years ago

I'm not sure, honestly, but digging into this kind of thing is something I've always wanted to learn about. I'm happy to investigate if manual copy-paste is a big hassle.

toolness commented 7 years ago

Relevant research:

toolness commented 7 years ago

Hmm, so given our current budget and timeline, I'm not actually sure how feasible this is going to be, whether we do it manually or automatically. A few of the really tiny PDFs are definitely worth converting, though--I'm thinking a handful of press releases, e.g.:

[screenshot of the press releases listing, with titles and "Read More" links]

Clicking on the titles or "Read More" links just takes you to a tiny PDF with only a few paragraphs of content, which is easy to just convert to HTML and show like all the other press releases.

However, other PDFs are much larger and/or have complex formatting. For example, the Report on the Telephone Records Program Conducted under Section 215 is 238 pages! As nice as it'd be to have all that viewable as HTML, I'm not sure we have the time and budget left to do it. 😞

toolness commented 7 years ago

I've populated a bunch of the site in #42 without extracting any text from PDFs, because just extracting the data took me a long time. We can try extracting text from PDFs later if we still have time--for now I'm just trying to get the basic content to have parity with the legacy site.

toolness commented 7 years ago

Over the weekend (during personal time) I played around with the nodejs pdf2json module a bit, and it was pretty interesting. Basically, every PDF file is a bit like a Sketch/Illustrator file: all the text in it is really a bunch of text spans. Even a single line of text in a paragraph might be broken up into a number of these spans, each with its own absolute coordinates, which means there's no semantic information about the text available at all. We have to infer all of it through heuristics, which I guess is what the pdf2text module does (I haven't used that package myself).
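To make that concrete, here's a rough sketch of the kind of heuristic I mean. It assumes pdf2json's event-based API (the exact nesting of the page data varies between versions), and "report.pdf" is just a stand-in filename:

```js
// Minimal sketch: read raw text spans with pdf2json and group them
// into lines by y coordinate -- one of the heuristics we'd need.
const PDFParser = require("pdf2json");

const parser = new PDFParser();

parser.on("pdfParser_dataError", (err) => console.error(err.parserError));

parser.on("pdfParser_dataReady", (pdfData) => {
  // Depending on the pdf2json version, pages live at the top level
  // or under formImage.
  const pages = pdfData.Pages || pdfData.formImage.Pages;

  for (const page of pages) {
    // Bucket spans by their (rounded) y coordinate, so spans sitting
    // on roughly the same baseline get treated as one line.
    const lines = new Map();
    for (const span of page.Texts) {
      const y = span.y.toFixed(1);
      // Each span's text runs are URI-encoded in the R array.
      const text = decodeURIComponent(span.R.map((r) => r.T).join(""));
      if (!lines.has(y)) lines.set(y, []);
      lines.get(y).push({ x: span.x, text });
    }

    // Sort lines top-to-bottom and spans left-to-right, then print.
    for (const [, spans] of [...lines].sort((a, b) => a[0] - b[0])) {
      spans.sort((a, b) => a.x - b.x);
      console.log(spans.map((s) => s.text).join(" "));
    }
  }
});

parser.loadPDF("report.pdf");
```

Even this simple y-coordinate bucketing would fall apart on multi-column layouts, tables, headers, and footers, which is why real converters pile on a lot more heuristics.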

That said, semantic information can be provided through additional metadata in a PDF file that maps all the visual elements to a logical structure. The result is called a Tagged PDF, and tagging is essentially what's required to make a PDF accessible. As far as I can tell, a tagged PDF would be a lot easier to reliably convert to HTML, though unfortunately the PDFs on the PCLOB site don't seem to be tagged (which, um, I think might be a 508 violation).
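If we want to check which of the site's PDFs are tagged, one quick way is poppler's pdfinfo tool, whose output includes a `Tagged:` line. A minimal sketch, assuming pdfinfo is installed ("report.pdf" is again a stand-in):

```js
// Check whether a PDF is tagged by shelling out to poppler's pdfinfo,
// whose output includes a line like "Tagged:  yes" or "Tagged:  no".
const { execFileSync } = require("child_process");

function isTagged(pdfPath) {
  const info = execFileSync("pdfinfo", [pdfPath], { encoding: "utf8" });
  const match = info.match(/^Tagged:\s+(yes|no)/m);
  return match !== null && match[1] === "yes";
}

console.log(isTagged("report.pdf") ? "tagged" : "untagged");
```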

I did notice, though, that Adobe Acrobat Reader DC offers to automatically tag untagged PDFs if it detects a screen reader. I'm not sure how accurate that tagging is, but it might be a quick way to make the PDFs 508 compliant, as well as perhaps more easily convertible to HTML.