HazyResearch / pdftotree

A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Convert CIDs to Unicode #33

Open lukehsiao opened 6 years ago

lukehsiao commented 6 years ago

PDFMiner gives us a bunch of CID characters in our output (e.g., 25(cid:176) C instead of 25° C). It would be great to be able to convert these to their respective Unicode characters before outputting. Some potentially useful references: [1], [2].

[1] https://stackoverflow.com/questions/24089245/decode-cid-font-codes-to-equivalent-ascii-characters
[2] https://github.com/adobe-type-tools/cmap-resources/

Update: after looking into it, it seems that PDFMiner actually tries to take care of this, but poorly created PDFs that don't include all of the necessary information cannot always be converted to Unicode. See https://github.com/pdfminer/pdfminer.six/issues/35.
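For cases like the example above, a naive fallback could be sketched as follows. This assumes the CID number happens to coincide with the Unicode code point (176 is U+00B0, the degree sign), which is not true in general, since the real mapping depends on the font's CMap. The function name `replace_cids_naively` is hypothetical and not part of pdftotree or PDFMiner.

```python
import re

# Matches the "(cid:N)" markers that show up in the extracted text.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids_naively(text: str) -> str:
    """Replace each (cid:N) marker with chr(N).

    ASSUMPTION: N is treated as a Unicode code point. This holds for the
    (cid:176) example but is not guaranteed for arbitrary fonts.
    """

    def _substitute(match: re.Match) -> str:
        code = int(match.group(1))
        # Leave the marker untouched if N is outside the Unicode range.
        if code > 0x10FFFF:
            return match.group(0)
        return chr(code)

    return CID_PATTERN.sub(_substitute, text)


print(replace_cids_naively("25(cid:176) C"))  # 25° C
```

A proper fix would need the font's own CID-to-Unicode mapping, so this is only useful as a best-effort cleanup step.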

lukehsiao commented 6 years ago

It may also be that we are not using pdfminer to the fullest. For example, (cid:176) appears in the glyphlist [1], which makes me wonder why it still appears in our output.

This makes parsing problematic: we get a single set of coordinates for the "character", but the text passed along is the literal string (cid:%d), which downstream code treats as several characters rather than one.

https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/converter.py#L127
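To illustrate the mismatch, here is a rough sketch of how text could be split so that each (cid:N) marker counts as one unit, matching the one set of coordinates it comes with. The helper `split_into_glyphs` is hypothetical, and the sketch assumes the marker text never occurs literally in a document.

```python
import re

# Either a whole "(cid:N)" marker or any single character.
# ASSUMPTION: a literal "(cid:176)" typed in a document would be
# misread as a marker by this pattern.
GLYPH_PATTERN = re.compile(r"\(cid:\d+\)|.", re.DOTALL)


def split_into_glyphs(text: str) -> list:
    """Split text into units that each correspond to one character box."""
    return GLYPH_PATTERN.findall(text)


glyphs = split_into_glyphs("25(cid:176) C")
print(glyphs)       # ['2', '5', '(cid:176)', ' ', 'C']
print(len(glyphs))  # 5, whereas len("25(cid:176) C") is 13
```

Counting units this way keeps the number of text units equal to the number of character boxes, even when the marker itself cannot be decoded.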

lukehsiao commented 6 years ago

Right now we're just using a regex in the Fonduer parser to replace each CID with a wildcard character ($, at the moment). In an ideal world, we could fix this here, though.
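A minimal sketch of that kind of replacement is shown here. It is not the actual Fonduer code; the pattern and the function name `mask_cids` are assumptions for illustration.

```python
import re

# Matches the "(cid:N)" markers emitted for undecodable characters.
CID_PATTERN = re.compile(r"\(cid:\d+\)")

# Wildcard that stands in for a character that could not be decoded.
WILDCARD = "$"


def mask_cids(text: str, wildcard: str = WILDCARD) -> str:
    """Replace every (cid:N) marker with a single wildcard character."""
    # A function replacement avoids any escape handling of the wildcard.
    return CID_PATTERN.sub(lambda _match: wildcard, text)


print(mask_cids("25(cid:176) C"))  # 25$ C
```

Masking keeps one output character per marker, at the cost of losing the original symbol.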