JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.
Creative Commons Zero v1.0 Universal

Ligatures #5

Open divergentdave opened 7 years ago

divergentdave commented 7 years ago

Many PDF authoring suites replace "fi", etc. with ~dipthong~ ligature characters or glyphs. This may require special handling, either in the library or in the calling code, to avoid false negatives.

JoshData commented 7 years ago

Ligatures (not diphthong).

In the PDFs I was testing on, I saw the "fi" ligature as an entry in a CMap table (= a single glyph) that was mapped to a sequence of two Unicode characters ("f" "i"). It wasn't apparent from the PDF spec that this was even possible. So the module already handles that case: it sees the ligature as two characters.
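As a rough illustration of that case (a sketch only, not the module's actual code): the dictionary below is a hypothetical stand-in for a font's CMap table, and `decode_glyphs` is a made-up helper name. It shows how one glyph code can decode to two characters, so that a plain-text search for "fi" still matches.

```python
# Hypothetical glyph-code -> Unicode mapping, standing in for a font's CMap table.
# The codes are invented for illustration; a real PDF defines its own.
to_unicode = {
    0x01: "f",
    0x02: "i",
    0x03: "n",
    0x04: "d",
    0x0C: "fi",  # a single ligature glyph mapped to two Unicode characters
}

def decode_glyphs(codes, cmap):
    """Decode a run of glyph codes into text using the mapping above."""
    return "".join(cmap[code] for code in codes)

# Three glyphs on the page (ligature "fi", then "n", then "d") ...
glyph_run = [0x0C, 0x03, 0x04]
text = decode_glyphs(glyph_run, to_unicode)

# ... decode to four characters, so searching for "fi" finds the ligature.
print(text, len(glyph_run), len(text))
```

One consequence is that the number of decoded characters no longer equals the number of glyphs, which matters if character positions are ever mapped back to glyph positions.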

I suppose it's also possible that they might be encoded as precomposed Unicode characters. The way around that, I think, would be to apply Unicode NFKC normalization, which would expand ligatures (and re-compose other characters).
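A minimal sketch of that idea, using Python's standard `unicodedata` module. The helper name `normalize_for_matching` and the sample strings are assumptions for illustration; whether normalization belongs in the library or in the calling code is still an open question here.

```python
import unicodedata

def normalize_for_matching(text):
    """Apply NFKC so precomposed ligatures expand to their plain letters."""
    return unicodedata.normalize("NFKC", text)

# U+FB01 is the precomposed "fi" ligature; U+FB02 is the "fl" ligature.
extracted = "con\ufb01dential \ufb02ight"
normalized = normalize_for_matching(extracted)

# Before normalization, a search for the plain letters misses the ligature.
print("fi" in extracted, "fi" in normalized)

# NFKC also re-composes: "e" followed by a combining acute accent becomes U+00E9.
recomposed = normalize_for_matching("e\u0301")
print(len("e\u0301"), len(recomposed))
```

Normalization can change the length of the string, so offsets found in the normalized text do not necessarily line up with offsets in the original text.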

divergentdave commented 7 years ago

Oops, yes, that.