Open divergentdave opened 8 years ago
Ligatures (not diphthong).
In the PDFs I was testing on, the "fi" ligature appeared as a single glyph in a CMap table, mapped to a sequence of two Unicode characters ("f" and "i"). It wasn't apparent to me from the PDF spec that this was even possible. The module handles that case: it sees the ligature as two characters.
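To illustrate the one-glyph-to-two-characters mapping, here is a rough sketch in Python. It is not the module's actual code: the CMap fragment, the glyph codes, and the `parse_bfchar` helper are all made up for illustration. It relies only on the fact that destination strings in a ToUnicode CMap are hex-encoded UTF-16BE.

```python
import re

# Hypothetical fragment of a ToUnicode CMap; the glyph codes are invented.
CMAP_FRAGMENT = """
2 beginbfchar
<0C> <00660069>
<24> <0041>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Map each source glyph code to the Unicode string it stands for."""
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    for src, dst in pairs:
        # Destination strings are UTF-16BE code units written in hex.
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

glyph_to_text = parse_bfchar(CMAP_FRAGMENT)
print(glyph_to_text)
```

In this sketch the single glyph code `0x0C` decodes to the two-character string "fi", so any text extraction built on such a table never sees a ligature at all.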
I suppose it's also possible that ligatures might be encoded as precomposed Unicode characters. I think the way around that would be to apply Unicode NFKC normalization, which would expand ligatures (and recompose other characters).
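A minimal sketch of the NFKC idea, assuming Python's standard `unicodedata` module. The helper name `normalize_for_matching` and the sample strings are hypothetical, and whether this belongs in the library or in calling code is still open.

```python
import unicodedata

def normalize_for_matching(text):
    """Apply NFKC so precomposed ligatures expand to plain letters."""
    return unicodedata.normalize("NFKC", text)

# U+FB01 is the precomposed LATIN SMALL LIGATURE FI.
with_ligature = "con\ufb01dential"
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
with_combining_mark = "cafe\u0301"

expanded = normalize_for_matching(with_ligature)
recomposed = normalize_for_matching(with_combining_mark)
print(expanded, recomposed)
```

Both the search pattern and the extracted text would need the same normalization for matches to line up. Note that NFKC also folds other compatibility characters (fullwidth forms and superscript digits, for example), which may or may not be wanted.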
Oops, yes, that.
Many PDF authoring suites replace letter sequences such as "fi" with ~dipthong~ ligature characters or glyphs. This may require special handling, either in the library or in calling code, to avoid false negatives.