soldni opened 2 years ago
When you consider the underlying character stream from the PDF document, it's interesting to note that ﬁ probably corresponds to a single character in that stream. To be pedantic, I am not sure that parsing ligatures is an issue with the underlying parser at all.
Introducing this into a predictor rather than into a parser may cause unexpected alignment issues if I really care about aligning characters to the original document. An example would be wanting different x/y tolerances for detecting words (though this doesn't seem a likely concern with the current `mmda`: once I parse, I am done with tolerances as far as the library is concerned).
Does this have consequences for referencing `symbols` on a `Document`, which is just a very long string? I'm not sure!
The overall point may be moot, since `mmda` may not care about any of the above. Or, in practice, it may not be an issue because other things are not consuming the output of the word predictor.
Could a different setup be something like:

`PDF -> parse -> fix tokens as a global step -> ... -> predictors -> done`
Fixing ligatures is not so much a model as a detect-and-replace configuration. The dictionary word predictor is a predictor from one perspective, because it can be "trained" on the PDF of concern to build a local dictionary and capture words like `TensorFlow` appropriately.
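As a sketch of that detect-and-replace idea (a minimal illustration, not mmda code: the stdlib's NFKC normalization happens to expand Latin ligatures, while `ftfy.fix_text` covers the same case plus many other repairs):

```python
import unicodedata

def fix_tokens(tokens: list[str]) -> list[str]:
    """Detect-and-replace pass over parsed token strings.

    NFKC compatibility normalization expands Latin ligatures
    (U+FB01 -> "fi", U+FB02 -> "fl", ...); ftfy.fix_text would be
    a more thorough drop-in for this step.
    """
    return [unicodedata.normalize("NFKC", t) for t in tokens]

print(fix_tokens(["Veri\ufb01cation", "work\ufb02ow"]))
# -> ['Verification', 'workflow']
```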
Another thought on global token fixing is simply that most(?) models are likely to work better with `Verification` than with `Veriﬁcation` (the latter containing the single-character ﬁ ligature). Perhaps most incorporate `ftfy` as an input-processing step already, but in theory that duplicated work could be reduced.
@rauthur
> Introducing this into a predictor rather than into a parser may cause unexpected alignment issues if I really care about aligning characters to the original document.
Yes, but that's already the case with `WordPredictor`. A sequence of tokens `'lang', '-', 'uage'` mapping to the string `'language'` already has to deal with the issue that the number of characters in the word's `.text` representation differs from the `.symbols` of the original tokens.
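The mismatch is easy to see with a toy example (plain strings only; real mmda tokens also carry spans):

```python
# Three tokens' characters vs. the stitched word's text differ in length,
# so character offsets cannot be mapped one-to-one between the two.
tokens = ["lang", "-", "uage"]
word_text = "language"

symbols_len = sum(len(t) for t in tokens)
print(symbols_len, len(word_text))  # -> 9 8
```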
> Does this have consequences for referencing `symbols` on a `Document` which is just a very long string?
I'm starting to think `.symbols` isn't particularly useful & there's maybe some benefit to removing it. If one wants to obtain a `<str>` representation of a `SpanGroup`, the ideal way is through calling `.text`, which either has a default behavior of stitching together `.symbols`, or overrides it with its own saved `<str>`, as is done with `Words`.
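A sketch of that default-vs-override behavior, using hypothetical, simplified stand-ins for mmda's `Span`/`SpanGroup` (an illustration of the idea, not the library's actual API):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    start: int
    end: int

@dataclass
class SpanGroup:
    doc_symbols: str              # the Document's full symbol string
    spans: List[Span]
    _text: Optional[str] = None   # an overriding saved <str>, e.g. for Words

    @property
    def text(self) -> str:
        if self._text is not None:  # override wins (e.g. a de-hyphenated word)
            return self._text
        # default behavior: stitch together the underlying symbols
        return "".join(self.doc_symbols[s.start:s.end] for s in self.spans)

symbols = "lang- uage model"
word = SpanGroup(symbols, [Span(0, 5), Span(6, 10)], _text="language")
print(word.text)                                            # -> language
print(SpanGroup(symbols, [Span(0, 5), Span(6, 10)]).text)   # -> lang-uage
```

Note that the spans still index into the original symbols unchanged; only the saved `.text` drops the "-".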
At a higher level, this seems to boil down to: what are `tokens` vs `words`, as defined in this library? When we started the project, I wanted `tokens` to be a (mostly) subjectivity-free store of atomic PDF units as-is. So any undesirability that's in the PDF should also be in `tokens` (e.g. hyphenation of words, ligatures, weird invisible tokens, etc.). `words` was meant to be the unit that we would continually refine; it's meant to answer the question: if a human were asked about the literal content of the paper, what would they type out?
Fixing ligatures, in spirit, is doing the same thing `DictWordPredictor` is trying to do -- that is, create `words` that are useful. It's also so minor that it seems like overkill to have an entire separate `Predictor` just to handle ligatures & then have to manage alignment between `ligature_transformed_words` and `dict_word_predicted_words`. So on one hand, I can see a reason to bundle all these things up into a single `WordPredictor` that ensembles all of our best know-how to produce words. Users would also like this, as it's a lot easier.
Maybe what we should do is rename `DictWordPredictor` to just a generic `WordPredictor`. Implementation-wise, it would have separate internal methods for handling the dictionary aspect of forming words, as well as the ligature transformation. In the long run, I'm thinking more and more that this is a task for an efficient `PostEditingModel` that scans PDF tokens & outputs proposed edits to form words.
thoughts?
That's a good point.
I think the key difference with "-" is that one is removing characters but, in theory, can still index into `symbols` using each `Span`'s start/end. So the individual character indices in `symbols` still line up for spans; there is simply no longer a span that includes the "-". De-hyphenation compresses a span or discards symbols from the `Document` as not useful to meaning.
To your point, if we keep `SpanGroup.(whatever_method_reaches_doc_symbols)` returning just the original characters, then everything seems OK. My original comment forgot that we tend to build up new items as we go (first `symbols`, then `tokens`, then "words", which are potentially entirely separate).
Seems OK to skip renaming, etc. for now.
This PR adds `ftfy` to `DictionaryWordPredictor` to mitigate some issues in character parsing from pdfplumber. In short, it calls `ftfy.fix_text` to replace corrupted or low-frequency characters such as ligatures (e.g. `Veriﬁcation`, where `ﬁ` is a single character) with more common representations.