soldni opened 2 years ago
When you consider the underlying character stream from the PDF document, it's interesting to note that ﬁ probably corresponds to a single character in that stream. To be pedantic, I am not sure that parsing ligatures is an issue with the underlying parser at all.
Introducing this into a predictor rather than into a parser may cause unexpected alignment issues if I really care about aligning characters to the original document. An example would be wanting different x/y tolerances for detecting words (though this doesn't seem a likely concern with the current `mmda`: once I parse, I am done with tolerances as far as the library is concerned).
Does this have consequences for referencing `symbols` on a `Document`, which is just a very long string? I'm not sure!
The overall point may be moot, since `mmda` may not care about any of the above. Or, in practice, it may not be an issue because other things are not consuming the output of the word predictor.
Could a different setup be something like:

`PDF -> parse -> fix tokens as a global step -> ... -> predictors -> done`
Fixing ligatures is not so much a model as a detect-and-replace configuration. The dictionary word predictor is a predictor from one perspective, because it can be "trained" on the PDF of concern to build a local dictionary and capture words like `TensorFlow` appropriately.
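As a sketch of that detect-and-replace idea (a minimal illustration, not mmda code: the stdlib's NFKC normalization happens to expand Latin ligatures, while `ftfy.fix_text` covers the same case plus many other repairs):

```python
import unicodedata

def fix_tokens(tokens: list[str]) -> list[str]:
    """Detect-and-replace pass over parsed token strings.

    NFKC compatibility normalization expands Latin ligatures
    (U+FB01 -> "fi", U+FB02 -> "fl", ...); ftfy.fix_text would be
    a more thorough drop-in for this step.
    """
    return [unicodedata.normalize("NFKC", t) for t in tokens]

print(fix_tokens(["Veri\ufb01cation", "work\ufb02ow"]))
# -> ['Verification', 'workflow']
```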
Another thought on global token fixing is simply that most(?) models are likely to work better with `Verification` than with `Veriﬁcation` (the latter containing the single-character ﬁ ligature). Perhaps most incorporate `ftfy` as an input-processing step already, but in theory that duplicated work could be reduced.
@rauthur
> Introducing this into a predictor rather than into a parser may cause unexpected alignment issues if I really care about aligning characters to the original document.
Yes, but that's already the case with `WordPredictor`. A sequence of tokens `'lang', '-', 'uage'` mapping to the string `'language'` already has to deal with the issue that the number of characters in the word's `.text` representation differs from the `.symbols` of the original tokens.
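The mismatch is easy to see with a toy example (plain strings only; real mmda tokens also carry spans):

```python
# Three tokens' characters vs. the stitched word's text differ in length,
# so character offsets cannot be mapped one-to-one between the two.
tokens = ["lang", "-", "uage"]
word_text = "language"

symbols_len = sum(len(t) for t in tokens)
print(symbols_len, len(word_text))  # -> 9 8
```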
> Does this have consequences for referencing `symbols` on a `Document` which is just a very long string?
I'm starting to think `.symbols` isn't particularly useful & there's maybe some benefit to removing it. If one wants to obtain a `<str>` representation of a `SpanGroup`, the ideal way is through calling `.text`, which either has a default behavior of stitching together `.symbols`, or overrides it with its own saved `<str>`, as is done with `Words`.
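A sketch of that default-vs-override behavior, using hypothetical, simplified stand-ins for mmda's `Span`/`SpanGroup` (an illustration of the idea, not the library's actual API):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    start: int
    end: int

@dataclass
class SpanGroup:
    doc_symbols: str              # the Document's full symbol string
    spans: List[Span]
    _text: Optional[str] = None   # an overriding saved <str>, e.g. for Words

    @property
    def text(self) -> str:
        if self._text is not None:  # override wins (e.g. a de-hyphenated word)
            return self._text
        # default behavior: stitch together the underlying symbols
        return "".join(self.doc_symbols[s.start:s.end] for s in self.spans)

symbols = "lang- uage model"
word = SpanGroup(symbols, [Span(0, 5), Span(6, 10)], _text="language")
print(word.text)                                            # -> language
print(SpanGroup(symbols, [Span(0, 5), Span(6, 10)]).text)   # -> lang-uage
```

Note that the spans still index into the original symbols unchanged; only the saved `.text` drops the "-".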
At a higher level, this seems to boil down to: what are `tokens` vs `words`, as defined in this library? When we started the project, I wanted `tokens` to be a (mostly) subjectivity-free store of atomic PDF units as-is. So any undesirability that's in the PDF should also be in `tokens` (e.g. hyphenation of words, ligatures, weird invisible tokens, etc.). `words` was meant to be the unit that we would continually refine; it's meant to answer the question: if a human were asked about the literal content of the paper, what would they type out?
Fixing ligatures, in spirit, is doing the same thing `DictWordPredictor` is trying to do -- that is, create `words` that are useful. It's also so minor that it seems like overkill to have an entire separate `Predictor` just to handle ligatures & then have to manage alignment between `ligature_transformed_words` and `dict_word_predicted_words`. So on one hand, I can see a reason to bundle all these things up into a single `WordPredictor` that ensembles all of our best know-how to produce words. Users would also like this, as it's a lot easier.
Maybe what we should do is rename `DictWordPredictor` to just a generic `WordPredictor`. Implementation-wise, it would have separate internal methods for handling the dictionary aspect of forming words, as well as the ligature transformation. In the long run, I'm thinking more and more that this is a task for an efficient `PostEditingModel` that scans PDF tokens & outputs proposed edits to form words.
thoughts?
That's a good point.
I think the key difference with "-" is that one is removing characters but, in theory, can still index into `symbols` using each `Span`'s start/end. So the individual character indices in `symbols` still line up for spans; there is simply no longer a span that includes the "-". De-hyphenation compresses a span or discards symbols from the `Document` as not useful to meaning.
To your point, if we keep `SpanGroup.(whatever_method_reaches_doc_symbols)` returning just the original characters, then everything seems OK. My original comment forgot that we tend to build up new items as we go (first `symbols`, then `tokens`, then "words", which are potentially entirely separate).
Seems OK to skip renaming, etc. for now.
This PR adds `ftfy` to `DictionaryWordPredictor` to mitigate some issues in character parsing from pdfplumber. In short, it calls `ftfy.fix_text` to replace corrupted or low-frequency characters such as ligatures (e.g. `Veriﬁcation`, where `ﬁ` is a single character) with more common representations.