Closed by kyleclo 1 year ago
LGTM.

Will this impact the ability of `utils.stringify()` to make strings from words when the words are annotated onto the OG document?
It won't. The new document is an object internal to the word predictor, and the key is that it reuses the original document's `.symbols`. Since everything is stored as span groups, which basically point to start-end positions, as long as `doc.symbols` remains the same, there's no difference between words based on spans on the new doc vs. the old doc.
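To make that concrete, here's a minimal sketch (hypothetical names, not the actual mmda API): spans are just start-end offsets into a shared symbols string, so stringification depends only on `doc.symbols`, not on which `Document` object the span came from.

```python
class Span:
    def __init__(self, start, end):
        self.start = start
        self.end = end

def stringify(symbols, spans):
    # Hypothetical stand-in for utils.stringify(): join the text each
    # span points to in the underlying symbols string.
    return " ".join(symbols[s.start:s.end] for s in spans)

symbols = "wizard-of-oz appears here"
word_spans = [Span(0, 12), Span(13, 20), Span(21, 25)]

# Whether the spans were derived on a cleaned internal document or the
# original one makes no difference, as long as symbols is shared.
print(stringify(symbols, word_spans))  # wizard-of-oz appears here
```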
An alternative implementation would've been to modify the existing doc with something like a hidden field, e.g. `doc.annotate(_new_tokens)`, perform the operations with respect to those to derive words, then do `doc.remove(_new_tokens)` as a final step. I think there's no functional difference between the two approaches, but keeping things separate was a bit easier for debugging.
The error here is in the word predictor's `get_features()`.

At a high level, the idea of the word predictor is that it loops over each hyphen in a candidate word and predicts whether it should be kept or removed. So a word like `wizard-of-oz` would trigger two predictions, one for each hyphen. The way the model works is it looks at a window surrounding each hyphen to derive features that help with prediction. So for `wizard-of-oz`, the first hyphen's featurization is `wizard, of` and the second hyphen's featurization is `of, oz`.

The problem we're seeing is one due to tokenization. Some PDFs result in weird tokenizations where `token.text.strip() == ""`. I took a look into this, and it's not a bug introduced in our `PDFPlumberTokenizer`, but an upstream issue with just... PDFs in general, I guess. When these empty tokens are introduced, it causes situations in which you have words that end with hyphenation. For example, what if there were an empty token between `wizard-of-` and `oz`?

There are several ways to fix this. One is to fix tokenization, but we'd have to re-derive the PDFs, which is too costly at the moment. Another is to refactor & retrain the model to properly handle even empty tokenization cases. This is also too costly right now.
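Here's an illustrative sketch of the hyphen-window featurization and how an empty token breaks it (`hyphen_features` is a hypothetical helper, not the real `get_features()`):

```python
def hyphen_features(pieces):
    """pieces: the parts of a candidate word split on '-'."""
    features = []
    for i in range(len(pieces) - 1):
        # one prediction per hyphen: (left neighbor, right neighbor)
        features.append((pieces[i], pieces[i + 1]))
    return features

# Normal case: two hyphens, two featurizations.
print(hyphen_features(["wizard", "of", "oz"]))
# [('wizard', 'of'), ('of', 'oz')]

# With a weird tokenization, an empty token slips in between
# "wizard-of-" and "oz", so a hyphen's neighbor can be "":
print(hyphen_features(["wizard", "of", "", "oz"]))
# [('wizard', 'of'), ('of', ''), ('', 'oz')]
```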
So what I've done here is a bit hacky, but it's essentially: add a cleaning step at the beginning of the word predictor's `.predict()` in which we create a new `Document` that we operate over. As we catch weird tokenization issues in the future, we can add to this function. All word prediction happens on top of this new document. At the end, we return words as basically start-end spans, so the removal of tokens shouldn't affect how we perform `document.annotate(words=words)` on the original, uncleaned document.
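A rough sketch of that cleaning step (hypothetical classes and helper names, not mmda's real ones): drop empty tokens, predict over the cleaned list, and return plain start-end spans that remain valid against the original, uncleaned document.

```python
class Token:
    def __init__(self, text, start, end):
        self.text, self.start, self.end = text, start, end

def clean_tokens(tokens):
    # Drop tokens whose text is empty after stripping; as more weird
    # tokenization cases are discovered, add filters here.
    return [t for t in tokens if t.text.strip() != ""]

def predict_words(tokens):
    # Stand-in for the real predictor: emit one (start, end) span per
    # cleaned token, just for illustration.
    return [(t.start, t.end) for t in tokens]

tokens = [Token("wizard-of-", 0, 10), Token("", 10, 10), Token("oz", 10, 12)]
word_spans = predict_words(clean_tokens(tokens))

# The spans index into the ORIGINAL symbols, so annotating them onto
# the original document (doc.annotate(words=...)) still works even
# though the empty token was removed before prediction.
print(word_spans)  # [(0, 10), (10, 12)]
```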