Previously, because tokenization from PDFPlumber splits on every punctuation mark, we were seeing Documents whose symbols looked like:
>> doc.symbols:
References Mitchell P . Marcus , Beatrice Santorini , and Mary Ann Marcinkiewicz . 1993 .
>> doc.tokens:
References
Mitchell
P
.
Marcus
,
Beatrice
Santorini
,
and
Mary
Ann
Marcinkiewicz
.
1993
.
This is because PDFPlumberParser previously built the .symbols string for a Document by concatenating the strings of the detected tokens with either a space " " or a newline "\n".
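To make the old behavior concrete, here is a minimal sketch of separator-based concatenation. It is an illustration only, not the actual PDFPlumberParser code; `token_texts` is a hypothetical stand-in for the strings of the detected tokens.

```python
# Hypothetical stand-in for the per-token strings detected on one line.
token_texts = [
    "References", "Mitchell", "P", ".", "Marcus", ",", "Beatrice",
    "Santorini", ",", "and", "Mary", "Ann", "Marcinkiewicz", ".", "1993", ".",
]

# Joining every token with a space puts whitespace before each punctuation
# mark, even though the original document has none there.
symbols = " ".join(token_texts)
print(symbols)
```

Because the separator is applied unconditionally, the punctuation tokens end up detached from the words they follow.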
What we actually want is to keep this fine-grained partitioning of tokens, but have .symbols better preserve the whitespace (or lack of it) in the original document, when appropriate. Ideally:
>> doc.symbols:
References Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993.
>> doc.tokens:
References
Mitchell
P
.
Marcus
,
Beatrice
Santorini
,
and
Mary
Ann
Marcinkiewicz
.
1993
.
This fix reimplements how PDFPlumberParser pulls information out of PDFPlumber objects and stitches it together. The implementation is a bit tricky to explain, but the new tests give a sense of the functionality.
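As a rough sketch of the idea (not the implementation in this PR), one way to preserve original whitespace is to insert a space only where the page shows a horizontal gap between consecutive tokens. The names `stitch_symbols` and `gap_tolerance` are hypothetical, and the input is assumed to be a single line of word dicts carrying `text`, `x0`, and `x1`, with made-up coordinates. Newline handling is left out.

```python
from typing import Dict, List, Tuple


def stitch_symbols(
    words: List[Dict], gap_tolerance: float = 1.0
) -> Tuple[str, List[Tuple[int, int]]]:
    """Join token texts, adding a space only where the page shows a gap.

    Returns the stitched string and one (start, end) span per token.
    """
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    cursor = 0
    prev = None
    for word in words:
        # A visible horizontal gap is treated as whitespace in the source.
        if prev is not None and word["x0"] - prev["x1"] > gap_tolerance:
            pieces.append(" ")
            cursor += 1
        start = cursor
        pieces.append(word["text"])
        cursor += len(word["text"])
        spans.append((start, cursor))
        prev = word
    return "".join(pieces), spans


# Made-up coordinates: punctuation touches the preceding word.
words = [
    {"text": "Mitchell", "x0": 10.0, "x1": 50.0},
    {"text": "P", "x0": 53.0, "x1": 58.0},
    {"text": ".", "x0": 58.0, "x1": 60.0},
    {"text": "Marcus", "x0": 63.0, "x1": 95.0},
    {"text": ",", "x0": 95.0, "x1": 97.0},
]
symbols, spans = stitch_symbols(words)
tokens = [symbols[start:end] for start, end in spans]
print(symbols)
print(tokens)
```

Keeping a span per token means the fine-grained tokens can still be recovered from the stitched string, which is the property the description above asks for.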
@geli-gel good call on DictionaryWordPredictor. I'll take a look at that PR, but I don't want this one to depend on that predictor, as this is more core functionality.