allenai / mmda

multimodal document analysis
Apache License 2.0

Kylel/2022 10/hotfix spacing in pdfplumber symbols #166

Closed kyleclo closed 1 year ago

kyleclo commented 2 years ago

Previously, because PDFPlumber tokenization splits on every punctuation mark, we were seeing Documents whose symbols looked like:

>> doc.symbols:
References Mitchell P . Marcus , Beatrice Santorini , and Mary Ann Marcinkiewicz . 1993 .

>> doc.tokens:
References
Mitchell
P
.
Marcus
,
Beatrice
Santorini
,
and
Mary
Ann
Marcinkiewicz
.
1993
.

This happened because PDFPlumberParser previously built the .symbols string for a Document by concatenating the text of each detected token with either a space " " or a newline "\n".
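To make the old behavior concrete, here is a minimal sketch of that kind of concatenation. It is not the actual PDFPlumberParser code; the function name `join_with_whitespace` and the list-of-lines input are assumptions made for illustration.

```python
# Hypothetical sketch of the OLD behavior (not the real PDFPlumberParser code):
# every token is separated by a space, every line by a newline,
# regardless of how the text was spaced in the PDF.
def join_with_whitespace(lines):
    return "\n".join(" ".join(tokens) for tokens in lines)


tokens = ["References", "Mitchell", "P", ".", "Marcus", ","]
symbols = join_with_whitespace([tokens])
print(symbols)  # References Mitchell P . Marcus ,
```

Because punctuation arrives as its own token, this always puts a space before "." and ",", which is the spurious spacing shown above.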

What we actually want is to keep this fine-grained partitioning of tokens while having .symbols preserve the whitespace (or lack of it) that exists in the original document, when appropriate. Ideally:

>> doc.symbols:
References Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993.

>> doc.tokens:
References
Mitchell
P
.
Marcus
,
Beatrice
Santorini
,
and
Mary
Ann
Marcinkiewicz
.
1993
.

This fix reimplements how PDFPlumberParser pulls information out of PDFPlumber objects and stitches it together. The implementation is a bit tricky to explain, but the new tests give a sense of the functionality.
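The description does not spell out the new logic, so the following is only one plausible way to get the desired output, not the implementation in this PR. It assumes each token carries horizontal coordinates (`x0`, `x1`) and inserts a space only when the gap to the previous token exceeds a tolerance. The `Word` class, the `stitch` function, and the `gap_tol` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """Hypothetical token with its horizontal extent on the page."""
    text: str
    x0: float  # left edge
    x1: float  # right edge


def stitch(words: List[Word], gap_tol: float = 1.0) -> Tuple[str, List[Tuple[int, int]]]:
    """Build a symbols string plus one (start, end) character span per token.

    A space is added only when the horizontal gap between consecutive
    tokens is larger than gap_tol (an assumed threshold).
    """
    symbols = ""
    spans = []
    prev = None
    for word in words:
        if prev is not None and word.x0 - prev.x1 > gap_tol:
            symbols += " "
        start = len(symbols)
        symbols += word.text
        spans.append((start, len(symbols)))
        prev = word
    return symbols, spans


words = [
    Word("Mitchell", 0.0, 40.0),
    Word("P", 43.0, 48.0),
    Word(".", 48.0, 50.0),
    Word("Marcus", 53.0, 85.0),
    Word(",", 85.0, 87.0),
]
symbols, spans = stitch(words)
print(symbols)  # Mitchell P. Marcus,
print([symbols[start:end] for start, end in spans])
```

The tokens stay fine-grained because each one keeps its own span into `symbols`, while the string itself no longer has a space before punctuation that touches the preceding word.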

kyleclo commented 1 year ago

@geli-gel good call on DictionaryWordPredictor. I'll take a look at that PR, but I don't want this one to depend on that predictor, since this is more core functionality.