Kylel/2022 10/hotfix spacing in pdfplumber symbols

Previously, because PDFPlumber tokenization splits on every punctuation mark, Documents ended up with symbols that looked like this:

>> doc.symbols:
References Mitchell P . Marcus , Beatrice Santorini , and Mary Ann Marcinkiewicz . 1993 .

>> doc.tokens:
References
Mitchell
P
.
Marcus
,
Beatrice
Santorini
,
and
Mary
Ann
Marcinkiewicz
.
1993
.

This happened because PDFPlumberParser previously built the .symbols it puts in the Document by concatenating the strings of the detected tokens with either a space " " or a newline "\n", so every token boundary became whitespace.
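
To make the old behaviour concrete, here is a minimal sketch that only illustrates the concatenation described above. It is not the parser's actual code, and the `tokens` list and `old_style_symbols` helper are hypothetical names made up for this example:

```python
from typing import List

# Token strings as PDFPlumber-style tokenization would emit them,
# with every punctuation mark split off into its own token.
tokens: List[str] = [
    "References", "Mitchell", "P", ".", "Marcus", ",",
    "Beatrice", "Santorini", ",", "and", "Mary", "Ann",
    "Marcinkiewicz", ".", "1993", ".",
]


def old_style_symbols(token_texts: List[str]) -> str:
    """Join tokens on a single space, ignoring the original layout."""
    return " ".join(token_texts)


symbols = old_style_symbols(tokens)
print(symbols)
```

Because the separator is added unconditionally, punctuation tokens end up with a space in front of them even though the PDF has none.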

What we actually want is to keep this fine-grained partitioning of tokens while having .symbols preserve the whitespace (or lack of it) from the original document, when appropriate. Ideally:

>> doc.symbols:
References Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993.

>> doc.tokens:
References
Mitchell
P
.
Marcus
,
Beatrice
Santorini
,
and
Mary
Ann
Marcinkiewicz
.
1993
.

This fix reimplements how PDFPlumberParser pulls information out of PDFPlumber objects and stitches it together. The implementation is a bit tricky to explain, so see the new tests to get a sense of the functionality.
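
As a rough illustration of the idea (not the code in this PR), the sketch below stitches tokens together and inserts a separator only when the layout shows a gap. Everything in it is an assumption made for the example: the `Word` class, its `x0`/`x1`/`line` fields, the `stitch` function, and the `gap_tol` threshold are hypothetical and do not come from PDFPlumber or mmda.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """Hypothetical token with horizontal extents (not a PDFPlumber type)."""
    text: str
    x0: float   # left edge of the token
    x1: float   # right edge of the token
    line: int   # index of the text line the token sits on


def stitch(words: List[Word], gap_tol: float = 0.5) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the symbols string plus one (start, end) span per token."""
    pieces: List[str] = []
    spans: List[Tuple[int, int]] = []
    cursor = 0
    prev: Optional[Word] = None
    for word in words:
        if prev is not None:
            if word.line != prev.line:
                sep = "\n"            # new line in the source layout
            elif word.x0 - prev.x1 > gap_tol:
                sep = " "             # visible horizontal gap
            else:
                sep = ""              # tokens touch, e.g. "P" and "."
            pieces.append(sep)
            cursor += len(sep)
        pieces.append(word.text)
        spans.append((cursor, cursor + len(word.text)))
        cursor += len(word.text)
        prev = word
    return "".join(pieces), spans


# Made-up coordinates: punctuation starts exactly where the previous token ends.
words = [
    Word("Mitchell", 0.0, 40.0, 0),
    Word("P", 43.0, 48.0, 0),
    Word(".", 48.0, 50.0, 0),
    Word("Marcus", 53.0, 85.0, 0),
    Word(",", 85.0, 87.0, 0),
]
symbols, spans = stitch(words)
print(symbols)
print([symbols[start:end] for start, end in spans])
```

The tokens stay as fine-grained as before, since each one keeps its own span into the symbols string, while the string itself no longer gains whitespace that the original document lacks.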

allenai / mmda

Kylel/2022 10/hotfix spacing in pdfplumber symbols #166