Should PDF parser sort text elements by x-offset?

fcfort commented 8 years ago

Currently the PDF parser emits elements in the order in which they are read from the PDF textContent() array. This order can differ drastically from how the text elements appear on screen. For instance, Betterment PDF page numbers always appear at the beginning of a given page's text content, even though they are rendered at the bottom of the page.

Perhaps it would be better to first collect all text elements, sort them by x-offset, merge the elements that share a common x-offset, and then emit the joined elements in x-offset order.
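
A minimal sketch of that collect, sort, merge, and emit idea, not the parser's actual code. The `TextElement` shape, the `sortAndMergeByXOffset` name, and the space separator are all assumptions made for illustration. It also assumes that "common x-offset" means exact numeric equality; real PDF coordinates may need a tolerance.

```typescript
// Hypothetical shape of a collected text element; the real parser's type may differ.
interface TextElement {
  text: string;
  x: number; // x-offset of the element (assumed to be a plain number)
}

// Group elements that share an x-offset, join each group's text in read order,
// and return the joined elements sorted by ascending x-offset.
function sortAndMergeByXOffset(
  elements: TextElement[],
  separator: string = " " // assumed joiner; the issue does not specify one
): TextElement[] {
  const groups = new Map<number, string[]>();
  for (const el of elements) {
    const bucket = groups.get(el.x);
    if (bucket) {
      bucket.push(el.text);
    } else {
      groups.set(el.x, [el.text]);
    }
  }
  return Array.from(groups.entries())
    .sort((a, b) => a[0] - b[0])
    .map(([x, texts]) => ({ text: texts.join(separator), x }));
}
```

With this sketch, elements read as `b` (x = 20), `a1` (x = 10), `a2` (x = 10) would be emitted as the joined `a1 a2` element followed by `b`. Elements inside a group keep the order in which they were read.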

fcfort commented 8 years ago

Blocked by #21

fcfort commented 7 years ago

The answer is yes. Raised #41 in response to this question.

fcfort / betterment-csv-chrome, issue #15