Currently the PDF parser emits elements in the order that they are read from the PDF textContent() array. This can be drastically different from how the text elements appear on the screen. For instance, Betterment PDF page numbers always appear at the beginning of a given page's text content even though visually they appear at the bottom of the page when rendered.
Perhaps it is better to first collect all text elements, sort them by x-offset, merge the elements that share a common x-offset, and then emit the joined elements in x-offset order.
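
The collect, sort, merge, and emit steps proposed above could look roughly like the sketch below. This is not the parser's actual code: the `TextElement` shape, the `mergeByXOffset` name, and the `tolerance` parameter are all hypothetical, and which axis counts as the "x-offset" depends on how the parser reads each element's position.

```typescript
// Hypothetical shape of a collected text element; the real parser's type may differ.
interface TextElement {
  text: string;
  x: number; // the element's x-offset, as the parser defines it
}

interface MergedElement {
  x: number;
  text: string;
}

// Sort elements by x-offset, merge those whose offsets match within `tolerance`,
// and return the joined elements in ascending x-offset order.
function mergeByXOffset(elements: TextElement[], tolerance = 0.5): MergedElement[] {
  // Copy before sorting so the caller's array is left untouched.
  // Array.prototype.sort is stable, so equal offsets keep their read order.
  const sorted = [...elements].sort((a, b) => a.x - b.x);

  const groups: { x: number; parts: string[] }[] = [];
  for (const el of sorted) {
    const last = groups[groups.length - 1];
    if (last !== undefined && Math.abs(el.x - last.x) <= tolerance) {
      // Same offset (within tolerance) as the current group: merge into it.
      last.parts.push(el.text);
    } else {
      // A new offset starts a new group.
      groups.push({ x: el.x, parts: [el.text] });
    }
  }

  return groups.map((g) => ({ x: g.x, text: g.parts.join(" ") }));
}

// Made-up sample data: a page number that is read first but has the largest offset.
const sample: TextElement[] = [
  { text: "Page 1", x: 700 },
  { text: "Account", x: 50 },
  { text: "Summary", x: 50 },
  { text: "Balance", x: 120 },
];

const merged = mergeByXOffset(sample);
```

With the sample data, the page number is emitted last even though it was read first, and the two elements at offset 50 are joined into a single element. A small tolerance is used because offsets that look identical on screen may differ slightly as floating-point values.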