Open speedplane opened 9 years ago
I believe I tracked the issue down to commit f638784 (see inline comments). One solution would be to group lines if (1) they have similar start or stop x positions (which we do now), or (2) they overlap horizontally (which we don't do yet).
The following change to LTTextLineHorizontal.find_neighbors should fix the issue:
    def find_neighbors(self, plane, ratio):
        d = ratio*self.height
        # plane.find only returns objects within d above or below this line
        objs = plane.find((self.x0, self.y0-d, self.x1, self.y1+d))
        return [obj for obj in objs
                if (isinstance(obj, LTTextLineHorizontal) and
                    # Ensure they have similar heights
                    abs(obj.height-self.height) < d and (
                        # And that they have similar start or stop x positions
                        abs(obj.x0-self.x0) < d or
                        abs(obj.x1-self.x1) < d or
                        # Or that they intersect each other horizontally.
                        # These checks sit inside the same parenthesized group
                        # so the type and height tests still apply to them.
                        (obj.x0 < self.x0 and obj.x1 > self.x0) or
                        (obj.x0 > self.x0 and obj.x0 < self.x1)))]
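To make the effect of the extra condition easier to see, here is a minimal standalone sketch of the grouping rule. It does not use pdfminer: the `Line` tuple, the helper names, and the coordinates are all hypothetical, and the overlap test is the usual symmetric interval check rather than the exact two-clause form in the patch.

```python
from collections import namedtuple

# Hypothetical stand-in for a text line's bounding box (not a pdfminer class).
Line = namedtuple("Line", "x0 x1 height")


def overlaps_horizontally(a, b):
    """True when the x-ranges of a and b share more than a single point."""
    return a.x0 < b.x1 and b.x0 < a.x1


def is_neighbor(a, b, ratio=0.5, use_overlap=True):
    """Sketch of the grouping rule: similar height, and either aligned
    edges (the current behaviour) or horizontal overlap (the proposal)."""
    d = ratio * a.height
    similar_height = abs(b.height - a.height) < d
    aligned = abs(b.x0 - a.x0) < d or abs(b.x1 - a.x1) < d
    overlap = use_overlap and overlaps_horizontally(a, b)
    return similar_height and (aligned or overlap)


# Made-up coordinates: a full-width line, the indented first line of a
# paragraph, and a line that lies entirely to the right of both.
body = Line(x0=72, x1=500, height=12)
indented = Line(x0=108, x1=480, height=12)
far_right = Line(x0=520, x1=560, height=12)

print(is_neighbor(body, indented, use_overlap=False))  # edge alignment only
print(is_neighbor(body, indented))                     # with the overlap rule
print(is_neighbor(body, far_right))                    # no overlap, not aligned
```

With these numbers the indented line is rejected by the edge-alignment test alone, because neither its start nor its end is within `d` of the full-width line, but it is accepted once horizontal overlap counts.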
Pdfminer does not properly process the text on page 16 of the following document: https://www.sugarsync.com/pf/D62078_93519013_6758203
The correct result would be:
Instead pdfminer results in this:
I believe the reason is that the tabbed (indented) paragraphs prevent the lines from being connected. Below is a screenshot of the PDF page with the problems: