euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

Bad Layout Analysis for Text with Tabbed Paragraphs #82

Open speedplane opened 9 years ago

speedplane commented 9 years ago

Pdfminer does not correctly extract the text on page 16 of the following document: https://www.sugarsync.com/pf/D62078_93519013_6758203

The correct result would be:

CBM2013-00025
US 7,856,430 B1
16
III. CONCLUSION
We conclude Petitioner has proven, by a preponderance of the
evidence, that claims 1–3, 5–7, 9–11, and 13–15 of the ’403 patent are
unpatentable under 35 U.S.C. § 101; and, Patent Owner has not shown it is
entitled to exclude Dr. Freedman’s Declaration, Ex. 1012.
IV. ORDER
For the reasons given, it is hereby:
ORDERED that Petitioner has established by a preponderance of the
evidence that claims 1–3, 5–7, 9–11, and 13–15 of the ’403 patent are
unpatentable;
FURTHER ORDERED that Patent Owner’s Motion to Exclude is
denied;
FURTHER ORDERED that because this is a Final Written Decision,
parties to the proceeding seeking judicial review of the Decision must
comply with the notice and service requirements of 37 C.F.R. § 90.2.

Instead, pdfminer produces this:

CBM2013-00025 
US 7,856,430 B1 

III. CONCLUSION 

evidence, that claims 1–3, 5–7, 9–11, and 13–15 of the ’403 patent are 
unpatentable under 35 U.S.C. § 101; and, Patent Owner has not shown it is 
entitled to exclude Dr. Freedman’s Declaration, Ex. 1012. 
IV. ORDER 

evidence that claims 1–3, 5–7, 9–11, and 13–15 of the ’403 patent are 
unpatentable; 

denied; 

parties to the proceeding seeking judicial review of the Decision must 
comply with the notice and service requirements of 37 C.F.R. § 90.2. 

For the reasons given, it is hereby: 
ORDERED that Petitioner has established by a preponderance of the 
FURTHER ORDERED that Patent Owner’s Motion to Exclude is 
FURTHER ORDERED that because this is a Final Written Decision, 

16

I believe the cause is that the tabbed (indented) first line of each paragraph is not connected to the lines that follow it. Below is a screenshot of the PDF page with the problems:

image

speedplane commented 9 years ago

I believe I tracked down the issue to commit f638784 (see inline comments). One solution would be to group the lines if (1) they have similar start or stop x positions (which we do now); or (2) the lines intersect horizontally (we don't do this yet).

The following change to LTTextLineHorizontal.find_neighbors should fix the issue.

def find_neighbors(self, plane, ratio):
    d = ratio*self.height
    # Candidate objects within d above or below this line.
    objs = plane.find((self.x0, self.y0-d, self.x1, self.y1+d))
    return [obj for obj in objs
            if (isinstance(obj, LTTextLineHorizontal) and
                # Ensure the lines have similar heights
                abs(obj.height-self.height) < d and (
                    # and that they have similar start or stop x positions,
                    abs(obj.x0-self.x0) < d or
                    abs(obj.x1-self.x1) < d or
                    # or that they intersect each other horizontally.
                    # These clauses sit inside the parentheses so that the
                    # isinstance and height checks still apply to them.
                    (obj.x0 < self.x0 and obj.x1 > self.x0) or
                    (obj.x0 > self.x0 and obj.x0 < self.x1)))]
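
To illustrate the difference between the current rule and the proposed one, here is a minimal standalone sketch that does not use pdfminer. The `Line` tuple, the helper names, and all coordinates are made up for illustration. The vertical-proximity filtering that `plane.find` performs is left out, and the overlap test is written as a standard interval-overlap check rather than the two-clause form above.

```python
from collections import namedtuple

# Hypothetical stand-in for LTTextLineHorizontal: only the horizontal
# extent and the height matter for this comparison.
Line = namedtuple("Line", ["x0", "x1", "height"])


def overlaps_horizontally(a, b):
    # True when the x-ranges of the two lines share some span.
    return a.x0 < b.x1 and b.x0 < a.x1


def is_neighbor_old(a, b, ratio):
    # Current rule: similar height AND similar start or stop x position.
    d = ratio * a.height
    return (abs(b.height - a.height) < d and
            (abs(b.x0 - a.x0) < d or abs(b.x1 - a.x1) < d))


def is_neighbor_new(a, b, ratio):
    # Proposed rule: additionally accept lines that overlap horizontally.
    d = ratio * a.height
    return (abs(b.height - a.height) < d and
            (abs(b.x0 - a.x0) < d or
             abs(b.x1 - a.x1) < d or
             overlaps_horizontally(a, b)))


# Made-up coordinates: an indented first line of a paragraph and a
# ragged-right continuation line that starts at the left margin.
first = Line(x0=144.0, x1=500.0, height=12.0)
rest = Line(x0=72.0, x1=530.0, height=12.0)
# A line in a separate column far to the right, for contrast.
other = Line(x0=600.0, x1=700.0, height=12.0)

print(is_neighbor_old(first, rest, 0.5))   # False
print(is_neighbor_new(first, rest, 0.5))   # True
print(is_neighbor_new(first, other, 0.5))  # False
```

With these numbers, the indent (72 points) and the ragged right edge (30 points) both exceed d = 6, so the current rule splits the paragraph, while the overlap clause joins it without pulling in the unrelated column.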