deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Text is processed out of order with pdfminer #345

Open samayer12 opened 4 years ago

samayer12 commented 4 years ago

Description

With Complex_1.pdf as the source document, numbered paragraph 44 is processed out of order: the marker 44. is separated from its text.

Paragraph 44's text appears after the marker for paragraph 45.

To Reproduce

  1. textract.process('Complex_1.pdf', method='pdfminer').decode()
  2. Search the output for the string "44." (My output.)
  3. Observe that the text of paragraph 44 appears after 45 instead of directly after 44.
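
For anyone who wants to check this without reading the whole output by eye, here is a rough sketch of the steps above as a script. The helper names snippet_is_between and check_pdf are my own, not part of textract, and the snippet string is just the opening of paragraph 44 from the expected output. It only tests whether a piece of text falls between two paragraph markers, so treat it as a quick sanity check rather than a proper test.

```python
import re


def snippet_is_between(output, marker, next_marker, snippet):
    """Return True if `snippet` appears after `marker` and before `next_marker`.

    Markers are matched only at the start of a line, so a "44." that
    happens to occur inside a sentence is ignored.
    """
    start = re.search(r"^[ \t]*" + re.escape(marker), output, flags=re.MULTILINE)
    end = re.search(r"^[ \t]*" + re.escape(next_marker), output, flags=re.MULTILINE)
    pos = output.find(snippet)
    if start is None or end is None or pos == -1:
        return False
    return start.end() <= pos < end.start()


def check_pdf(path):
    """Hypothetical wrapper: run the reproduction call and check paragraph 44."""
    import textract  # imported lazily so the helper above works without it

    output = textract.process(path, method="pdfminer").decode()
    return snippet_is_between(output, "44.", "45.", "PFAS are a class of chemicals")
```

Calling check_pdf('Complex_1.pdf') should return False while the bug is present and True once the paragraph text follows its own marker.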

Expected Output

[...]

44. 

PFAS are a class of chemicals encompassing more than 5,000 unique substances.  

45. 

Scientific research demonstrates that members of the class of PFAS can have 

[...]

Desktop:

Additional Info

Possibly better-suited for this fork of pdfminer?