clowder-framework / extractors-s2orc-pdf2text

Extractor to convert pdf to text
Apache License 2.0

Check out pdfminer #14

Closed minump closed 1 year ago

minump commented 1 year ago

Pdfminer is a Python package that converts PDF to text and other formats. It is not actively maintained. A similarly named package, pdfminer.six, is community maintained. See https://pypi.org/project/pdfminer/, https://github.com/euske/pdfminer, https://github.com/pdfminer/pdfminer.six, https://pdfminersix.readthedocs.io/en/latest/index.html
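For reference, a minimal sketch of a PDF-to-text conversion with pdfminer.six, using its `pdfminer.high_level.extract_text` function. The wrapper name `pdf_to_text_file` and its `extract` parameter are hypothetical, introduced here only for illustration; this is not the extractor's actual code.

```python
from pathlib import Path


def pdf_to_text_file(pdf_path, txt_path, extract=None):
    """Convert one PDF to plain text and write the result to txt_path.

    `extract` is a hypothetical hook: any callable that takes a PDF path
    and returns a string. When omitted, pdfminer.six is used.
    """
    if extract is None:
        # Imported lazily so the module still loads where pdfminer.six
        # is not installed (pip install pdfminer.six).
        from pdfminer.high_level import extract_text as extract

    text = extract(str(pdf_path))
    Path(txt_path).write_text(text, encoding="utf-8")
    return text
```

Calling `pdf_to_text_file("arp4655.pdf", "arp4655.txt")` would write the extracted text next to the PDF, assuming the file exists in the working directory.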

minump commented 1 year ago

pdfminer.six and pdfminer give the same result when converting "arp4655.pdf" to text. Several passages in the "Reference" section are missing from the output. The text output is attached: arp4655.txt
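One way such a comparison between two text outputs could be checked is with Python's standard `difflib`. This is only a sketch under assumptions: the thread does not say how the outputs were compared, and the function name `compare_extractions` is hypothetical.

```python
import difflib


def compare_extractions(text_a, text_b):
    """Compare two extracted texts line by line.

    Returns (ratio, only_in_a, only_in_b), where ratio is difflib's
    similarity score in [0, 1] computed over non-empty, stripped lines.
    """
    lines_a = [ln.strip() for ln in text_a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in text_b.splitlines() if ln.strip()]

    ratio = difflib.SequenceMatcher(None, lines_a, lines_b).ratio()

    set_a, set_b = set(lines_a), set(lines_b)
    only_in_a = [ln for ln in lines_a if ln not in set_b]
    only_in_b = [ln for ln in lines_b if ln not in set_a]
    return ratio, only_in_a, only_in_b
```

A ratio of 1.0 with two empty lists means both tools produced the same lines. Note that text missing from both outputs, such as the dropped "Reference" entries, would not show up here; that needs a check against the PDF itself.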

minump commented 1 year ago

Grobid and pdfminer were compared on several PDF files, and the results were shared in the Box folder (RCT-Transparency > pdf manuscripts > grobid-pdf-extractor and pdfminer-pdf-extractor) on June 6th. After discussion, neither proved better than the other. However, only a small sample of PDF files was tested.

Decided to stick with Grobid.

Closing this issue.