TUM-IDP-WS-20 / doc


Examine data and Find a tool to convert PDF to text #15

Closed farukcankaya closed 3 years ago

farukcankaya commented 3 years ago

Parent: #2

farukcankaya commented 3 years ago

Data examination

File Type    File Count
pdf          18452
xlsx         61
txt          6186
docx         4308
csv          3
ini          1
xlsm         1
Icon         27

Analyzed with Python: https://colab.research.google.com/drive/1GOM0JnsUtjxRvbb8X2JduLnEjptki057#scrollTo=ZLkRsX53qNBJ
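
A count like the one in the table could be produced with a short script along these lines. This is only a minimal sketch and not the notebook's actual code: the function name `count_file_types`, the `data` directory, and the `(no extension)` label are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def count_file_types(root):
    """Count files under `root` (recursively) by lowercased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Strip the leading dot; files without an extension share one bucket.
            key = path.suffix.lower().lstrip(".") or "(no extension)"
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # "data" is a placeholder for the dataset directory (assumption).
    for file_type, count in count_file_types("data").most_common():
        print(f"{file_type}\t{count}")
```

Extensions are lowercased so that `report.PDF` and `report.pdf` land in the same row.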


PDF Text Extraction Tools

Python packages to extract plain text from PDF files:

farukcankaya commented 3 years ago

We ran a simple benchmark that compares these tools by counting the number of words each one extracts. Tika and PDFMiner appear to be the best choices: both gave similar results, while the others performed much worse.
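
A word-count comparison of this kind could be sketched as follows. This is a hedged illustration, not the benchmark that was actually run: `count_words`, `compare_extractors`, and the two wrapper functions are hypothetical names. The wrappers assume the `pdfminer.six` and `tika` Python packages are installed (the `tika` package also needs Java and a Tika server), and whitespace splitting is assumed as the definition of a "word".

```python
def count_words(text):
    """Count whitespace-separated tokens; None or empty text counts as 0."""
    return len((text or "").split())


def compare_extractors(extractors, pdf_paths):
    """Return the total word count per extractor over all given PDFs.

    `extractors` maps a tool name to a callable taking a file path and
    returning the extracted text.
    """
    totals = {name: 0 for name in extractors}
    for path in pdf_paths:
        for name, extract in extractors.items():
            try:
                totals[name] += count_words(extract(path))
            except Exception:
                # A failed extraction simply contributes zero words.
                pass
    return totals


def extract_with_pdfminer(path):
    # Assumes the pdfminer.six package; imported lazily so it stays optional.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_with_tika(path):
    # Assumes the tika package; "content" may be None for empty documents.
    from tika import parser
    return parser.from_file(path).get("content") or ""
```

It could then be called as `compare_extractors({"pdfminer": extract_with_pdfminer, "tika": extract_with_tika}, pdf_paths)`. A raw word count only shows how much text each tool recovers, not whether the text is correct, so it works best as a first filter.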