farukcankaya commented 4 years ago

Parent: #2

[x] Examine whether the data we have contains all words in text format or only as scanned images.
[x] Create an environment to process the data (on LRZ, Google Colab, a local computer, etc.).
[x] Find a tool to convert PDF to text

farukcankaya commented 4 years ago

Data examination

File type distribution. File list generated with: find . -not -type d > files.txt (see files.txt)

File Type	File Count
pdf	18452
xlsx	61
txt	6186
docx	4308
csv	3
ini	1
xlsm	1
Icon	27

Analyzed with Python: https://colab.research.google.com/drive/1GOM0JnsUtjxRvbb8X2JduLnEjptki057#scrollTo=ZLkRsX53qNBJ
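
A minimal sketch of how a file type count like the table above could be produced in Python. This is not the notebook's actual code; the function name count_file_types is hypothetical, and keying extensionless files (such as Icon) by their file name is an assumption made to mirror the table.

```python
from collections import Counter
from pathlib import Path


def count_file_types(root):
    """Count regular files under root, grouped by lowercase extension.

    Hypothetical helper for illustration, not taken from the linked notebook.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Assumption: files without an extension (e.g. "Icon") are keyed
            # by their stripped file name so they still show up in the counts.
            key = path.suffix.lower().lstrip(".") or path.name.strip()
            counts[key] += 1
    return counts


# Example usage (prints one "type<TAB>count" row per file type):
# for file_type, count in count_file_types(".").most_common():
#     print(f"{file_type}\t{count}")
```

Lowercasing the extension means that .PDF and .pdf are counted together, which may or may not match how the original table was built.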

Some PDF files have already been converted to .docx files and then to .txt files, e.g. 3_JAR.

PDF Text Extraction Tools

Python packages to extract plain text from PDF files:

Tika: https://tika.apache.org/
PyPDF2: https://pypi.org/project/PyPDF2/
Pdfplumber: https://github.com/jsvine/pdfplumber
PDFminer: https://github.com/pdfminer/pdfminer.six

farukcankaya commented 4 years ago

We ran a simple benchmark to compare these tools by counting the number of words each one extracts. Tika and PDFMiner appear to be the best choices: both gave similar results, while PyPDF2 and Pdfplumber performed much worse.

The comparison results are accessible here and downloadable as: PDF Text Extraction Tools Comparison.xlsx
The PDF text extraction code is accessible here.
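
The word-count comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the linked extraction code: compare_extractors, count_words and the two wrapper functions are hypothetical names, and the wrappers require pdfminer.six and tika (plus a Java runtime) to be installed before they can be called.

```python
def count_words(text):
    """Count whitespace-separated tokens in the extracted text."""
    return len(text.split())


def compare_extractors(pdf_path, extractors):
    """Run each extractor on pdf_path and return {tool name: word count}.

    extractors maps a tool name to a callable taking a path and returning
    text. A tool that raises is recorded as None instead of aborting the run.
    """
    results = {}
    for name, extract in extractors.items():
        try:
            results[name] = count_words(extract(pdf_path) or "")
        except Exception:
            results[name] = None
    return results


def extract_with_pdfminer(pdf_path):
    # Imported lazily so the sketch loads without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def extract_with_tika(pdf_path):
    # Imported lazily; tika-python needs a Java runtime for the Tika server.
    from tika import parser
    return parser.from_file(pdf_path).get("content") or ""


# Example usage ("sample.pdf" is a placeholder path):
# print(compare_extractors("sample.pdf", {
#     "pdfminer": extract_with_pdfminer,
#     "tika": extract_with_tika,
# }))
```

Word count is only a coarse proxy for extraction quality: it shows when a tool drops most of the text, but not whether word order or spacing is correct.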

Examine data and Find a tool to convert PDF to text #15