farukcankaya closed 4 years ago
find . -not -type d > files.txt
files.txt

File Type | File Count |
---|---|
18452 | |
xlsx | 61 |
txt | 6186 |
docx | 4308 |
csv | 3 |
ini | 1 |
xlsm | 1 |
Icon | 27 |
Analyzed with Python: https://colab.research.google.com/drive/1GOM0JnsUtjxRvbb8X2JduLnEjptki057#scrollTo=ZLkRsX53qNBJ
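The notebook itself is not reproduced here. As a rough sketch of how a count like the one in the table could be produced from `files.txt`, assuming one path per line, something along these lines should work. The function names `count_extensions` and `print_table` are illustrative, and lowercasing the extension and treating names without a dot as having an empty extension are assumptions, not necessarily what the notebook does.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(lines):
    """Count file extensions in an iterable of paths, one path per line."""
    counts = Counter()
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines
        # suffix is '' when the name has no extension; drop the leading dot.
        # Lowercasing is an assumption so that .TXT and .txt are merged.
        ext = PurePosixPath(path).suffix.lstrip(".").lower()
        counts[ext] += 1
    return counts


def print_table(listing_path="files.txt"):
    """Print the counts in the same two-column layout as the table."""
    with open(listing_path, encoding="utf-8", errors="replace") as fh:
        counts = count_extensions(fh)
    print("File Type | File Count |")
    print("---|---|")
    for ext, n in counts.most_common():
        print(f"{ext} | {n} |")
```

Calling `print_table("files.txt")` in the directory where the `find` command was run would print one row per extension, most frequent first.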
Python packages to extract plain text from PDF files:
We ran a simple benchmark that compared these tools by counting the number of words each one extracted. Tika and PDFMiner look like the best choices: both gave similar results, while the other tools performed very poorly.
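A minimal sketch of this kind of word-count comparison is shown below. It is not the benchmark code that was actually run. The helper names (`word_count`, `compare_extractors`, `extract_with_pdfminer`, `extract_with_tika`) are hypothetical, and it assumes the `pdfminer.six` and `tika` packages are installed (Tika also needs a Java runtime). Failed extractions are counted as zero words, which is also an assumption.

```python
def word_count(text):
    """Number of whitespace-separated tokens; None counts as zero."""
    return len((text or "").split())


def extract_with_pdfminer(path):
    # Imported lazily so the comparison helper works without the package.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    return extract_text(path)


def extract_with_tika(path):
    from tika import parser  # pip install tika (requires Java)
    # from_file returns a dict; "content" may be None for unreadable files.
    return parser.from_file(path).get("content")


def compare_extractors(paths, extractors):
    """Sum extracted word counts per extractor over all given PDF paths."""
    totals = {name: 0 for name in extractors}
    for path in paths:
        for name, extract in extractors.items():
            try:
                totals[name] += word_count(extract(path))
            except Exception:
                # Assumption: a failed extraction contributes zero words.
                pass
    return totals
```

For example, `compare_extractors(pdf_paths, {"pdfminer": extract_with_pdfminer, "tika": extract_with_tika})` returns a dictionary of total word counts per tool, where `pdf_paths` is a list of PDF file paths. A raw word count only shows how much text each tool recovers, not whether that text is correct.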
Parent: #2