MESH-Research / knowledge-commons-works

The next-generation research repository for the Knowledge Commons (formerly Humanities Commons)
https://hcommons.org
MIT License

Recommendations and sample code for docx and ppt #293

Closed: monotasker closed this issue 9 months ago

koutiany commented 9 months ago

Three libraries for .docx:

  1. python-docx: the most commonly used one; footnotes are not included.
  2. docx2txt: no footnotes, but can get text out of images without any extra functions being written.
  3. mammoth: converts the .docx to HTML, which is then parsed with Beautiful Soup; the output can be a bit messy, but footnotes are definitely included.
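
A minimal sketch of how the three options above could be run side by side on the same file. This is not the sample code from the .md file; the function names, the `EXTRACTORS` table, and the `compare` helper are hypothetical names chosen for illustration. The third-party imports are kept inside each function so that only the libraries actually being tried need to be installed.

```python
def with_python_docx(path):
    """Body paragraphs via python-docx (footnotes are not returned)."""
    import docx  # third-party: python-docx

    return "\n".join(p.text for p in docx.Document(path).paragraphs)


def with_docx2txt(path):
    """Plain text via docx2txt (no footnotes)."""
    import docx2txt  # third-party: docx2txt

    return docx2txt.process(path)


def with_mammoth(path):
    """Convert to HTML with mammoth, then flatten it with Beautiful Soup."""
    import mammoth  # third-party: mammoth
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    with open(path, "rb") as docx_file:
        html = mammoth.convert_to_html(docx_file).value
    # Footnotes are part of the generated HTML, so they end up in the text too.
    return BeautifulSoup(html, "html.parser").get_text("\n")


# Hypothetical lookup table: label -> extractor function.
EXTRACTORS = {
    "python-docx": with_python_docx,
    "docx2txt": with_docx2txt,
    "mammoth": with_mammoth,
}


def compare(path, extractors=EXTRACTORS):
    """Run each extractor on `path` and return {label: extracted text}."""
    return {name: extract(path) for name, extract in extractors.items()}
```

Calling `compare("sample.docx")` (a placeholder file name) returns one string per library, which makes it easy to check by eye which outputs contain the footnote text.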

For full updates, see the .md file on GitHub.

koutiany commented 9 months ago

Selected one library that fits all cases, 'python-pptx', and extended it with 'tesseract OCR' and 'timeit'. For full updates, see the .md file on GitHub.
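
A rough sketch of what combining python-pptx with Tesseract OCR and `timeit` might look like. It is an illustration under assumptions, not the code from the .md file: `extract_pptx_text` and `time_call` are hypothetical names, OCR is assumed to go through the `pytesseract` wrapper with Pillow, and only embedded (not linked) pictures are handled.

```python
import io
import timeit


def extract_pptx_text(path, ocr=True):
    """Collect text-frame text from every slide; optionally OCR embedded pictures."""
    from pptx import Presentation  # third-party: python-pptx
    from pptx.enum.shapes import MSO_SHAPE_TYPE

    chunks = []
    for slide in Presentation(path).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                chunks.append(shape.text_frame.text)
            elif ocr and shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                # Assumption: OCR via pytesseract, which needs the Tesseract
                # binary installed on the system.
                import pytesseract  # third-party: pytesseract
                from PIL import Image  # third-party: Pillow

                image = Image.open(io.BytesIO(shape.image.blob))
                chunks.append(pytesseract.image_to_string(image))
    return "\n".join(chunks)


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single calls."""
    return min(timeit.repeat(lambda: func(*args), number=1, repeat=repeat))
```

For example, `time_call(extract_pptx_text, "slides.pptx")` (a placeholder file name) gives a rough measure of how much the OCR step adds compared with `time_call(extract_pptx_text, "slides.pptx", False)`.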