clowder-framework / extractors-s2orc-pdf2text

Extractor to convert pdf to text
Apache License 2.0

Convert word docx files to txt format #10

Closed minump closed 1 day ago

minump commented 1 year ago

Convert Word docx files to a flat XML format (a single XML file instead of multiple XML files). Feed the converted XML file to grobid s2orc, which converts it to txt.

minump commented 1 year ago

Check out https://github.com/microsoft/Simplify-Docx

minump commented 1 year ago

Some docx files (files that started as doc, were later opened as docx, and have unusual formatting) are not fully readable by the python docx module. The error is shown below:

File "/extractors-s2orc-pdf2text/venv/lib/python3.9/site-packages/simplify_docx/elements/base.py", line 36, in __init__
    self.props[prop] = getattr(x, prop)
AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'
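
One possible workaround for files like this, sketched here with the standard library only, is to read `word/document.xml` straight out of the docx archive and collect the text of the `w:t` elements, so that field markers such as `w:fldChar` are never accessed. This is an illustration under stated assumptions, not what the extractor does: the function name `docx_to_text` is hypothetical, and the sketch ignores headers, footers, footnotes, and paragraphs nested in text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return body text of a .docx, one line per paragraph.

    `source` is a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Only w:t elements are read; w:fldChar and w:instrText
        # (field markers and field codes) are skipped entirely.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)
```

Because it only looks at `w:t` nodes, the sketch loses structure (tables, lists, headings), so it suits a plain txt target rather than input for a layout-aware parser.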

minump commented 1 day ago

Closing this issue. For the Consort project, the Word input files are converted to PDF files for further processing. The following extractor is used for this purpose: https://github.com/clowder-framework/extractors-soffice
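
For reference, a Word-to-PDF conversion of this kind can be sketched with LibreOffice's headless command line. This is a minimal illustration, assuming a `soffice` binary is on the PATH; the helper names `build_soffice_command` and `convert_docx_to_pdf` are hypothetical, and the linked extractor's actual invocation may differ.

```python
import subprocess
from pathlib import Path


def build_soffice_command(docx_path, out_dir, binary="soffice"):
    """Build the LibreOffice headless command that converts one file to PDF."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_docx_to_pdf(docx_path, out_dir, timeout=120):
    """Run the conversion and return the expected path of the PDF.

    Assumes LibreOffice is installed; raises CalledProcessError if
    soffice exits with a non-zero status.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_soffice_command(docx_path, out_dir),
        check=True,
        timeout=timeout,
    )
    # LibreOffice names the output after the input file's stem.
    return out_dir / (Path(docx_path).stem + ".pdf")
```

The resulting PDF can then go through the existing pdf-to-text path, which avoids parsing the docx XML at all.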