Closed minump closed 1 day ago
Some docx files (files that started as doc, were later opened as docx, and have unusual formatting) are not fully readable by the Python docx module. The error is shown below:
File "/extractors-s2orc-pdf2text/venv/lib/python3.9/site-packages/simplify_docx/elements/base.py", line 36, in __init__
self.props[prop] = getattr(x, prop)
AttributeError: 'lxml.etree._Element' object has no attribute 'fldCharType'
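The failing line reads each property with a plain `getattr(x, prop)`, which raises as soon as an element lacks that attribute. A minimal sketch of a more tolerant lookup follows. It is not the simplify_docx code: the helper `collect_props` and the stand-in class `PlainElement` are hypothetical names used only for illustration.

```python
# Hypothetical illustration; not part of simplify_docx or python-docx.
_MISSING = object()  # sentinel so that a real value of None is still kept


def collect_props(element, prop_names):
    """Copy the named attributes into a dict, skipping any that are absent."""
    props = {}
    for prop in prop_names:
        value = getattr(element, prop, _MISSING)
        if value is not _MISSING:
            props[prop] = value
    return props


class PlainElement:
    """Stand-in for an element that has `tag` but no `fldCharType`."""
    tag = "w:fldChar"


# The missing attribute is skipped instead of raising AttributeError.
props = collect_props(PlainElement(), ["tag", "fldCharType"])
```

Skipping a missing property hides the symptom but may drop field information from the output, so whether this is acceptable depends on what the downstream processing needs.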
Closing this issue. For the Consort project, the Word input files are converted to PDF files for further processing, using the extractor below: https://github.com/clowder-framework/extractors-soffice
Convert Word docx files to flat XML format (a single XML file instead of multiple XML files), then pass the converted XML file to grobid s2orc, which converts it to txt.
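Both conversions mentioned above (docx to PDF, and docx to a single flat XML file) can be sketched with LibreOffice's headless command line. This is an assumption about the approach, not a description of what the extractors-soffice extractor actually runs. The helper names are hypothetical, and `fodt` (LibreOffice's flat XML ODF text format) is only one possible reading of "flat-xml"; substitute whichever target format the pipeline really expects.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, target, soffice="soffice"):
    """Build a headless LibreOffice conversion command (hypothetical helper)."""
    return [
        soffice,
        "--headless",            # run without a GUI
        "--convert-to", target,  # e.g. "pdf", or "fodt" (assumed flat XML target)
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def expected_output_path(docx_path, out_dir, target):
    """Path LibreOffice is expected to write: same stem, new extension."""
    return Path(out_dir) / (Path(docx_path).stem + "." + target)


def convert(docx_path, out_dir, target, soffice="soffice"):
    """Run the conversion; requires LibreOffice to be installed and on PATH."""
    cmd = build_convert_command(docx_path, out_dir, target, soffice)
    subprocess.run(cmd, check=True)
    return expected_output_path(docx_path, out_dir, target)
```

For example, `convert("paper.docx", "out", "pdf")` would be expected to produce `out/paper.pdf`, and `convert("paper.docx", "out", "fodt")` a single flat XML file at `out/paper.fodt`.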