CenterForOpenScience / pydocx

An extendable docx file format parser and converter

Error when reading doc file #253

Closed Freakwill closed 5 years ago

Freakwill commented 5 years ago

I tried to read a .doc file with the following code but got an error. The same code works well with a .docx file. How do I fix it?

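# Import path assumed from the module shown in the traceback (pydocx/export/base.py).
from pydocx.export.base import PyDocXExporter

with open('path/to/file.doc', 'rb') as fp:
    # Print each exported chunk as it is produced.
    for result in PyDocXExporter(fp).export():
        print(result)
    # Export again and join all chunks into a single string.
    text = ''.join(result for result in PyDocXExporter(fp).export())
    print(text)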

Traceback:
    for result in PyDocXExporter(fp).export():
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pydocx/export/base.py", line 110, in export
    document = self.main_document_part.document
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pydocx/openxml/packaging/main_document_part.py", line 49, in document
    self._document = self.load_document()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pydocx/openxml/packaging/main_document_part.py", line 53, in load_document
    self._document = Document.load(self.root_element, container=self)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pydocx/models.py", line 280, in load
    other=element.tag,
pydocx.models.XmlRootElementMismatchException: Expected root element document but got themeManager instead

jlward commented 5 years ago

This library does not work with .doc files; it only works with .docx files, specifically the MS Word 2007 XML format and newer versions of docx.
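
Since only .docx is supported, it may help to check what a file really is before handing it to the exporter, because the extension alone can be misleading. The following is a minimal sketch using only the standard library, not part of pydocx: the helper name sniff_word_format is hypothetical, and it assumes the main part lives at the conventional word/document.xml path, which the packaging format does not strictly require.

```python
import zipfile

# Magic bytes at the start of a legacy OLE2 compound file (.doc, .xls, ...).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Guess 'docx', 'doc', or 'unknown' from the file contents.

    Hypothetical helper for illustration; it ignores the file extension.
    """
    with open(path, "rb") as fp:
        header = fp.read(8)
    if header == OLE2_MAGIC:
        # Legacy binary Word format, which pydocx does not parse.
        return "doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
        # Assumption: a typical .docx package has both of these entries.
        if "[Content_Types].xml" in names and "word/document.xml" in names:
            return "docx"
    return "unknown"
```

If this returns 'doc', one option is to re-save the file as .docx in a word processor first and then run the original code on the converted file.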