ShayHill / docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
https://docx2python.readthedocs.io/en/latest/
MIT License

Runs slower than the 2020-11-16 version #33

Closed shm007g closed 2 years ago

shm007g commented 2 years ago

I ran the new version of docx2python on my files, and it runs slower than the previous version I used.

My timings are posted below. The new version takes far more time than the previous one.


Machine: Mac mini (2018)

test files


New Version (requires Python 3.7)

>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 10.90s
>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 11.36s
>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 11.75s
>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 11.89s
>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 11.18s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 12.39s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 11.80s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 11.78s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 11.75s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 12.57s

Old Version

>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 4.26s
>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 4.34s
>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 4.22s
>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 4.32s
>>> docx2xml ./files/文23储气库地面工程可行性研究(0)-2019.8.8.docx 200 cost 4.28s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 4.72s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 5.31s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 4.43s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 5.00s
>>> docx2xml ./files/BD19456-吉兰泰油田产能地面建设及配套工程可研报告A版-总说明书.docx 200 cost 4.92s

ShayHill commented 2 years ago

Thank you for the report.

This is due to new features that require stitching docx runs together. Inside the document, Word breaks up words based on spellcheck tests, revision times, etc., so a paragraph might look like:

<w:p><w:r>Tw</w:r><w:r>o </w:r><w:r>w</w:r><w:r>or</w:r><w:r>d</w:r><w:r>s</w:r></w:p>

Recent versions of docx2python merge these runs into:

<w:p><w:r>Two words</w:r></w:p>

This allows for a lot of code simplification and also enables newer features like text replacement. However, it does slow things down.
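To make the idea concrete, here is a minimal sketch of run merging on the simplified XML above. It is not docx2python's actual code: `merge_runs` and the sample string are hypothetical, and the sketch ignores the `<w:t>` elements and per-run formatting that real documents carry. It only shows why merging adds an extra pass over every paragraph.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, bound to the "w" prefix in docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def merge_runs(paragraph):
    """Collapse all text in a paragraph into a single <w:r> run, in place.

    Hypothetical helper for illustration only. It assumes every run may be
    merged, i.e. that no run carries formatting that must be preserved.
    """
    # Gather the text of every fragment in document order.
    text = "".join(paragraph.itertext())
    # Drop the fragmented runs (and any text that trails them).
    for child in list(paragraph):
        paragraph.remove(child)
    paragraph.text = None
    # Replace them with one run holding the stitched text.
    merged = ET.SubElement(paragraph, f"{{{W_NS}}}r")
    merged.text = text
    return paragraph


# Simplified sample paragraph, with the namespace declared so it parses.
sample = (
    f'<w:p xmlns:w="{W_NS}">'
    "<w:r>Tw</w:r><w:r>o </w:r><w:r>w</w:r><w:r>or</w:r><w:r>d</w:r><w:r>s</w:r>"
    "</w:p>"
)
p = ET.fromstring(sample)
merge_runs(p)
print(ET.tostring(p, encoding="unicode"))
```

Even in this stripped-down form, every paragraph has to be walked and rebuilt before any text is extracted, which is extra work compared with reading the runs as they are.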

If you're looking for a fast, simple export, python-docx2txt might suit your needs.

Thank you again.

-Shay