ShayHill / docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
https://docx2python.readthedocs.io/en/latest/
MIT License
154 stars 34 forks

Feature Request: Add support for "strict" format #62

Closed Spectre5 closed 1 month ago

Spectre5 commented 2 months ago

It would be great if the strict format were supported for docx/docm files. I think it basically just requires different namespace (ns) tags. Here are the tags used in a similar project, mammoth (https://github.com/mwilliamson/python-mammoth/blob/master/mammoth/docx/office_xml.py). It does not include as many tags as docx2python does, though, so I'm not sure what the strict-format equivalents are for some of the other tags in this library's namespace.py file.

If you have a file that includes all of the tags for this library, then you could save it in strict format to see what those tags become.

ShayHill commented 2 months ago

What would the advantage be with regard to text extraction?

Spectre5 commented 2 months ago

Well, right now the library cannot extract text from a strict docx file. We have some automatically created docx files saved in the strict format, and I was hoping to parse their text.
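To illustrate why extraction comes back empty, here is a minimal sketch using only the standard library. The XML string is a hypothetical, stripped-down stand-in for `word/document.xml`, and `find_text` is a made-up helper, not anything from docx2python. The two namespace URIs are the main WordprocessingML namespaces for the transitional and strict flavors.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace URIs (transitional vs. strict).
TRANSITIONAL_W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STRICT_W = "http://purl.oclc.org/ooxml/wordprocessingml/main"

# Hypothetical, minimal stand-in for word/document.xml in a strict file.
STRICT_XML = (
    f'<w:document xmlns:w="{STRICT_W}">'
    "<w:body><w:p><w:r><w:t>Hello strict</w:t></w:r></w:p></w:body>"
    "</w:document>"
)


def find_text(xml_string, w_namespace):
    """Collect the text of every <w:t> element under the given namespace URI."""
    root = ET.fromstring(xml_string)
    # ElementTree stores tags as "{namespace-uri}localname".
    return [t.text for t in root.iter(f"{{{w_namespace}}}t")]


# A parser hard-coded to the transitional namespace finds no text runs ...
transitional_hits = find_text(STRICT_XML, TRANSITIONAL_W)
# ... while the same lookup with the strict namespace finds them.
strict_hits = find_text(STRICT_XML, STRICT_W)
```

The element names are identical in both flavors; only the URI bound to the `w` prefix differs, which is why a hard-coded tag table silently matches nothing.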

ShayHill commented 2 months ago

I will have a look around. Thank you.

ShayHill commented 2 months ago

I took a look at this. Currently, docx2python v2 explicitly defines namespaces. This is a legacy of docx2python v1, which used the xml module from the standard library. The way to handle strict and other surprises should be to load the namespaces from the input documents and dynamically create tags. I want to do this, but fear it might break some projects out there, so I am going to plan this for docx2python 3, which I might create over the next few weekends.
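A rough sketch of that idea, using the standard library rather than whatever docx2python 3 ends up doing: read the prefix-to-URI declarations from the document itself, then build qualified tags from them. The helper names (`load_namespaces`, `qn`, `extract_text`, `sample_document`) are hypothetical, and the sketch assumes the document binds the conventional `w` prefix, which is not guaranteed.

```python
import io
import xml.etree.ElementTree as ET

TRANSITIONAL_W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STRICT_W = "http://purl.oclc.org/ooxml/wordprocessingml/main"


def load_namespaces(xml_bytes):
    """Map each declared prefix to its URI, as declared in the document."""
    nsmap = {}
    # "start-ns" events yield (prefix, uri) pairs for every xmlns declaration.
    for _event, (prefix, uri) in ET.iterparse(
        io.BytesIO(xml_bytes), events=("start-ns",)
    ):
        nsmap.setdefault(prefix, uri)  # keep the first binding of each prefix
    return nsmap


def qn(nsmap, prefixed_tag):
    """Turn 'w:t' into '{uri}t' using the namespaces found in the document."""
    prefix, local = prefixed_tag.split(":")
    return f"{{{nsmap[prefix]}}}{local}"


def extract_text(xml_bytes):
    """Return the text of every w:t element, whatever URI 'w' is bound to."""
    nsmap = load_namespaces(xml_bytes)
    root = ET.fromstring(xml_bytes)
    return [t.text for t in root.iter(qn(nsmap, "w:t"))]


def sample_document(w_uri):
    """Hypothetical minimal document.xml body for a given 'w' namespace."""
    return (
        f'<w:document xmlns:w="{w_uri}">'
        "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>"
        "</w:document>"
    ).encode("utf-8")
```

With this approach the same `extract_text` call handles both `sample_document(TRANSITIONAL_W)` and `sample_document(STRICT_W)`, because no namespace URI is hard-coded in the lookup.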

Spectre5 commented 2 months ago

I agree that would be the best way to handle it. For what it's worth, that is what pylightxl does for .xlsx/.xlsm files, if you want some inspiration.

ShayHill commented 2 months ago

I uploaded a branch that should work with strict docx files.

https://github.com/ShayHill/docx2python/tree/v3

If you try it, please let me know if you have any files that don't work. I will release this on PyPI once I have made a few other v3 updates.

Spectre5 commented 1 month ago

I haven't had a chance to try it yet - but will try to soon.

ShayHill commented 1 month ago

Version 3.0.0 is now up on PyPI. It should work with strict Word files.