ShayHill / docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
https://docx2python.readthedocs.io/en/latest/
MIT License
163 stars 34 forks

Feature to work with Comments #52

Closed: flyguy62n closed this issue 7 months ago

flyguy62n commented 7 months ago

Using Docx2Python in a data discovery tool to find PII and other sensitive data. Reading comments in DOCX files would be fantastic. I've attached a sample file. For my part, it doesn't need to be fancy -- even returning comments and their responses as a list of strings would do the trick.

lorem-ipsum-1line-comments.docx (https://github.com/ShayHill/docx2python/files/14805109/lorem-ipsum-1line-comments.docx)
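
For context, the kind of output requested here can be sketched with the standard library alone. This is not how docx2python does it; it only assumes the usual OOXML layout, in which comments live in word/comments.xml as w:comment elements carrying w:author and w:date attributes. The function name read_comments is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by w:comment, w:p and w:t elements.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_comments(docx_file):
    """Return [(author, date, text), ...] from a docx path or file-like object.

    Hypothetical helper; returns [] when the archive has no comments part.
    """
    with zipfile.ZipFile(docx_file) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("word/comments.xml"))
    comments = []
    for comment in root.iter(f"{W}comment"):
        # Join the text runs of each paragraph, then join paragraphs with newlines.
        paragraphs = [
            "".join(t.text or "" for t in par.iter(f"{W}t"))
            for par in comment.iter(f"{W}p")
        ]
        comments.append(
            (comment.get(f"{W}author"), comment.get(f"{W}date"), "\n".join(paragraphs))
        )
    return comments
```

Calling read_comments("lorem-ipsum-1line-comments.docx") would return one tuple per comment. Replies are stored as ordinary w:comment elements in the same file, so they would show up as additional entries rather than nested under their parent.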

ShayHill commented 7 months ago

I like that idea. Will have a look next week. Should be doable.

ShayHill commented 7 months ago

I've deployed 2.9.0 to test PyPI.

from docx2python import docx2python

pars = docx2python("word_doc_with_comments.docx")
comments = pars.comments
pars.close()

# comments is a list of tuples:
# [
#     (reference text, author, date, comment text),
#     ...
# ]

It seems to work, but I don't have a lot of docx files with comments to test. I'm going to look around for some old, ugly files full of comments to test before I upload to official PyPI.
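
For the data discovery use case raised at the top of the thread, the tuples can be scanned like any other strings. A minimal sketch, assuming each entry is a 4-tuple of (reference text, author, date, comment text) as described above; the find_emails helper and the email regex are illustrative assumptions, not part of docx2python, and the sample data is made up.

```python
import re

# Deliberately simple pattern for illustration; real PII detection needs more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_emails(comments):
    """Yield (author, email) for each email address found in comment tuples.

    Assumes each item is (reference text, author, date, comment text).
    """
    for reference_text, author, _date, comment_text in comments:
        # Check both the commented-on text and the comment itself.
        for field in (reference_text, comment_text):
            for match in EMAIL_RE.finditer(field):
                yield author, match.group()


# Made-up sample shaped like the tuples described above.
sample = [
    ("Lorem ipsum", "Alice", "2024-03-29", "Send this to jane.doe@example.com"),
    ("dolor sit amet", "Bob", "2024-03-30", "No contact details here"),
]
hits = list(find_emails(sample))
```

With real data, sample would be replaced by pars.comments, read before pars.close() as in the snippet above.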

flyguy62n commented 7 months ago

I probably have some files I can throw at it too. If I get a chance this weekend, I'll give it a shot and let you know. First part of next week at the latest.

ShayHill commented 7 months ago

Sounds great. Thank you.

flyguy62n commented 7 months ago

Finally got to this today. A couple of notes:

  1. Using Python 3.12.2 on Windows

  2. In the attached file, the actual hyperlink in the comment points to www.gooogle.com, but pars.comments reports it as href="styles.xml". Pretty minor, but if there's some magic to put the real href in there....

     test_file_with_comments.docx

  3. On the second file I tried I got a traceback. Unfortunately, I can't upload it as it has real customer data in it. To draw some contrast:

If there's other info I can provide about the offending file, just let me know.
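
On item 2: in the OOXML package format, a hyperlink stores only a relationship id, and that id is resolved against the relationships file belonging to the same part, so an id found in word/comments.xml is looked up in word/_rels/comments.xml.rels rather than in word/_rels/document.xml.rels. Whether that is what produced href="styles.xml" here is only a guess. The sketch below shows the lookup with the standard library; rels_path_for and resolve_target are hypothetical helper names, not docx2python functions.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the <Relationship> elements inside every .rels file.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def rels_path_for(part_name):
    """Map a part such as 'word/comments.xml' to its relationships file."""
    folder, name = posixpath.split(part_name)
    return posixpath.join(folder, "_rels", name + ".rels")


def resolve_target(docx_file, part_name, rel_id):
    """Return the Target for rel_id as seen from part_name, or None."""
    with zipfile.ZipFile(docx_file) as archive:
        rels_path = rels_path_for(part_name)
        if rels_path not in archive.namelist():
            return None
        root = ET.fromstring(archive.read(rels_path))
    for rel in root.iter(f"{REL_NS}Relationship"):
        # Id and Target are plain (un-namespaced) attributes.
        if rel.get("Id") == rel_id:
            return rel.get("Target")
    return None
```

The same id can therefore mean different things in different parts: resolve_target(path, "word/comments.xml", "rId1") and resolve_target(path, "word/document.xml", "rId1") read two separate files and may return unrelated targets.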

comments traceback:

>>> comments=pars.comments
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 266, in comments
    depth_collector = office_document.depth_collector
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_reader.py", line 237, in depth_collector
    self.__depth_collector = self.__depth_collector or new_depth_collector(self)
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 384, in new_depth_collector
    branches(root)
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
    branches(branch)
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
    branches(branch)
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
    branches(branch)
  [Previous line repeated 1 more time]
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 376, in branches
    recurse_into_tree = tag_runner.open(tree)
                        ^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 139, in open
    return method(tree)
           ^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 183, in _open_comment_range_start
    self.tables.start_comment_range(tree.attrib[qn("w:id")])
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 127, in start_comment_range
    cruns = self._count_runs()
            ^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 120, in _count_runs
    return len(list(self._runs_so_far))
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 108, in _runs_so_far
    assert len(self.open_pars) == 1
AssertionError

text=pars.text traceback:

>>> text=pars.text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 212, in text
    return flatten_text(self.document_runs, do_pStyle)
                        ^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 199, in document_runs
    + self.body_runs
      ^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 155, in body_runs
    return self.officeDocument_runs
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 145, in officeDocument_runs
    return self._get_runs("officeDocument")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 120, in _get_runs
    content += file.content
               ^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_reader.py", line 246, in content
    return self.get_content()
           ^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_reader.py", line 259, in get_content
    return cast(TablesList, self.depth_collector.tree)
                            ^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_reader.py", line 237, in depth_collector
    self.__depth_collector = self.__depth_collector or new_depth_collector(self)
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 384, in new_depth_collector
    branches(root)
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
    branches(branch)
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
    branches(branch)
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
    branches(branch)
  [Previous line repeated 1 more time]
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 376, in branches
    recurse_into_tree = tag_runner.open(tree)
                        ^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 139, in open
    return method(tree)
           ^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 183, in _open_comment_range_start
    self.tables.start_comment_range(tree.attrib[qn("w:id")])
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 127, in start_comment_range
    cruns = self._count_runs()
            ^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 120, in _count_runs
    return len(list(self._runs_so_far))
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 108, in _runs_so_far
    assert len(self.open_pars) == 1
AssertionError

ShayHill commented 7 months ago

Both of these should be fixed now. Please let me know if you see the same failures with 2.9.2.

flyguy62n commented 7 months ago

That did it. Both text and comments properly parsed in 2.9.2. Thanks!

ShayHill commented 7 months ago

Glad to hear it. I'm moving this to PyPI main.