Closed flyguy62n closed 7 months ago
I like that idea. Will have a look next week. Should be doable.
On Mar 29, 2024, at 09:16, Randy Bartels wrote:
Using Docx2Python in a data discovery tool to find PII and other sensitive data. Reading comments in DOCX files would be fantastic. I've attached a sample file. For my part, it doesn't need to be fancy -- even returning comments and their responses as a list of strings would do the trick.
lorem-ipsum-1line-comments.docx: https://github.com/ShayHill/docx2python/files/14805109/lorem-ipsum-1line-comments.docx
I've deployed 2.9.0 to test PyPI.
from docx2python import docx2python

pars = docx2python("word_doc_with_comments.docx")
comments = pars.comments
pars.close()

pars.comments => [
    (reference text, author, date, comment text),
    ...
]
It seems to work, but I don't have a lot of docx files with comments to test. I'm going to look around for some old, ugly files full of comments to test before I upload to official PyPI.
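For the "list of strings" use case from the original request, the tuples above can be flattened after reading. This is a minimal sketch, assuming each entry is a 4-tuple of strings in the order shown (reference text, author, date, comment text). The helper names `comments_to_strings` and `read_comment_strings` are hypothetical and not part of docx2python, and the output format is only one possible choice.

```python
from typing import Iterable, List, Tuple

# Assumed shape, taken from the example above:
# (reference text, author, date, comment text)
Comment = Tuple[str, str, str, str]


def comments_to_strings(comments: Iterable[Comment]) -> List[str]:
    """Flatten comment tuples into one searchable string per comment."""
    return [
        f"{author} ({date}) on {ref!r}: {text}"
        for ref, author, date, text in comments
    ]


def read_comment_strings(path: str) -> List[str]:
    """Open a DOCX file, read its comments, and always close the file."""
    # Imported here so the pure helper above works without docx2python installed.
    from docx2python import docx2python

    pars = docx2python(path)
    try:
        return comments_to_strings(pars.comments)
    finally:
        pars.close()
```

Keeping the flattening step separate from file access makes it easy to test on hand-written tuples before pointing it at real documents.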
I probably have some files I can throw at it too. If I get a chance this weekend, I'll give it a shot and let you know. First part of next week at the latest.
Sounds great. Thank you.
Finally got to this today. A couple of notes:
Using Python 3.12.2 on Windows
On the attached file, the actual hyperlink in the comment points to www.gooogle.com, but pars.comments notes it as href="styles.xml". Pretty minor, but if there's some magic to put the real href in there....
test_file_with_comments.docx
On the second file I tried I got a traceback. Unfortunately, I can't upload it as it has real customer data in it. To draw some contrast:
- text can be extracted using v2.8.0, but not with 2.9.1.
- Extracting text from this file using 2.9.1 also generated an exception.
- comments and text can be extracted from similar files created from the same template using 2.9.1.
If there's other info I can provide about the offending file, just let me know.
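On the href="styles.xml" note: one possible explanation (an assumption, not confirmed anywhere in this thread) is that the hyperlink's relationship id was resolved against the wrong relationships part. In a DOCX package each part has its own relationships file, so the same id can point to different targets in word/_rels/document.xml.rels and word/_rels/comments.xml.rels. A minimal standard-library sketch for inspecting both follows; `read_rels` is a hypothetical helper, not part of docx2python.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship (.rels) files.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def read_rels(docx, part_name):
    """Return {relationship id: target} for one part under word/.

    docx is a path or a binary file-like object; part_name is a file
    name such as "document.xml" or "comments.xml". Returns an empty
    dict if the part has no relationships file.
    """
    rels_path = f"word/_rels/{part_name}.rels"
    with zipfile.ZipFile(docx) as archive:
        if rels_path not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read(rels_path))
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(f"{RELS_NS}Relationship")
    }
```

Comparing `read_rels("test_file_with_comments.docx", "document.xml")` with `read_rels("test_file_with_comments.docx", "comments.xml")` would show whether one id maps to styles.xml in the first and to the real link in the second.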
comments traceback:
>>> comments=pars.comments
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 266, in comments
depth_collector = office_document.depth_collector
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_reader.py", line 237, in depth_collector
self.__depth_collector = self.__depth_collector or new_depth_collector(self)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 384, in new_depth_collector
branches(root)
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
branches(branch)
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
branches(branch)
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
branches(branch)
[Previous line repeated 1 more time]
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 376, in branches
recurse_into_tree = tag_runner.open(tree)
^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 139, in open
return method(tree)
^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 183, in _open_comment_range_start
self.tables.start_comment_range(tree.attrib[qn("w:id")])
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 127, in start_comment_range
cruns = self._count_runs()
^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 120, in _count_runs
return len(list(self._runs_so_far))
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 108, in _runs_so_far
assert len(self.open_pars) == 1
AssertionError
text=pars.text traceback:
>>> text=pars.text
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 212, in text
return flatten_text(self.document_runs, do_pStyle)
^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 199, in document_runs
+ self.body_runs
^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 155, in body_runs
return self.officeDocument_runs
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 145, in officeDocument_runs
return self._get_runs("officeDocument")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_output.py", line 120, in _get_runs
content += file.content
^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_reader.py", line 246, in content
return self.get_content()
^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_reader.py", line 259, in get_content
return cast(TablesList, self.depth_collector.tree)
^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_reader.py", line 237, in depth_collector
self.__depth_collector = self.__depth_collector or new_depth_collector(self)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 384, in new_depth_collector
branches(root)
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
branches(branch)
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
branches(branch)
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 380, in branches
branches(branch)
[Previous line repeated 1 more time]
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 376, in branches
recurse_into_tree = tag_runner.open(tree)
^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 139, in open
return method(tree)
^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\docx_text.py", line 183, in _open_comment_range_start
self.tables.start_comment_range(tree.attrib[qn("w:id")])
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 127, in start_comment_range
cruns = self._count_runs()
^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 120, in _count_runs
return len(list(self._runs_so_far))
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\...\temp\.docx2python\Lib\site-packages\docx2python\depth_collector.py", line 108, in _runs_so_far
assert len(self.open_pars) == 1
AssertionError
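Since both tracebacks end in an uncaught AssertionError, a batch scan over many files may want to isolate failures per file so one problem document does not stop the run. The sketch below assumes the caller supplies its own extraction function; `scan_files` and the `extract` parameter are hypothetical names, not docx2python API.

```python
from typing import Callable, Dict, Iterable, Tuple


def scan_files(
    paths: Iterable[str],
    extract: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run extract(path) on every path, collecting results and failures separately."""
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # AssertionError is a subclass of Exception
            # Record the error type and message so the file can be reported later.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

The failures dict gives a list of files worth reporting upstream, ideally with a sanitized sample when the original cannot be shared.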
Both of these should be fixed now. Please let me know if you see the same failures with 2.9.2.
That did it. Both text and comments properly parsed in 2.9.2. Thanks!
Glad to hear it. I'm moving this to PyPI main.