Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

docx: partitioner finds text nested in revision-marks #1821

Open scanny opened 1 year ago

scanny commented 1 year ago

Currently DOCX content nested in revision-marks is skipped when partitioning a .docx file.

Add an "accept-all-revisions" step before partitioning to bring the document to the state most likely intended by the author, such that inserted or modified text is included and deleted text is not.

Additional context

Microsoft Word has features that support document review and revision. An author can turn on the "Track Changes" option, send the document to an editor (person) and then any changes made by the editor are clearly marked as suggested revisions. The revisions can be accepted or rejected individually or as a group.

These revisions, when not yet accepted, cause the affected text to be "nested" in revision-mark elements like <w:ins> and <w:del> in the document XML. python-docx then skips that text because it sits "beneath" the level where it goes looking for paragraphs etc. Further, the expected behavior is not immediately obvious: simply including all of that text would surface deletions as well as insertions, and moved text could appear twice or in the wrong location.
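To make the nesting concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for python-docx. The paragraph XML is hand-written for illustration, and the "direct children only" lookup is an assumption that approximates the behavior described above rather than python-docx's actual code:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written paragraph: plain run, tracked deletion, tracked insertion, plain run.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">The </w:t></w:r>
  <w:del w:id="1" w:author="editor"><w:r><w:delText xml:space="preserve">quick </w:delText></w:r></w:del>
  <w:ins w:id="2" w:author="editor"><w:r><w:t xml:space="preserve">slow </w:t></w:r></w:ins>
  <w:r><w:t>fox</w:t></w:r>
</w:p>
"""

p = ET.fromstring(PARAGRAPH_XML.strip())

# Only runs that are direct children of the paragraph: the inserted run is missed.
direct_text = "".join(t.text for t in p.findall("./w:r/w:t", NS))

# Every w:t at any depth: the run nested inside w:ins is found as well.
nested_text = "".join(t.text for t in p.iter(f"{{{W}}}t"))

print(repr(direct_text))
print(repr(nested_text))
```

The deleted text lives in w:delText rather than w:t, so neither lookup picks it up; the difference between the two results is the inserted run.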

The common solution to this problem is to add an "Accept all revisions" step before processing which removes all revision mark "container" envelopes, adding text in w:ins(ert) elements and removing text in w:del(ete) elements, etc. This is a reasonable assumption of the author's intent because by default this is how the text in the document appears if you forget to accept revisions and turn off the Track-Changes option.

MoritzImendoerffer commented 2 months ago

Hi Scanny, I think it would be best if this feature is implemented in docx itself. Do you agree? However, I am wondering why this feature is not implemented yet. It seems rather simple. Or am I too naive?

MoritzImendoerffer commented 1 month ago

This solution works for me:

import docx

WORD_NAMESPACE = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
INS_TAG = f"{{{WORD_NAMESPACE['w']}}}ins"
DEL_TAG = f"{{{WORD_NAMESPACE['w']}}}del"

def accept_all_revisions_in_doc(doc):
    """
    Accept all revisions (track changes) in the docx document by searching for `w:ins`
    (inserted elements) and `w:del` (deleted elements) in the entire XML tree,
    including headers and footers.

    :param doc: A docx document object to process.
    """
    # main body only
    _process_element(doc.element)
    # each section separately (e.g. headers and footers)
    for section in doc.sections:
        _process_element(section.header._element)
        _process_element(section.footer._element)

    _process_footnotes_endnotes(doc)
    _process_comments(doc)
    _process_textboxes(doc)

def _process_element(element):
    """Process any XML element to accept insertions and remove deletions."""
    _accept_all_insertions(element)
    _remove_all_deletions(element)

def _accept_all_insertions(element):
    """Accept all inserted content by replacing each `w:ins` wrapper with its children."""
    for ins in element.findall(f".//{INS_TAG}"):
        parent = ins.getparent()
        # Copy the child list first: moving children while iterating `ins` skips elements.
        for child in list(ins):
            parent.insert(parent.index(ins), child)
        parent.remove(ins)

def _remove_all_deletions(element):
    """Remove all deleted content in the document by removing `w:del` elements."""
    for deletion in element.findall(f".//{DEL_TAG}"):
        deletion.getparent().remove(deletion)

REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

def _related_xml_element(doc, reltype):
    """Return the root XML element of the part related by `reltype`, or None.

    `related_parts` is keyed by relationship id (e.g. "rId3"), not by relationship
    type, so walk the relationships and match on `reltype` instead.
    """
    for rel in doc.part.rels.values():
        if not rel.is_external and rel.reltype == reltype:
            # Parts that python-docx loads as a generic `Part` expose only `.blob`,
            # not `.element`; those are skipped here rather than guessed at.
            return getattr(rel.target_part, "element", None)
    return None

def _process_footnotes_endnotes(doc):
    """Process footnotes and endnotes in the document."""
    for name in ("footnotes", "endnotes"):
        notes_xml = _related_xml_element(doc, f"{REL_NS}/{name}")
        if notes_xml is not None:
            _process_element(notes_xml)

def _process_comments(doc):
    """Process comments in the document."""
    comments_xml = _related_xml_element(doc, f"{REL_NS}/comments")
    if comments_xml is not None:
        _process_element(comments_xml)

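def _process_textboxes(doc):
    """Process text-box content (`w:txbxContent`) in the main document part."""
    # Text-box paragraphs live in `w:txbxContent`; there is no `w:textbox` element.
    # They are descendants of `doc.element`, so the pass over the document element
    # already reaches them; this pass is kept only for explicitness.
    for content in doc.element.findall(f'.//{{{WORD_NAMESPACE["w"]}}}txbxContent'):
        _process_element(content)

MoritzImendoerffer commented 1 month ago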

I added a docx_helpers module in utils and call it when the document is loaded:

    @lazyproperty
    def document(self) -> Document:
        """The python-docx `Document` object loaded from file or filename."""
        _file = docx.Document(self._docx_file)
        if self._accept_changes:
            docx_helpers.accept_all_revisions_in_doc(_file)
        return _file

A test could look like:

def test_partition_docx_accept_track_changes():
    mock_document_file_path = example_doc_path("docx-tables_with_track_changes.docx")
    elements = partition_docx(mock_document_file_path)
    text = " ".join(item.text for item in elements)
    # Assert on the result of `all(...)`; a bare `assert all` only tests the builtin
    # function object, which is always truthy.
    assert all(
        (
            "HeaderWithTrackChanges" in text,
            "CellWithTrackChanges" in text,
            "TextWithTrackChanges" in text,
            "FooterWithTrackChanges" in text,
        )
    )

I generated a docx file with track changes:

docx-tables_with_track_changes.docx

MoritzImendoerffer commented 1 month ago

@scanny Any feedback would be great.

scanny commented 1 month ago

@MoritzImendoerffer that looks like the idea. I agree it would be best for this to go upstream in python-docx, perhaps as a Document.accept_all_revisions() method that would encapsulate the details.

I think I would use XPath rather than .findall() as I believe it performs better. So finding w:ins elements in the main document body might be ins_elms = body_elm.xpath(".//w:ins").

I'd also like to see some narrative analysis of where these elements can appear, as we'd want to avoid searching for them except where we could possibly find them. Performance is a consideration because we'd need to run this on all documents before processing. In particular, "//" segments in the XPath are very expensive because they cause the search to visit every descendant. Something like ins_elms = body.xpath("./w:ins | ./w:p/w:ins") for example would be much snappier. A study of the XML schema would reveal those possible locations, and some profiling on a very large document would give a sense of performance and where to balance detail in the XPath expression without sacrificing completeness.
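The completeness trade-off mentioned here can be shown with a small sketch. It uses the standard library's ElementTree path syntax as a stand-in for lxml's .xpath() (ElementTree has no "|" union, so each path is queried separately), and the body XML and the ins_ids helper are made up for illustration. The table-cell path is one example of a location a narrow expression has to list explicitly; it is not a complete inventory from the schema.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written body: one insertion in a body paragraph, one in a table cell.
BODY_XML = f"""<w:body xmlns:w="{W}">
  <w:p><w:ins w:id="1"><w:r><w:t>in a body paragraph</w:t></w:r></w:ins></w:p>
  <w:tbl><w:tr><w:tc>
    <w:p><w:ins w:id="2"><w:r><w:t>in a table cell</w:t></w:r></w:ins></w:p>
  </w:tc></w:tr></w:tbl>
</w:body>"""

body = ET.fromstring(BODY_XML)


def ins_ids(paths):
    """Collect the w:id of every w:ins matched by any of `paths`, in query order."""
    return [
        ins.get(f"{{{W}}}id")
        for path in paths
        for ins in body.findall(path, NS)
    ]


# Descendant search: complete, but visits every element under the body.
everywhere = ins_ids([".//w:ins"])

# Narrow paths from the comment above: cheaper, but miss the table cell.
narrow = ins_ids(["./w:ins", "./w:p/w:ins"])

# Adding the table-cell location restores completeness for this document.
with_tables = ins_ids(["./w:ins", "./w:p/w:ins", "./w:tbl/w:tr/w:tc/w:p/w:ins"])

print(everywhere, narrow, with_tables)
```

A check like this against the descendant search, run over a corpus of real documents, is one way to confirm that a narrower expression has not dropped any locations before relying on it.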

But if this works then I'd say the rest is in the category of polishing. Thanks for working this through :)