ShayHill / docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
https://docx2python.readthedocs.io/en/latest/
MIT License
167 stars 36 forks

Find and Replace (XML) + XML element creation #51

Closed amastis closed 8 months ago

amastis commented 8 months ago

General Issue:

I would like to replace text in a footnote (currently the plain-text number of another footnote, not an XML reference) with a cross-reference to that other footnote.

Problem Steps

  1. Search footnotes for the text (such as with .replace_root_text). However, that function:
    1. has no XML insert support (the new :param: is a str) [I have tried inserting a str representation of what a reference would look like]
      1. (this assumes the parent element is where the insertion happens; if the insertion belongs in the grandparent element, how would the XML object have to be changed / inserted properly?)
    2. cannot selectively insert text based on a pattern
  2. Get the reference object to another footnote (possibly by implementing a find_element dict that holds references to already referenceable types [footnotes, endnotes, etc.])
  3. Create an XML object that references another footnote (how could we do this programmatically?)
  4. Insert based on the pattern

What I Tried

Method: search for 2323 in a footnote and replace that text with a reference to footnote 23 in my document (I pulled the XML below from the file that was created when I manually made a cross-reference).

Partial idea from https://github.com/python-openxml/python-docx/issues/359

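from docx2python import docx_reader, utilities

# file_path is the path to the .docx file being edited
docx_temp = docx_reader.DocxReader(file_path)
print(docx_temp.files_of_type('footnotes')[0])
# raw string so that the backslash in "\h" is kept literally
test_link = r'''<w:r w:rsidR="0050576E">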
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:fldChar w:fldCharType="begin"/>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:instrText xml:space="preserve"> NOTEREF _Ref162052430 \h </w:instrText>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:fldChar w:fldCharType="separate"/>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:t>23</w:t>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:fldChar w:fldCharType="end"/>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:fldChar w:fldCharType="begin"/>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:instrText xml:space="preserve"> NOTEREF _Ref162052430 \h </w:instrText>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:fldChar w:fldCharType="separate"/>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:t>23</w:t>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:fldChar w:fldCharType="end"/>
        </w:r>
        <w:r w:rsidR="0050576E">
            <w:rPr>
                <w:szCs w:val="24"/>
            </w:rPr>
            <w:t xml:space="preserve"></w:t>
        </w:r>'''

# Strip tabs and newlines in case the string representation would not be interpreted properly when inserted
utilities.replace_root_text(docx_temp.files_of_type('footnotes')[0].root_element, '2323', test_link.replace('\t', '').replace('\n', ''))
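
Editor's sketch: the field above is a run sequence of begin, instruction, separate, displayed text, and end, so it could be built as real elements rather than as a string. The example below is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, the helper names `w`, `_run`, and `build_noteref_runs` are hypothetical, and the rPr/rsidR formatting attributes are left out.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Return the namespaced (Clark notation) name for a w: tag."""
    return f"{{{W}}}{tag}"


def _run(child: ET.Element) -> ET.Element:
    """Wrap a single child element in a new w:r run (hypothetical helper)."""
    run = ET.Element(w("r"))
    run.append(child)
    return run


def build_noteref_runs(bookmark_name: str, display_text: str) -> list:
    """Build the five runs of a NOTEREF field pointing at bookmark_name."""

    def fld_char(kind: str) -> ET.Element:
        return ET.Element(w("fldChar"), {w("fldCharType"): kind})

    instr = ET.Element(w("instrText"), {XML_SPACE: "preserve"})
    instr.text = f" NOTEREF {bookmark_name} \\h "
    shown = ET.Element(w("t"))
    shown.text = display_text
    parts = (fld_char("begin"), instr, fld_char("separate"), shown, fld_char("end"))
    return [_run(part) for part in parts]


runs = build_noteref_runs("_Ref162052430", "23")
```

The resulting list could then be spliced into a parent element in place of the matched text, which a plain string argument cannot do.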

Do you have any advice on what to do differently?

ShayHill commented 8 months ago

I think you're on the right track. If you open the file as a DocxReader instance and find the rels file through the footnote instances in the officeDocument file, you can continue from there with the lxml interface. lxml will allow you to create and insert new elements (perhaps you'd want to copy an existing element and then edit it).
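
Editor's sketch of the "copy an existing element, then edit it" suggestion. This is not code from docx2python: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the helper `insert_copy_after` and the sample paragraph are hypothetical. With lxml, `elem.getparent()` would supply the parent instead of passing it in.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Return the namespaced (Clark notation) name for a w: tag."""
    return f"{{{W}}}{tag}"


def insert_copy_after(parent: ET.Element, run: ET.Element, new_text: str) -> ET.Element:
    """Deep-copy run, set the text of its w:t, and insert the copy after run."""
    clone = copy.deepcopy(run)  # keeps w:rPr, so the formatting carries over
    text_elem = clone.find(w("t"))
    if text_elem is None:
        text_elem = ET.SubElement(clone, w("t"))
    text_elem.text = new_text
    parent.insert(list(parent).index(run) + 1, clone)
    return clone


# A made-up paragraph with one formatted run
paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:r><w:rPr><w:szCs w:val="24"/></w:rPr>'
    "<w:t>see note </w:t></w:r></w:p>"
)
first_run = paragraph.find(w("r"))
new_run = insert_copy_after(paragraph, first_run, "23")
```

Copying an existing run rather than building one from scratch means the new run inherits the neighbouring formatting properties.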

amastis commented 8 months ago

Request - formatting gone crazy

Using a modified version of .replace_root_text, all of the formatting in my document goes crazy (text goes into superscript and some earlier formatting changes are nullified). See the modified function below.

@ShayHill What am I doing that causes this, or is this a known side effect of updating the XML?

def split_text(root: EtreeElement, split_text: str, position: int) -> list[EtreeElement]:
    text = root.text.split(split_text)[position]
    new_elems = [_copy_new_text(root, line) for line in text.splitlines()]

    # insert breakpoints where line breaks were
    breaks = [etree.Element(Tags.BR) for _ in new_elems]
    return [x for pair in zip(new_elems, breaks) for x in pair][:-1]

def replace_root_text(root: EtreeElement, old: str, new: str) -> None:
    """Replace :old: with :new: in all descendants of :root:

    :param root: an etree element presumably containing descendant text elements
    :param old: text to be replaced
    :param new: replacement text

    Will use softbreaks <br> to preserve line breaks in replacement text.
    """

    def recursive_text_replace(branch: EtreeElement):
        """Replace any text element containing old with one or more elements.

        :param branch: an etree element
        """
        for elem in tuple(branch):
            if not elem.text or old not in elem.text:
                recursive_text_replace(elem)
                continue

            # split the text into two elements (based on the position of the old text)
            left_side = split_text(elem, old, 0)
            right_side = split_text(elem, old, -1)

            # replace the original element with the new elements
            parent = elem.getparent()
            assert parent is not None
            index = parent.index(elem)

            parent[index : index + 1] = [left_side[0], right_side[0]]

    recursive_text_replace(root)

Progress on Footnote References

Investigating further, it looks like there are two reference spots: one in the individual footnote (the one I am trying to create above), and a second in the document body itself (usually located before the footnote reference), where there is a newly created <w:bookmarkStart w:id="21" w:name="_Ref162052430"/> that pairs with the "_Ref162052430" reference in the footnote.

I have a way to create the footnote reference and am now looking to find and place the corresponding document reference. (I will update when I have a working version, but I am slightly limited by the issue described at the top of this comment.)
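
Editor's sketch of placing the document-side bookmark described above. It assumes the bookmark should wrap the run that holds the footnote reference and that a matching w:bookmarkEnd with the same w:id closes it. The standard library's xml.etree.ElementTree stands in for lxml, and `add_bookmark` and the sample paragraph are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Return the namespaced (Clark notation) name for a w: tag."""
    return f"{{{W}}}{tag}"


def add_bookmark(parent: ET.Element, target: ET.Element, bookmark_id: int, name: str):
    """Insert a bookmarkStart/bookmarkEnd pair around target inside parent."""
    start = ET.Element(
        w("bookmarkStart"), {w("id"): str(bookmark_id), w("name"): name}
    )
    end = ET.Element(w("bookmarkEnd"), {w("id"): str(bookmark_id)})
    index = list(parent).index(target)
    parent.insert(index, start)    # target moves to index + 1
    parent.insert(index + 2, end)  # directly after target
    return start, end


# A made-up body paragraph containing one footnote reference run
paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:r><w:footnoteReference w:id="23"/></w:r></w:p>'
)
ref_run = paragraph.find(w("r"))
start, end = add_bookmark(paragraph, ref_run, 21, "_Ref162052430")
```

The bookmark id has to be unique within the document, so a real implementation would first scan existing bookmarkStart elements for the ids already in use.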

ShayHill commented 8 months ago

It looks like you're throwing out a lot of text if split_text appears more than once in root.text. That would definitely garble something. From the code, it seems you're assuming text = root.text.split(split_text)[position] will always split the root text into exactly two pieces, because you only look at positions 0 and -1 in replace_root_text.
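
Editor's sketch of the behaviour described above, using a made-up sample string. When the search text occurs twice, `str.split` returns three pieces, and keeping only the first and last drops the middle one. `str.partition`, shown as one possible alternative, splits only at the first occurrence and keeps the remainder intact.

```python
text = "see 23, then 23 again"
old = "23"

# split() returns one piece per gap between occurrences
pieces = text.split(old)

# Keeping only positions 0 and -1 silently discards the middle piece
left, right = pieces[0], pieces[-1]
lost = pieces[1:-1]

# partition() splits at the first occurrence only, so nothing is lost
first_left, matched, remainder = text.partition(old)
```

Repeating `partition` on the remainder would handle every occurrence in turn without discarding text.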