dt-woods / word

Concatenate and parse Microsoft Word (.docx) files with style! A Pythonic method for splitting, merging, and styling MS Word docs.

Include images #6

Closed dt-woods closed 3 years ago

dt-woods commented 3 years ago

Include embedded images found in document paragraphs in merge and parse.

dt-woods commented 3 years ago

The Structure of an OpenXML Document

The document has a root element, document, with a main-story container called the body. The body stores block-level containers such as paragraphs; each paragraph holds runs, and each run holds text. These elements can be identified using the tags below.

Tag            Description
<w:p>          Begin paragraph element
</w:p>         End paragraph element
<w:r>          Begin run element
</w:r>         End run element
<w:t>          Begin text element
</w:t>         End text element
<w:hyperlink>  Begin hyperlink
<w:rPr>        Begin run properties
<w:pPr>        Begin paragraph properties
<w:drawing>    Begin drawing element

Note that the paragraphs are assigned identifiers in their paragraph tag, e.g., <w:p w14:paraId="asdf3920">.
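
As a rough illustration of the structure described above, the sketch below walks the paragraphs, runs, and text elements with the standard-library `xml.etree.ElementTree`. The `SAMPLE` markup and the `summarize_paragraphs` helper are hypothetical and are not part of this package; the second `paraId` value is made up for the example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, and the Word 2010 extension namespace
# that carries the paraId attribute.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"

# Hypothetical, minimal document.xml used only for illustration.
SAMPLE = f"""<w:document xmlns:w="{W}" xmlns:w14="{W14}">
  <w:body>
    <w:p w14:paraId="asdf3920">
      <w:r><w:t>Hello, </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r>
    </w:p>
    <w:p w14:paraId="asdf3921">
      <w:r><w:drawing/></w:r>
    </w:p>
  </w:body>
</w:document>"""


def summarize_paragraphs(xml_text):
    """Return (paraId, text, has_drawing) for each paragraph in the document."""
    root = ET.fromstring(xml_text)
    rows = []
    for p in root.iter(f"{{{W}}}p"):
        # Namespaced attributes use the same {uri}name form as tags.
        para_id = p.get(f"{{{W14}}}paraId")
        # Join the text of every <w:t> inside the paragraph's runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        has_drawing = p.find(f".//{{{W}}}drawing") is not None
        rows.append((para_id, text, has_drawing))
    return rows


for row in summarize_paragraphs(SAMPLE):
    print(row)
```

Paragraphs written by tools that do not emit `w14:paraId` would yield `None` for the identifier, so it is probably safer not to rely on it as a key.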


dt-woods commented 3 years ago

Example run with drawing:

<w:r>
  <w:rPr><w:noProof/></w:rPr>
  <w:drawing>
    <wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="64B26DF1" wp14:editId="24654627">
      <wp:extent cx="5943600" cy="1858010"/>
      <wp:effectExtent l="0" t="0" r="0" b="8890"/>
      <wp:docPr id="3" name="Picture 3" descr="A grassy field with hills in the background&#xA;&#xA;Description automatically generated with low confidence"/>
      <wp:cNvGraphicFramePr>
        <a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/>
      </wp:cNvGraphicFramePr>
      <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
        <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
          <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:nvPicPr>
              <pic:cNvPr id="3" name="Picture 3" descr="A grassy field with hills in the background&#xA;&#xA;Description automatically generated with low confidence"/>
              <pic:cNvPicPr>
                <a:picLocks noChangeAspect="1" noChangeArrowheads="1"/>
              </pic:cNvPicPr>
            </pic:nvPicPr>
            <pic:blipFill>
              <a:blip r:embed="rId7">
                <a:extLst>
                  <a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}">
                    <a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/>
                  </a:ext>
                </a:extLst>
              </a:blip>
              <a:srcRect/>
              <a:stretch><a:fillRect/></a:stretch>
            </pic:blipFill>
            <pic:spPr bwMode="auto">
              <a:xfrm>
                <a:off x="0" y="0"/>
                <a:ext cx="5943600" cy="1858010"/>
              </a:xfrm>
              <a:prstGeom prst="rect"><a:avLst/></a:prstGeom>
              <a:noFill/>
              <a:ln><a:noFill/></a:ln>
            </pic:spPr>
          </pic:pic>
        </a:graphicData>
      </a:graphic>
    </wp:inline>
  </w:drawing>
</w:r>

dt-woods commented 3 years ago

Example document.xml.rels

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://commons.wikimedia.org/wiki/File:Button_w_no_arrow2.png" TargetMode="External"/>
  <Relationship Id="rId13" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://creativecommons.org/licenses/by-sa/4.0/deed.en" TargetMode="External"/>
  <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image3.jpeg"/>
  <Relationship Id="rId12" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://commons.wikimedia.org/wiki/File:Berlou_panoramic_01.jpg" TargetMode="External"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image2.png"/>
  <Relationship Id="rId11" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://creativecommons.org/licenses/by-sa/4.0/deed.en" TargetMode="External"/>
  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
  <Relationship Id="rId15" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
  <Relationship Id="rId10" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://commons.wikimedia.org/wiki/File:Bw_copy_icon_32x32.png" TargetMode="External"/>
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
  <Relationship Id="rId9" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://creativecommons.org/licenses/by-sa/2.5/deed.en" TargetMode="External"/>
  <Relationship Id="rId14" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
</Relationships>
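
A minimal sketch of how a relationships part like the one above might be read in Python, keeping only the image entries. The `RELS` string is a trimmed copy of the example, and `image_targets` is a hypothetical helper name, not something this package provides.

```python
import xml.etree.ElementTree as ET

PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
IMAGE_TYPE = f"{OFFICE_REL}/image"

# Trimmed copy of the example document.xml.rels (XML declaration omitted).
RELS = f"""<Relationships xmlns="{PKG_REL}">
  <Relationship Id="rId8" Type="{OFFICE_REL}/hyperlink" Target="https://commons.wikimedia.org/wiki/File:Button_w_no_arrow2.png" TargetMode="External"/>
  <Relationship Id="rId3" Type="{OFFICE_REL}/settings" Target="settings.xml"/>
  <Relationship Id="rId7" Type="{OFFICE_REL}/image" Target="media/image3.jpeg"/>
  <Relationship Id="rId6" Type="{OFFICE_REL}/image" Target="media/image2.png"/>
  <Relationship Id="rId5" Type="{OFFICE_REL}/image" Target="media/image1.png"/>
</Relationships>"""


def image_targets(rels_xml):
    """Map relationship Id -> Target for image relationships only."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(f"{{{PKG_REL}}}Relationship")
        if rel.get("Type") == IMAGE_TYPE
    }


print(image_targets(RELS))
```

Filtering on `Type` rather than on the file extension of `Target` avoids picking up hyperlinks that happen to point at image URLs, such as `rId8` here.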
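
dt-woods commented 3 years ago

The link appears to be the relationship ID: the r:embed attribute of <a:blip r:embed="rId7"> in document.xml matches the Id attribute of <Relationship Id="rId7" ...> in document.xml.rels, whose Target (media/image3.jpeg) points to the image file.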
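
If that reading is right, resolving an embedded image could look roughly like the sketch below, which uses only the standard library. It assumes the main part lives at `word/document.xml` (conventional, but not guaranteed) and that relationship targets are relative to the `word/` folder. The helper names `embedded_images` and `make_demo_docx` are hypothetical, and the demo archive contains placeholder bytes rather than a real JPEG.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def embedded_images(docx):
    """Yield (rId, part name, bytes) for each <a:blip r:embed="..."> found.

    `docx` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        # Skip external targets (e.g. hyperlinks); they are not in the archive.
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels.iter(f"{{{PKG_REL}}}Relationship")
            if rel.get("TargetMode") != "External"
        }
        doc = ET.fromstring(zf.read("word/document.xml"))
        for blip in doc.iter(f"{{{A}}}blip"):
            r_id = blip.get(f"{{{R}}}embed")
            if r_id in targets:
                # Resolve the target against the folder of the source part.
                joined = posixpath.join("word", targets[r_id])
                part = posixpath.normpath(joined).lstrip("/")
                yield r_id, part, zf.read(part)


def make_demo_docx():
    """Build a tiny in-memory archive with just the parts used above."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "word/document.xml",
            f'<w:document xmlns:w="{W}" xmlns:a="{A}" xmlns:r="{R}">'
            '<w:body><w:p><w:r><w:drawing><a:blip r:embed="rId7"/>'
            "</w:drawing></w:r></w:p></w:body></w:document>",
        )
        zf.writestr(
            "word/_rels/document.xml.rels",
            f'<Relationships xmlns="{PKG_REL}">'
            f'<Relationship Id="rId7" Type="{R}/image" '
            'Target="media/image3.jpeg"/></Relationships>',
        )
        zf.writestr("word/media/image3.jpeg", b"not a real JPEG")
    buf.seek(0)
    return buf


for r_id, part, data in embedded_images(make_demo_docx()):
    print(r_id, part, len(data))
```

When merging documents, the relationship IDs of the two sources can collide, so a merge would likely need to copy the media part under a fresh name and rewrite `r:embed` to a newly allocated ID in the destination's relationships part.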