CenterForOpenScience / pydocx

An extendable docx file format parser and converter

Hyperlink imports as strong tag instead of anchor tag #201

Open winhamwr opened 8 years ago

winhamwr commented 8 years ago

I'm unsure of the exact cause, but the attached .docx has a hyperlink around the text http://translate.google.com/#. When the file is run through pydocx, the resulting HTML wraps that text in strong tags instead of an anchor tag.

Interestingly, if I open the file in Open Office and save it again, the internal structure changes, and running the resulting file through pydocx produces the correct behavior.

hyperlink_did_not_translate.docx

kylegibson commented 8 years ago

Ugh. This is because the instrText is spread out over several nodes. I made the assumption that this would not happen, because it's silly:

      <w:r>
        <w:instrText xml:space="preserve"> HYPERLINK "</w:instrText>
      </w:r>
      <w:r w:rsidRPr="00710528">
        <w:instrText>http://translate.google.com/#</w:instrText>
      </w:r>
      <w:r>
        <w:instrText xml:space="preserve">" </w:instrText>
      </w:r>

PyDocX only handles the instrText HYPERLINK if it is formatted like this:

      <w:r>
        <w:instrText xml:space="preserve"> HYPERLINK "http://translate.google.com/#"</w:instrText>
      </w:r>

I suspect it's happening because of the # in the URL. Maybe Word sees this as a special character.
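One possible way to tolerate the split form is to join the text of every instrText node in document order before looking for the HYPERLINK instruction. The following is only a minimal sketch using the standard library's ElementTree, not PyDocX's actual code: the `hyperlink_target` helper, the regular expression, and the wrapping `w:p` element are assumptions made for illustration, and it ignores the surrounding fldChar markers that delimit a real field.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w" prefix in .docx documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The split form from the attached document, wrapped in a paragraph
# element (the wrapper is an assumption so the fragment parses alone).
SPLIT_RUNS = f"""
<w:p xmlns:w="{W_NS}">
  <w:r>
    <w:instrText xml:space="preserve"> HYPERLINK "</w:instrText>
  </w:r>
  <w:r w:rsidRPr="00710528">
    <w:instrText>http://translate.google.com/#</w:instrText>
  </w:r>
  <w:r>
    <w:instrText xml:space="preserve">" </w:instrText>
  </w:r>
</w:p>
""".strip()

# Hypothetical pattern: optional whitespace, HYPERLINK, then a quoted target.
HYPERLINK_RE = re.compile(r'^\s*HYPERLINK\s+"([^"]*)"')


def hyperlink_target(paragraph):
    """Return the HYPERLINK target from the joined instrText, or None."""
    # Concatenate instruction text across runs, in document order.
    instruction = "".join(
        node.text or "" for node in paragraph.iter(f"{{{W_NS}}}instrText")
    )
    match = HYPERLINK_RE.match(instruction)
    return match.group(1) if match else None


print(hyperlink_target(ET.fromstring(SPLIT_RUNS)))
```

Because the text is joined before matching, the same helper handles both the single-node form and the form split over several runs.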