eeditiones / tei-publisher-app

The main TEI Publisher app
https://teipublisher.com
GNU General Public License v3.0

Word Document Upload with SDT missing data #130

Open phollott opened 2 years ago

phollott commented 2 years ago

I am using a TEI Publisher application to upload and convert DOCX files, but when the source document contains structured document tags (which some of my source documents do), the text within the tags is missing in the TEI that is generated.

To reproduce:

If you upload the attached document into TEI Publisher, the text "TEST1" and "TEST2" should appear in the resulting TEI, but both are missing because they are embedded within structured document tags in Word, in a table cell and in a paragraph, respectively.

Thank you for any light you might be able to shed on this. I suspect the fix would be an additional conditional pathway in transform/docx-tei.xql that transcludes any w:sdt elements in the document XML, or something like that, but I have been unable to figure out how to make this work.

sdt-test.docx
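
To check whether a given DOCX is affected, the text held inside structured document tags can be listed directly from the package. The following is a minimal Python sketch, not part of TEI Publisher: the helper names `read_document_xml` and `sdt_texts` are hypothetical, and the `SAMPLE` markup is an assumed reconstruction of the two cases described above (an SDT wrapping a paragraph in a table cell, and an SDT wrapping a run inside a paragraph), not the actual content of sdt-test.docx.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(docx_path):
    """Return the raw word/document.xml part of a DOCX (a ZIP archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def sdt_texts(document_xml):
    """List the text inside each w:sdt element, in document order.

    Nested SDTs are reported once per level, so their text can repeat.
    """
    root = ET.fromstring(document_xml)
    return [
        "".join(t.text or "" for t in sdt.iterfind(".//w:t", NS))
        for sdt in root.iterfind(".//w:sdt", NS)
    ]


# Assumed structure: TEST1 in a block-level SDT inside a table cell,
# TEST2 in a run-level SDT inside a paragraph.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
  <w:tbl><w:tr><w:tc>
    <w:sdt><w:sdtPr/><w:sdtContent>
      <w:p><w:r><w:t>TEST1</w:t></w:r></w:p>
    </w:sdtContent></w:sdt>
  </w:tc></w:tr></w:tbl>
  <w:p>
    <w:sdt><w:sdtPr/><w:sdtContent>
      <w:r><w:t>TEST2</w:t></w:r>
    </w:sdtContent></w:sdt>
  </w:p>
</w:body></w:document>"""

print(sdt_texts(SAMPLE))  # ['TEST1', 'TEST2']
```

For a real file, `sdt_texts(read_document_xml("sdt-test.docx"))` would list whatever text sits inside its SDTs, which can then be compared against the generated TEI.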

phollott commented 2 years ago

I found a solution that works well enough:

  1. Edited docx.odd in the ODD Editor to add a new element for sdt, which constructs a paragraph with content based on descendant::r when the sdt block has parent::p
  2. Edited docx.odd in the ODD Editor to modify cell so it pulls content from descendant-or-self::p instead of just p

These changes work for what I am trying to do so far. I would have preferred a solution with just two models for sdt, but I struggled to get the model for cell to work. It's a work in progress, and SDTs in Word documents are a pain, but some of my use cases require them.
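
The reason both changes help can be illustrated outside of ODD. The sketch below is Python rather than the actual docx.odd models, and it assumes the SDT sits between the cell and its paragraph (and between the paragraph and its run); `texts`, `CELL`, and `PARA` are hypothetical names made up for the illustration. It compares a child-only lookup with a descendant lookup, which is roughly the difference between `p` and `descendant-or-self::p` in the cell model, and between direct runs and `descendant::r` in the paragraph case.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Assumed markup: the SDT wrapper sits between w:tc and w:p ...
CELL = f"""<w:tc xmlns:w="{W}">
  <w:sdt><w:sdtContent>
    <w:p><w:r><w:t>TEST1</w:t></w:r></w:p>
  </w:sdtContent></w:sdt>
</w:tc>"""

# ... and between w:p and w:r.
PARA = f"""<w:p xmlns:w="{W}">
  <w:sdt><w:sdtContent>
    <w:r><w:t>TEST2</w:t></w:r>
  </w:sdtContent></w:sdt>
</w:p>"""


def texts(elements):
    """Concatenate the w:t text found under each element."""
    return ["".join(t.text or "" for t in e.iterfind(".//w:t", NS))
            for e in elements]


cell = ET.fromstring(CELL)
para = ET.fromstring(PARA)

# Child-only selection skips anything wrapped in w:sdt/w:sdtContent.
cell_children = texts(cell.findall("w:p", NS))
para_children = texts(para.findall("w:r", NS))

# Descendant selection reaches through the wrapper.
cell_descendants = texts(cell.findall(".//w:p", NS))
para_descendants = texts(para.findall(".//w:r", NS))

print(cell_children, cell_descendants)  # [] ['TEST1']
print(para_children, para_descendants)  # [] ['TEST2']
```

One caveat of a descendant lookup is that it also picks up paragraphs of any table nested inside the cell, so a real model may need a narrower path.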

tuurma commented 1 year ago

@phollott would you consider a PR with your extension?

phollott commented 1 year ago

@tuurma perhaps... I have some changes that may be useful for others during DOCX to TEI conversion. The project I have been working on involves things like conversion of subscript and superscript, working around Word SDT (which is a pain), and a number of other features, which I might be able to include in a pull request.