computerline1z / okapi

Automatically exported from code.google.com/p/okapi

Invalid DOCX files created with Moses InlineText tag rearranging and round-trip #176

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Steps to reproduce:
1. Create a new Word document in Microsoft Word 2007 with the text 
"This is page ."
2. Position the cursor before the period and choose Insert/Page Number/Current 
Position/Plain Number: the number 1 is inserted
3. Save the document as test.docx
4. tikal.bat -xm test.docx -sl en

The first line of the resulting test.docx.en is:
This is page <x id="1"/><g id="2">1</g><x id="3"/>.

5. Edit the first line to read:
This is page <x id="3"/><x id="1"/><g id="2">1</g>.
6. Save as test.docx.fr
7. tikal.bat -lm test.docx -totrg -from test.docx.fr
8. Open the resulting test.out.docx in Microsoft Word

Result:
Word cannot open the file: "The file test.out.docx cannot be opened because 
there are problems with the contents."
Details:
"The name in the end tag of the element must match the element type in the 
start tag." Location: Part: /word/document.xml [...]

Remark:
This kind of tag rearranging, while somewhat nonsensical in this example,
happens often in longer segments during translation or machine translation.

Analysis of DOCX XML:
9. Extract the contents of test.out.docx with an archive tool (e.g. 7-Zip)
10. View the extracted file test.out.docx/word/document.xml

Invalid XML: the closing tag </w:fldSimple> appears before the opening tag
<w:fldSimple ...>
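
The check in steps 9 and 10 can also be scripted. The following is a minimal Python sketch, not part of Okapi or Tikal: the function name check_docx_part is made up for illustration, and it relies only on the standard zipfile and xml.etree.ElementTree modules to report the first parse error found in word/document.xml.

```python
import zipfile
import xml.etree.ElementTree as ET


def check_docx_part(docx, part="word/document.xml"):
    """Return None if the part is well-formed XML, else the parser's error text.

    `docx` may be a file path or a binary file-like object.
    check_docx_part is a hypothetical helper, not an Okapi API.
    """
    # A .docx file is a ZIP archive; read the requested part as raw bytes.
    with zipfile.ZipFile(docx) as archive:
        data = archive.read(part)
    try:
        # fromstring() raises ParseError on mismatched or misordered tags.
        ET.fromstring(data)
    except ET.ParseError as err:
        return str(err)
    return None
```

Called as check_docx_part("test.out.docx") on the file produced in step 7, this should return an error message rather than None, since the end tag comes before its start tag.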

Original issue reported on code.google.com by Achi...@gmail.com on 4 Jul 2011 at 8:35

GoogleCodeExporter commented 9 years ago
I can reproduce the issue.
Extracting to XLIFF with <bpt>/<ept> shows the DOCX codes:

<ph id="1"><w:fldSimple w:instr=" PAGE \* MERGEFORMAT "></ph>
...
<ph id="3"></w:fldSimple></ph>

Ideally those two placeholders would be paired tags.
But that is difficult to achieve with Word.

Original comment by yves.sav...@gmail.com on 5 Jul 2011 at 3:16

GoogleCodeExporter commented 9 years ago
In this case the preferred behavior for me would be:
1. the filter warns that invalid XML is output
and 
2. the filter escapes or deletes the invalid XML (could be a filter option)
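
A warning of the kind described in point 1 could be sketched as an order check on the Moses InlineText line before merging. This is only an illustration in Python, not Okapi code: the function name misordered_pairs is hypothetical, and so is the assumption that the caller already knows which <x/> ids form an open/close pair (here 1 and 3, matching the <w:fldSimple> codes reported earlier in this issue).

```python
import re

# Matches standalone placeholders such as <x id="3"/> in a Moses InlineText line.
X_TAG = re.compile(r'<x id="(\d+)"/>')


def misordered_pairs(segment, pairs):
    """Return the (open_id, close_id) pairs whose closing code precedes the opening one.

    `pairs` is supplied by the caller; this sketch does not discover it.
    """
    # Record the character offset of each placeholder id found in the segment.
    positions = {int(m.group(1)): m.start() for m in X_TAG.finditer(segment)}
    bad = []
    for open_id, close_id in pairs:
        if open_id in positions and close_id in positions:
            if positions[close_id] < positions[open_id]:
                bad.append((open_id, close_id))
    return bad
```

With the edited line from step 5 and pairs=[(1, 3)], the function reports the pair as misordered; with the unedited line from step 4 it reports nothing.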

Original comment by Achi...@gmail.com on 5 Jul 2011 at 1:44

GoogleCodeExporter commented 9 years ago
I'm a user of Okapi and I had trouble opening a document (file.docx) in Microsoft Office 2007. The error message is: /word/document.xml line 6294 column 6293. The problem doesn't occur in OpenOffice, which opens file.docx without any error message.

tikal.sh -lm file.docx -totrg -from aftertest 

Original comment by bailo...@gmail.com on 22 Nov 2013 at 10:06
