computerline1z / okapi

Automatically exported from code.google.com/p/okapi

Open XML filter for Word doc generates [#$dp] segments prefixed with Text Units. #419

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The two Word files attached have exactly the same content, but they produce two 
different kinds of text unit content -

Example3.docx generates - 

[#$dp2]<w:r><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial"/></w:rPr><w:t 
xml:space="preserve">I am a simple text. What do you 
</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd 
w:id="0"/><w:r><w:rPr><w:rFonts w:ascii="Arial" 
w:hAnsi="Arial"/></w:rPr><w:t>think?</w:t></w:r>

This has a metadata marker related to a "Document Part" event prefixed to it.

The Open XML filter fails to merge it back correctly, and the translated 
document fails to open.

By contrast, the text unit generated for Example 2 does not have the 
"[#$dp2]" marker and works as expected.

Is there a particular order in which the filter parses the XML files inside a 
given .docx file?

Thanks

Original issue reported on code.google.com by 143.ravi...@gmail.com on 7 Oct 2014 at 2:30

Attachments:

GoogleCodeExporter commented 9 years ago
The above example in comment 1 was on Okapi version 0.24.

I updated Okapi to version 0.26 and parsed the attached file, input.docx.
The text content was pretty simple, with an embedded image.

The text units generated were -

[#$dp2][#$dp3]Some results may have been blocked under EU data protection law. 
</w:t></w:r>[#$dp4]Learn more</w:t></w:r>

Picture 1

<w:r><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" 
w:cs="Arial"/><w:b/><w:i/><w:sz w:val="22"/><w:szCs 
w:val="22"/></w:rPr><w:t>There is also a text after image</w:t></w:r>

Word complains about invalid content in the generated output file and fails to 
open it.

The platform is Microsoft Word for Mac 2011.

Original comment by 143.ravi...@gmail.com on 8 Oct 2014 at 2:23

Attachments:

GoogleCodeExporter commented 9 years ago
I've tried input.docx with both Tikal and Rainbow, extracting and merging 
back: I got back a docx file with no error, including the image of a bunch of 
flowers.
The [dp...] markers are there, but as inline codes (as expected), and I don't 
see them in the text of the merged document.
I've also tried the two other examples without issues.

I'm not ruling out a bug: I just can't reproduce it for now.

Question:
- what tool are you using for the extraction/merging? and what options (if 
applicable)?

Thanks,
-ys

Original comment by yves.sav...@gmail.com on 8 Oct 2014 at 1:12

GoogleCodeExporter commented 9 years ago
I am using an Okapi pipeline to generate the text units for extraction/merging.

The extraction part works fine: it generates the required text units.
(I am assuming that no placeholders are required to mark an image location 
in the source of any text unit.)

The issue seems to occur while merging back using the same RawDocument.

The overridden OpenXMLZipFilterWriter and OpenXMLFilter files are attached. 

The only option modified here is - setBPreferenceTranslateDocProperties(false);

While merging, the filter processes the entries of the docx file 
(Word_Image2.docx) in this order -
1. [Content_Types].xml
2. word/styles.xml
3. word/document.xml
4. word/settings.xml
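
To double-check an order like the one listed above independently of Okapi, the entries of a .docx can be listed in the order they are stored in the ZIP container. This is only a minimal sketch using java.util.zip; the class and method names (DocxEntryOrder, listEntries) are made up for illustration, and the stored order is not necessarily the order in which the Open XML filter processes the parts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Okapi.
final class DocxEntryOrder {

    // Returns the entry names in the order they appear in the archive stream.
    static List<String> listEntries(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

Calling DocxEntryOrder.listEntries(new FileInputStream("Word_Image2.docx")) on the source and on the merged file would show whether both archives contain the same parts in the same sequence.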

For the attached docx file, document.xml yields the following text units.

1. This is a simple text
2. Picture 1

There is also an untranslatable text unit generated with 
textUnit.getSource().hasText() == false : -

textUnit = (net.sf.okapi.common.resource.TextUnit) 
<w:r><w:rPr><w:noProof/><w:lang 
w:eastAsia="zh-TW"/></w:rPr><w:drawing><wp:inline distT="0" distB="0" distL="0" 
distR="0" wp14:anchorId="3F1AD5DF" wp14:editId="038BA628"><wp:extent 
cx="3492500" cy="2324100"/><wp:effectExtent l="0" t="0" r="12700" 
b="12700"/><wp:docPr id="2" [#$tu3]/><wp:cNvGraphicFramePr><a:graphicFrameLocks 
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" 
noChangeAspect="1"/></wp:cNvGraphicFramePr>[#$sg1]</wp:inline></w:drawing></w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>

The entries of the source and target docx files are exactly the same except 
for the following difference in "word/document.xml"

Source -

<wp:docPr id="2" name="Picture 1"/><wp:cNvGraphicFramePr><a:graphicFrameLocks 
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" 
noChangeAspect="1"/></wp:cNvGraphicFramePr><a:graphic 
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData 
uri="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:pic 
xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr><pic:cNvPr id="0" name="Picture 1"/>

Target -

<wp:docPr id="2" -ERR:REF-NOT-FOUND-/><wp:cNvGraphicFramePr><a:graphicFrameLocks 
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" 
noChangeAspect="1"/></wp:cNvGraphicFramePr><a:graphic 
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData 
uri="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:pic 
xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr><pic:cNvPr id="0" -ERR:REF-NOT-FOUND-/>
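
A quick way to spot this kind of damage without opening the file in Word is to scan the merged word/document.xml for the error string shown in the target above. The following is a rough sketch that uses only the Java standard library; MergeErrorScan and findMarkers are hypothetical names, and the XML is assumed to have already been read into a String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Okapi.
final class MergeErrorScan {

    // The string that appears in the merged document.xml quoted in this thread.
    static final String MARKER = "-ERR:REF-NOT-FOUND-";

    // Returns the character offset of every occurrence of MARKER in xml.
    static List<Integer> findMarkers(String xml) {
        List<Integer> offsets = new ArrayList<>();
        int from = 0;
        int at;
        while ((at = xml.indexOf(MARKER, from)) >= 0) {
            offsets.add(at);
            from = at + MARKER.length();
        }
        return offsets;
    }
}
```

An empty list means the string was not found; it does not by itself prove that the merged document is valid.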

Original comment by 143.ravi...@gmail.com on 9 Oct 2014 at 4:05

Attachments:

GoogleCodeExporter commented 9 years ago
"-ERR:REF-NOT-FOUND-"

This is a merge error string. I'll see if I can debug this using our tkit 
integration test. Possibly the latest changes in m27 will give different 
results.

Original comment by jhargrav...@gmail.com on 9 Oct 2014 at 4:32

GoogleCodeExporter commented 9 years ago
I have confirmed that with the latest m27, using the default configuration, all 
attached files extract and merge without problems. I also tried with 
"openXmlFilter.setBPreferenceTranslateDocProperties(false);" and got the same 
results.

This is with and without segmentation.

I'm leaning toward either (1) this bug has been fixed in m27 or (2) there is 
something else in the pipeline or custom derived filter and writer causing the 
problem.

Can you retry your tests with the latest m27-SNAPSHOT?  If that doesn't work 
please tell us the exact steps in your pipeline.

Original comment by jhargrav...@gmail.com on 9 Oct 2014 at 5:25

GoogleCodeExporter commented 9 years ago
I tried with the m27-SNAPSHOT version but got the same results on my side. 
Still seeing the "-ERR:REF-NOT-FOUND-" in the derived document.xml

Attaching my pipe line steps -
 For Extraction -

1. ExtractionStep.java - It has the details of the pipeline I'm using. It is 
composed of the DocTubStep, which is used to store the extracted text units in 
the DB. I tried with the segmentation and metrics steps both on and off.

For Merging -

1. MergeService.java - It has the pipeline details used for merging. It is 
composed of TranslateStep.java (to fetch the translations from the DB for the 
extracted text units of the RawDocument) and a FilterEventsStreamWriterStep, 
which is decorated with the MSFilterWriter to write the translated 
contents back into an OutputStream.

Both the extraction and merging steps use the same WordFileFormat, MSFilter, 
and MSFilterWriter.

Original comment by 143.ravi...@gmail.com on 10 Oct 2014 at 3:54

Attachments:

GoogleCodeExporter commented 9 years ago
The symptoms of the issue look like the ReferenceFlag info of the inline codes 
that have references is not set properly.

If the getData() of a Code has one or more markers starting with "[#$", that 
code must have the reference flag set to true (code.setReferenceFlag(true)).
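
As a rough illustration of that rule, the check on the code data can be written with plain string handling. This is only a sketch: ReferenceMarkerCheck and needsReferenceFlag are hypothetical names, and the data string is assumed to be whatever getData() returned for the code.

```java
// Hypothetical helper, not part of Okapi.
final class ReferenceMarkerCheck {

    // Prefix of the markers seen in this thread, e.g. [#$dp2], [#$tu3], [#$sg1].
    static final String REF_PREFIX = "[#$";

    // True when the raw data of an inline code contains at least one marker,
    // which per the rule above means its reference flag should be true.
    static boolean needsReferenceFlag(String codeData) {
        return codeData != null && codeData.contains(REF_PREFIX);
    }
}
```

In a pipeline step, a code whose data passes this check but whose flag is not set could then be corrected with code.setReferenceFlag(true), or at least logged, before the events reach the writer.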

It looks like the events are saved in some kind of DB store in this pipeline. 
Maybe that information is not saved properly and is missing when merging back?

Original comment by yves.sav...@gmail.com on 10 Oct 2014 at 10:47

GoogleCodeExporter commented 9 years ago
I tried a Word doc whose text units do not contain any "[#$" markers.
They are simple ones, like -

"This is a simple text"  and one more text unit for the name of the picture 
"Picture 1"

I see a similar output with "-ERR:REF-NOT-FOUND-". I don't store the events 
in any DB store; in fact, the events are generated fresh for extraction and 
merge by each pipeline. 

I looked at a few test cases at 
http://code.google.com/p/okapi/source/browse/okapi/filters/openxml/src/test/java/net/sf/okapi/filters/openxml

Not sure what else could be missing or corrupting the merge back.

Original comment by 143.ravi...@gmail.com on 11 Oct 2014 at 2:06