Open GoogleCodeExporter opened 9 years ago
The above example in comment 1 was on Okapi version 0.24.
I updated Okapi to version 0.26 and parsed the attached file, input.docx.
The text content was pretty simple with an image embedded.
The text units generated were -
[#$dp2][#$dp3]Some results may have been blocked under EU data protection law.
</w:t></w:r>[#$dp4]Learn more</w:t></w:r>
Picture 1
<w:r><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial"
w:cs="Arial"/><w:b/><w:i/><w:sz w:val="22"/><w:szCs
w:val="22"/></w:rPr><w:t>There is also a text after image</w:t></w:r>
Microsoft Word for Mac 2011 complains about invalid contents in the generated
output file and fails to open it.
Original comment by 143.ravi...@gmail.com
on 8 Oct 2014 at 2:23
Attachments:
I've tried input.docx with both Tikal and Rainbow, extracting and merging
back: I got a docx file with no error, and with the image of the bunch of
flowers. The [dp...] markers are there, but as inline codes (as expected), and
I don't see them in the text of the merged document.
I've also tried the two other examples without issues.
I'm not ruling out a bug: I just can't reproduce it for now.
Question:
- What tool are you using for the extraction/merging, and with what options (if
applicable)?
Thanks,
-ys
Original comment by yves.sav...@gmail.com
on 8 Oct 2014 at 1:12
I am using an Okapi pipeline to generate the text units for extraction/merging.
The extraction part works fine, as it generates the required text units.
(I am hoping that no placeholders are required to mark an image location
in the source of any text unit.)
The issue seems to occur while merging back using the same RawDocument.
The overridden OpenXMLZipFilterWriter and OpenXMLFilter files are attached.
The only option modified here is - setBPreferenceTranslateDocProperties(false);
While merging, the filter processes the parts of the docx file
(Word_Image2.docx) in this order -
1. [Content_Types].xml
2. word/styles.xml
3. word/document.xml
4. word/settings.xml
The document.xml generates the following text units of the attached docx file.
1. This is a simple text
2. Picture 1
There is also an untranslatable text unit generated with
textUnit.getSource().hasText() == false : -
textUnit = (net.sf.okapi.common.resource.TextUnit)
<w:r><w:rPr><w:noProof/><w:lang
w:eastAsia="zh-TW"/></w:rPr><w:drawing><wp:inline distT="0" distB="0" distL="0"
distR="0" wp14:anchorId="3F1AD5DF" wp14:editId="038BA628"><wp:extent
cx="3492500" cy="2324100"/><wp:effectExtent l="0" t="0" r="12700"
b="12700"/><wp:docPr id="2" [#$tu3]/><wp:cNvGraphicFramePr><a:graphicFrameLocks
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
noChangeAspect="1"/></wp:cNvGraphicFramePr>[#$sg1]</wp:inline></w:drawing></w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
The entries of the source and target docx files are exactly the same, except
for the following difference in "word/document.xml":
Source -
<wp:docPr id="2" name="Picture 1"/><wp:cNvGraphicFramePr><a:graphicFrameLocks
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
noChangeAspect="1"/></wp:cNvGraphicFramePr><a:graphic
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData
uri="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:pic
xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr><pic:cNvPr id="0" name="Picture 1"/>
Target -
<wp:docPr id="2" -ERR:REF-NOT-FOUND-/><wp:cNvGraphicFramePr><a:graphicFrameLocks
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
noChangeAspect="1"/></wp:cNvGraphicFramePr><a:graphic
xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData
uri="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:pic
xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr><pic:cNvPr id="0" -ERR:REF-NOT-FOUND-/>
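To make the comparison above repeatable, here is a minimal sketch that lists which parts of a merged docx contain the "-ERR:REF-NOT-FOUND-" string. It uses only java.util.zip and does not touch Okapi. The class name MergeErrorScan and the helpers partsWithMergeError and zipOf are made up for this illustration, and the demo data is a tiny in-memory zip, not a valid Word file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class MergeErrorScan {

    // Error string seen in the merged word/document.xml in this report.
    static final String MERGE_ERROR = "-ERR:REF-NOT-FOUND-";

    // Returns the names of the zip entries whose content contains the error string.
    static List<String> partsWithMergeError(byte[] docxBytes) {
        List<String> hits = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Read the current entry fully; read() returns -1 at the end of the entry.
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                String text = new String(content.toByteArray(), StandardCharsets.UTF_8);
                if (text.contains(MERGE_ERROR)) {
                    hits.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    // Hypothetical helper: builds a small in-memory zip from name/content pairs (demo data only).
    static byte[] zipOf(String... nameContentPairs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i + 1 < nameContentPairs.length; i += 2) {
                zip.putNextEntry(new ZipEntry(nameContentPairs[i]));
                zip.write(nameContentPairs[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] merged = zipOf(
                "word/document.xml", "<wp:docPr id=\"2\" -ERR:REF-NOT-FOUND-/>",
                "word/styles.xml", "<w:styles/>");
        // Lists the affected parts of the demo archive.
        System.out.println(partsWithMergeError(merged));
    }
}
```

For a real file, the bytes could come from java.nio.file.Files.readAllBytes on the merged docx; that only shows where the error string ended up, not why the reference could not be resolved.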
Original comment by 143.ravi...@gmail.com
on 9 Oct 2014 at 4:05
Attachments:
"-ERR:REF-NOT-FOUND-"
This is a merge error string. I'll see if I can debug this using our tkit
integration test. Possibly the latest changes in m27 will give different
results.
Original comment by jhargrav...@gmail.com
on 9 Oct 2014 at 4:32
I have confirmed that with the latest m27 and the default configuration, all
attached files extract and merge without problems. I also tried with
"openXmlFilter.setBPreferenceTranslateDocProperties(false);" and got the same
results.
This is with and without segmentation.
I'm leaning toward either (1) this bug has been fixed in m27 or (2) there is
something else in the pipeline or custom derived filter and writer causing the
problem.
Can you retry your tests with the latest m27-SNAPSHOT? If that doesn't work
please tell us the exact steps in your pipeline.
Original comment by jhargrav...@gmail.com
on 9 Oct 2014 at 5:25
I tried with the m27-SNAPSHOT version but got the same results on my side.
I am still seeing "-ERR:REF-NOT-FOUND-" in the derived document.xml.
Attaching my pipe line steps -
For Extraction -
1. ExtractionStep.java - It has the details of the pipeline I'm using. It is
composed of the DocTubStep, which is used to store the extracted text units in
the DB. I tried with the segmentation and metrics steps both on and off.
For Merging -
1. MergeService.java - It has the details of the pipeline used for merging. It
is composed of the TranslateStep.java (to fetch the translations from the DB
for the extracted text units of the RawDocument) and a
FilterEventsStreamWriterStep, which is decorated with the MSFilterWriter to
write the translated contents back into an OutputStream.
Both the extraction and merging steps use the same WordFileFormat, MSFilter
and MSFilterWriter.
Original comment by 143.ravi...@gmail.com
on 10 Oct 2014 at 3:54
Attachments:
The symptoms of the issue look like the ReferenceFlag info of the inline codes
that have references is not set properly.
If the getData() of a Code has one or more markers starting with "[#$", that
code must have the reference flag set to true (code.setReferenceFlag(true)).
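As a rough illustration of that rule, the sketch below flags codes whose data contains a "[#$" marker but whose reference flag is not set. It deliberately does not use the Okapi classes: InlineCode is a hypothetical stand-in with only the accessors needed here, and ReferenceFlagCheck, containsReferenceMarker and findUnflagged are names invented for this example, so treat it as a description of the check rather than as filter code.

```java
import java.util.ArrayList;
import java.util.List;

public class ReferenceFlagCheck {

    // Marker prefix quoted in this comment.
    static final String REF_MARKER_PREFIX = "[#$";

    // Hypothetical stand-in for an inline code; this is NOT Okapi's Code class.
    static final class InlineCode {
        private final String data;
        private boolean referenceFlag;

        InlineCode(String data, boolean referenceFlag) {
            this.data = data;
            this.referenceFlag = referenceFlag;
        }

        String getData() {
            return data;
        }

        boolean hasReference() {
            return referenceFlag;
        }

        void setReferenceFlag(boolean value) {
            referenceFlag = value;
        }
    }

    // True when the code data holds at least one reference marker.
    static boolean containsReferenceMarker(String data) {
        return data != null && data.contains(REF_MARKER_PREFIX);
    }

    // Returns the codes that contain a marker but do not have the flag set.
    static List<InlineCode> findUnflagged(List<InlineCode> codes) {
        List<InlineCode> unflagged = new ArrayList<>();
        for (InlineCode code : codes) {
            if (containsReferenceMarker(code.getData()) && !code.hasReference()) {
                unflagged.add(code);
            }
        }
        return unflagged;
    }

    public static void main(String[] args) {
        List<InlineCode> codes = new ArrayList<>();
        codes.add(new InlineCode("<wp:docPr id=\"2\" [#$tu3]/>", false));
        codes.add(new InlineCode("<w:b/>", false));
        // Repair the inconsistent codes by setting the flag.
        for (InlineCode code : findUnflagged(codes)) {
            code.setReferenceFlag(true);
        }
        System.out.println(codes.get(0).hasReference() + " " + codes.get(1).hasReference());
    }
}
```

A check of this kind could be run on the codes of each text unit just before the events reach the writer, to see whether the flag was lost somewhere between extraction and merge.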
It looks like the events are saved in some kind of DB store in this pipeline.
Maybe that information is not saved properly and is missing when merging back?
Original comment by yves.sav...@gmail.com
on 10 Oct 2014 at 10:47
I tried a Word doc whose text units do not contain "[#$" markers.
They are simple ones, like -
"This is a simple text" and one more text unit for the name of the picture,
"Picture 1".
I see a similar output with "-ERR:REF-NOT-FOUND-". I don't store the events
in any DB store; in fact, the events are generated fresh for extraction and
merge by each pipeline.
I looked at a few test cases at -
http://code.google.com/p/okapi/source/browse/okapi/filters/openxml/src/test/java/net/sf/okapi/filters/openxml
Not sure what else could be missing or corrupting the merge back.
Original comment by 143.ravi...@gmail.com
on 11 Oct 2014 at 2:06
Original issue reported on code.google.com by
143.ravi...@gmail.com
on 7 Oct 2014 at 2:30
Attachments: