computerline1z / okapi

Automatically exported from code.google.com/p/okapi

XMLStreamFilter merges back CDATA section incorrectly with Okapi M20 #320

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I have an XML file with HTML content inside a CDATA section:

<SOLUTIONS><TITLE><![CDATA[<p>The Test Search 
Alliance</p>]]></TITLE></SOLUTIONS>

The XLIFF generated by the XMLStream filter in Okapi M20 contains an extra 
text unit with a new placeholder:

<body>
  <trans-unit id="55" resname="{group:sg1,tu:tu1}" xml:space="preserve">
    <source xml:lang="en"><ph id="1">[#$tu1_ssf1]</ph></source>
    <seg-source><mrk mid="0" mtype="seg"><ph id="1">[#$tu1_ssf1]</ph></mrk></seg-source>
    <target xml:lang="es-ES" state="new"><mrk mid="0" mtype="seg"><ph id="1">[#$tu1_ssf1]</ph></mrk></target>
  </trans-unit>
  <trans-unit id="54" resname="sd1_1" xml:space="preserve">
    <source xml:lang="en">The Test Search Alliance</source>
    <seg-source><mrk mid="0" mtype="seg">The Test Search Alliance</mrk></seg-source>
    <target xml:lang="es-ES" state="new"><mrk mid="0" mtype="seg">The Test Search Alliance</mrk></target>
  </trans-unit>
</body>

Once the translation is done, the generated localized file looks like this:

<SOLUTIONS><TITLE><p>The Test Search Alliance 
TRANSLATED</p><![CDATA[[#$tu1_ssf1]]]></TITLE></SOLUTIONS>

The CDATA section is not merged back correctly. It is pushed to the end of the 
translated string, and its content is the placeholder that was generated in 
the XLIFF.

I am using Okapi M20; the same code worked fine with Okapi M14.

Is anything missing here? I am also attaching my yml file for reference.

Original issue reported on code.google.com by 143.ravi...@gmail.com on 28 Mar 2013 at 7:24

GoogleCodeExporter commented 9 years ago
Using your yaml configuration I tried creating a generic xliff package for the 
string:

<SOLUTIONS><TITLE><![CDATA[<p>The Test Search 
Alliance</p>]]></TITLE></SOLUTIONS>

The xliff I get:
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" 
xmlns:okp="okapi-framework:xliff-extensions" 
xmlns:its="http://www.w3.org/2005/11/its">
<file original="/test.xml" source-language="en-us" target-language="fr-fr" 
datatype="xml">
<body>
<group id="sg1">
<group id="tu1_ssf1" resname="sub-filter:sd1">
<trans-unit id="tu1_tu1" resname="sd1_1" restype="x-paragraph">
<source xml:lang="en-us">The Test Search Alliance</source>
<target xml:lang="fr-fr">The Test Search Alliance</target>
</trans-unit>
</group>
<trans-unit id="tu1" restype="x-cdata">
<source xml:lang="en-us"><x id="1"/></source>
<target xml:lang="fr-fr"><x id="1"/></target>
</trans-unit>
</group>
</body>
</file>
</xliff>

That seems to merge back fine. Can you give some more details? Not sure why 
your xliff looks so different from mine. Are you running a custom pipeline?

Original comment by fli...@enlaso.com on 28 Mar 2013 at 7:53

GoogleCodeExporter commented 9 years ago
I am also using my own filter, which extends XmlStreamFilter. It is configured 
inside the FilterConfigurationMapper, and I get the right filter instance based 
on my custom MIME type.

I am attaching the filter for reference.

The pipeline is as follows:

IPipelineDriver driver = new PipelineDriver();
driver.setFilterConfigurationMapper(iFilterConfigurationMapper);
driver.addStep(new RawDocumentToFilterEventsStep());
driver.addBatchItem(rawDocument);
driver.processBatch();

Original comment by 143.ravi...@gmail.com on 28 Mar 2013 at 9:23

GoogleCodeExporter commented 9 years ago
I'm not sure there's anything in the filter that would cause it. Can you confirm 
that it works if you use the plain XMLStream filter with your YAML 
configuration?
What are you doing in terms of extraction/merge? The pipeline doesn't show 
that. The XLIFF seems to have segmentation as well.
If I extract it with default segmentation and the <ph> format, I get this output:

<group id="sg1">
<group id="tu1_ssf1" resname="sub-filter:sd1">
<trans-unit id="tu1_tu1" resname="sd1_1" restype="x-paragraph">
<source xml:lang="en-us">The Test Search Alliance</source>
<seg-source><mrk mid="0" mtype="seg">The Test Search Alliance</mrk></seg-source>
<target xml:lang="fr-fr"><mrk mid="0" mtype="seg">The Test Search 
Alliance</mrk></target>
</trans-unit>
</group>
<trans-unit id="tu1" restype="x-cdata">
<source xml:lang="en-us"><ph id="1">[#$tu1_ssf1]</ph></source>
<seg-source><mrk mid="0" mtype="seg"><ph 
id="1">[#$tu1_ssf1]</ph></mrk></seg-source>
<target xml:lang="fr-fr"><mrk mid="0" mtype="seg"><ph 
id="1">[#$tu1_ssf1]</ph></mrk></target>
</trans-unit>
</group>

It seems your XLIFF is missing the reference to #$tu1_ssf1.

Original comment by fli...@enlaso.com on 28 Mar 2013 at 6:41
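
As an illustration of what the [#$tu1_ssf1] marker discussed above stands for: as I understand it, at merge time the marker should be replaced by the merged output of the group with that id, so the translated HTML ends up back inside the CDATA section. The following is only a minimal sketch in plain Java. It does not use Okapi's classes, and the class name, the method name, and the marker pattern are assumptions made for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Okapi's skeleton writer.
class CdataReferenceSketch {
    // Matches reference markers of the form [#$id], as seen in the XLIFF above.
    private static final Pattern REF = Pattern.compile("\\[#\\$([A-Za-z0-9_]+)\\]");

    // Replaces each marker with the content registered under its id.
    // Unknown ids are left untouched so that a missing referent stays visible.
    public static String resolve(String skeleton, Map<String, String> referents) {
        Matcher m = REF.matcher(skeleton);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = referents.getOrDefault(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String skeleton = "<SOLUTIONS><TITLE><![CDATA[[#$tu1_ssf1]]]></TITLE></SOLUTIONS>";
        Map<String, String> referents =
            Map.of("tu1_ssf1", "<p>The Test Search Alliance TRANSLATED</p>");
        System.out.println(resolve(skeleton, referents));
    }
}
```

When no referent is registered for the id, the marker is left in place, which resembles the [#$tu1_ssf1] text that showed up inside the CDATA section of the badly merged file.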

GoogleCodeExporter commented 9 years ago
Hi,
I am finally able to get the same output as yours. The XLIFF was not generated 
as expected because my step was not handling the subfilter events, which did 
not exist in Okapi M14.

After handling them as follows, the output came out as expected:

            case END_SUBDOCUMENT:
            case START_GROUP:
            case END_GROUP:
            case TEXT_UNIT:
            case DOCUMENT_PART:
                return iFilterWriter.handleEvent(event);

Is it actually normal to have a text unit generated just to hold a reference 
to the CDATA, like this one?

<trans-unit id="tu1" restype="x-cdata">
<source xml:lang="en-us"><ph id="1">[#$tu1_ssf1]</ph></source>
<seg-source><mrk mid="0" mtype="seg"><ph 
id="1">[#$tu1_ssf1]</ph></mrk></seg-source>
<target xml:lang="fr-fr"><mrk mid="0" mtype="seg"><ph 
id="1">[#$tu1_ssf1]</ph></mrk></target>
</trans-unit>

With Okapi M14 the CDATA used to be part of the skeleton itself, without any 
text unit being generated just for the reference.

Is there a way to avoid this? The unit contains no text to translate; it is 
only a reference.

Original comment by 143.ravi...@gmail.com on 2 Apr 2013 at 11:57
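
The fix described in the comment above comes down to a custom step forwarding every event type the writer needs, including the ones introduced with subfilters. Here is a minimal, self-contained sketch of that idea in plain Java. The enum is a hypothetical stand-in for the framework's event types, the names START_SUBFILTER and END_SUBFILTER are assumed rather than taken from the thread, and the list stands in for the filter writer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; it does not use Okapi's Event or IFilterWriter.
class SubfilterEventSketch {
    // Stand-in event types: the names quoted in the comment above, plus an
    // assumed subfilter pair and a catch-all.
    public enum EventType {
        START_SUBFILTER, END_SUBFILTER, END_SUBDOCUMENT,
        START_GROUP, END_GROUP, TEXT_UNIT, DOCUMENT_PART, OTHER
    }

    // Returns the events that would reach the writer. Any event type without
    // a matching case is silently dropped by the default branch.
    public static List<EventType> forward(List<EventType> events, boolean handleSubfilterEvents) {
        List<EventType> written = new ArrayList<>();
        for (EventType e : events) {
            switch (e) {
                case START_SUBFILTER:
                case END_SUBFILTER:
                    // Models a step written before these event types existed.
                    if (handleSubfilterEvents) {
                        written.add(e);
                    }
                    break;
                case END_SUBDOCUMENT:
                case START_GROUP:
                case END_GROUP:
                case TEXT_UNIT:
                case DOCUMENT_PART:
                    written.add(e);
                    break;
                default:
                    break;
            }
        }
        return written;
    }
}
```

With handleSubfilterEvents set to false, the events that delimit the subfilter output never reach the writer, so the extracted file can lose the structure that a reference marker points to.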

GoogleCodeExporter commented 9 years ago
> Is it actually normal to have a text unit 
> generated for just having a reference to the CData

This should be resolved now, as issue #303 has been fixed for M21.

Original comment by yves.sav...@gmail.com on 15 Apr 2013 at 12:12

GoogleCodeExporter commented 9 years ago
I have recently migrated the whole project to M20. For the above fix, is it 
enough to update just the artifact containing the XMLStreamFilter class, or do 
I need to pull in any other dependency? Is that the only change for the fix?

I would rather not move the Okapi core version again, this time to M21.

Original comment by 143.ravi...@gmail.com on 22 Apr 2013 at 3:28

GoogleCodeExporter commented 9 years ago
I would actually double-check that the issue #303 fix applies here as well.  
Issue #303 covered the PCDATA case, which I think may be a different code path.

Original comment by tingley on 23 Apr 2013 at 5:28

GoogleCodeExporter commented 9 years ago
The issue still seems to be there with CDATA. I used the latest M21 versions of 
the following artifacts and can still see the extra text unit (the CDATA 
placeholder):

1. okapi-filter-xmlstream
2. okapi-filter-abstractmarkup
3. okapi-filter-html

Original comment by 143.ravi...@gmail.com on 24 Apr 2013 at 2:44