computerline1z / okapi

Automatically exported from code.google.com/p/okapi

XMLStreamFilter with HTMLSubfilter doesn't group back the XML tags correctly with M22 #339

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
My Source File is as follows -
<Solution>
<RESOLUTION>
<![CDATA[<p><li>Test</p></li>]]>
</RESOLUTION>

<DESCRIPTION>
<![CDATA[<p> Testing </p>]]>
</DESCRIPTION>
</Solution>

When the file is merged back, the CDATA sections end up outside their parent tags, as follows -

<Solution>
<RESOLUTION></RESOLUTION>

<![CDATA[<p><li>Test</p></li>]]>

<DESCRIPTION></DESCRIPTION>
<![CDATA[<p> Testing </p>]]>
</Solution>
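
The misplacement shown above can be checked mechanically. The following is only a minimal sketch using Python's standard library, not part of Okapi; the helper name `cdata_parents` and the two sample strings are made up for illustration and simply copy the source and merged files from this report.

```python
from xml.dom import minidom

# Sample documents copied from this report (hypothetical constants).
SOURCE = """<Solution>
<RESOLUTION>
<![CDATA[<p><li>Test</p></li>]]>
</RESOLUTION>
<DESCRIPTION>
<![CDATA[<p> Testing </p>]]>
</DESCRIPTION>
</Solution>"""

MERGED = """<Solution>
<RESOLUTION></RESOLUTION>
<![CDATA[<p><li>Test</p></li>]]>
<DESCRIPTION></DESCRIPTION>
<![CDATA[<p> Testing </p>]]>
</Solution>"""


def cdata_parents(xml_text):
    """Return the parent element name of each CDATA section, in document order."""
    doc = minidom.parseString(xml_text)
    parents = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.CDATA_SECTION_NODE:
                parents.append(node.nodeName)
            walk(child)

    walk(doc.documentElement)
    return parents


if __name__ == "__main__":
    print("source:", cdata_parents(SOURCE))
    print("merged:", cdata_parents(MERGED))
```

If the two lists differ, the roundtrip has moved a CDATA section to a different parent element.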

This used to work fine in the earlier version (M20). The problem started after 
I updated "okapi-filter-abstractmarkup" to "0.22-SNAPSHOT" to pull in the fix 
for the following ticket: 
http://code.google.com/p/okapi/issues/detail?id=332

Interestingly, it happens only for tags defined with ruleTypes: [GROUP] in my 
yml file.

The yml definitions for the above tags are as follows -

  resolution:
    ruleTypes: [GROUP]
  description:
    ruleTypes: [GROUP]

Original issue reported on code.google.com by 143.ravi...@gmail.com on 15 May 2013 at 3:22

GoogleCodeExporter commented 9 years ago

Original comment by yves.sav...@gmail.com on 15 May 2013 at 3:34

GoogleCodeExporter commented 9 years ago

Original comment by tingley on 15 May 2013 at 8:47

GoogleCodeExporter commented 9 years ago
This is almost certainly something I've done wrong.

Original comment by tingley on 15 May 2013 at 9:15

GoogleCodeExporter commented 9 years ago
Is the fix available in the dev build now? Can I test it?

Original comment by 143.ravi...@gmail.com on 28 May 2013 at 4:28

GoogleCodeExporter commented 9 years ago
Sorry, no, it's not fixed yet.

Original comment by tingley on 28 May 2013 at 6:55

GoogleCodeExporter commented 9 years ago
Hi ravikant,

Thanks for your patience.  I finally had a chance to look at this and... I'm 
afraid I may need more information from you.  I'm not able to reproduce this 
problem using a basic roundtrip test.  Here's what I did:
  * Copied your source file to a file called cdataWithGroup.xml (attached)
  * Created a filter config with your rules, okf_xmlstream@cdata.fprm (attached)

Then I ran two tikal commands to convert the source XML to XLIFF, and then back 
to XML:
  tikal.sh -fc okf_xmlstream\@cdata.fprm -x cdataWithGroup.xml
  tikal.sh -fc okf_xmlstream\@cdata.fprm -m cdataWithGroup.xml.xlf

This produces an output file (cdataWithGroup.out.xml) which I would expect to 
demonstrate the problem, if it were just a matter of the filter misbehaving.  
However, the output file looks fine to me.

So it seems that there's another factor involved which I will need to take into 
account in order to reproduce this.  Can you provide any more details about 
what you were doing to the source file after it had been segmented?  (ie, how 
was it translated?)

Thanks

Original comment by tingley on 5 Jun 2013 at 5:20

GoogleCodeExporter commented 9 years ago
Hi Tingley,

On my end, with the M22 snapshot version of the "okapi-filter-abstractmarkup" 
jar, the original issue of a spurious segment being generated is fixed. It 
used to generate one for each CDATA tag.

Source file b.xml attached.

I no longer see that segment being generated, and the XLIFF output 
(de-DE.xlf) is also as expected.

To generate the XML file back, I use a pipeline that takes the original .xml 
file as the RawDocument and adds the following steps -
1. RawDocumentToFilterEventsStep()
2. driver.setFilterConfigurationMapper();
3. TranslateStep()
4. FilterEventsStreamWriterStep()

The translate step just updates the target of each text unit with the 
appropriate localized string.

The output of this pipeline is the XML file, and that is where I see the tags 
getting misplaced.

I also see one more difference in the rules you have set in the attached 
okf_xmlstream@cdata.fprm -

I have used "elements" - 

global_cdata_subfilter: okf_html
preserve_whitespace: false

elements:
  solutions:
    ruleTypes: [INCLUDE]
  resolution:
    ruleTypes: [GROUP]
  description:
    ruleTypes: [GROUP]

but you seem to have used the "attributes"

global_cdata_subfilter: okf_html
preserve_whitespace: false
attributes:
  resolution:
    ruleTypes: [GROUP]
  description:
    ruleTypes: [GROUP]

I am not sure whether this could also explain the difference in the output 
that the two of us are seeing.

Original comment by 143.ravi...@gmail.com on 6 Jun 2013 at 3:16

GoogleCodeExporter commented 9 years ago
Hi Tingley,

Did my comments help? Were you able to reproduce it on your side?

Thanks

Original comment by 143.ravi...@gmail.com on 10 Jun 2013 at 2:14

GoogleCodeExporter commented 9 years ago
Hi both,

Just to confirm, I'm getting the same output Ravi is getting with his 
configuration.
I have to admit I'm not sure when to use GROUP and when to use TEXTUNIT, 
though.
With TEXTUNIT it merges back OK, but it creates the extraneous empty XLIFF 
TextUnits.

Fredrik

Original comment by KFLi...@gmail.com on 10 Jun 2013 at 5:38

GoogleCodeExporter commented 9 years ago
Hi ravikant,

Yes, you're right, I had a mistake in my YML configuration.  Thanks for 
pointing that out.  I'm able to reproduce the problem now.

Fredrik: I agree, the semantics of several of the tag rules (including 
TEXTUNIT) are not very clear.

I assume that GROUP is intended to produce START_GROUP/END_GROUP events, which 
are used for example to produce <group> elements in XLIFF.  Looking at the 
XLIFF output from tikal, it looks like this issue may be related to the fact 
that subfiltering also always produces a group.  For example:

<group id="sg1">
<group id="sg1_ssf1" resname="sub-filter:sd1">
<trans-unit id="sg1_tu1" resname="sd1_1" restype="x-paragraph">
<source xml:lang="en"></source>
<target xml:lang="fr"></target>
</trans-unit>
<trans-unit id="sg1_tu2" resname="sd1_2" restype="x-li">
<source xml:lang="en">Test</source>
<target xml:lang="fr">Test</target>
</trans-unit>
</group>
</group>

Note the nested <group> elements.  XLIFF allows nested <group>, although it's 
not commonly used in my experience.  I wonder if this is confusing our merger.
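
To see quickly whether a given XLIFF file contains this kind of nesting, something like the following could be used. It is a rough sketch with Python's standard library only; `nested_groups` and `local_name` are names invented here, and the sample is the fragment quoted above (a real tikal XLIFF file would have a namespaced root, which is why the tag comparison strips any namespace).

```python
import xml.etree.ElementTree as ET

# Fragment quoted in this comment (hypothetical constant).
FRAGMENT = """<group id="sg1">
<group id="sg1_ssf1" resname="sub-filter:sd1">
<trans-unit id="sg1_tu1" resname="sd1_1" restype="x-paragraph">
<source xml:lang="en"></source>
<target xml:lang="fr"></target>
</trans-unit>
<trans-unit id="sg1_tu2" resname="sd1_2" restype="x-li">
<source xml:lang="en">Test</source>
<target xml:lang="fr">Test</target>
</trans-unit>
</group>
</group>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def nested_groups(xliff_text):
    """Return (outer_id, inner_id) for every <group> directly inside a <group>."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for outer in root.iter():
        if local_name(outer.tag) != "group":
            continue
        for inner in outer:
            if local_name(inner.tag) == "group":
                pairs.append((outer.get("id"), inner.get("id")))
    return pairs


if __name__ == "__main__":
    print(nested_groups(FRAGMENT))
```

Each pair names an outer group and the subfilter group nested directly inside it.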

I'll step through this.

Sorry for the slow progress, I've had almost no free time in the past few weeks.

Original comment by tingley on 14 Jun 2013 at 7:09

GoogleCodeExporter commented 9 years ago
This is just state confusion during the event generation.  The reference 
subfilter content isn't being correctly included in the skeleton for either of 
the group events.  Instead it gets left for the DOCUMENT_PART event that 
follows.  This moves the CDATA section outside of its parent element on 
reassembly.
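
As a toy illustration of that effect (this is not Okapi code; the event tuples and the `reassemble` helper are invented for the example): if the output is rebuilt by concatenating each event's skeleton in order, then the event that carries the subfilter content decides where the CDATA lands. Attaching it to the START_GROUP skeleton in the "fixed" list is an assumption made for the sketch, not a description of the actual fix.

```python
CDATA = "<![CDATA[<p><li>Test</p></li>]]>"


def reassemble(events):
    """Rebuild the output by concatenating each event's skeleton in order."""
    return "".join(skeleton for _kind, skeleton in events)


# Buggy shape: neither group event carries the subfilter content,
# so it is left for the DOCUMENT_PART that follows the group.
buggy = [
    ("START_GROUP", "<RESOLUTION>"),
    ("END_GROUP", "</RESOLUTION>"),
    ("DOCUMENT_PART", CDATA),
]

# Intended shape (assumed here): the content travels with a group event.
fixed = [
    ("START_GROUP", "<RESOLUTION>" + CDATA),
    ("END_GROUP", "</RESOLUTION>"),
]

if __name__ == "__main__":
    print(reassemble(buggy))
    print(reassemble(fixed))
```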

Original comment by tingley on 14 Jun 2013 at 7:32

GoogleCodeExporter commented 9 years ago
I have checked in a fix and unittest to dev.  Commit is here:
https://code.google.com/p/okapi/source/detail?r=efa2b0935952a278304c4f9461ced664e4d10b36&name=dev

ravikant, the next snapshot build should include the fix.  

Original comment by tingley on 14 Jun 2013 at 9:16

GoogleCodeExporter commented 9 years ago
Thanks a lot Tingley for looking into this.

Original comment by 143.ravi...@gmail.com on 19 Jun 2013 at 7:18