computerline1z / okapi

Automatically exported from code.google.com/p/okapi

tikal -m is unescaping #371

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?

Please provide any additional information below.

Original issue reported on code.google.com by johnt...@gmail.com on 16 Oct 2013 at 5:37

GoogleCodeExporter commented 9 years ago
Could you please provide more information about the issue?
-m is the Tikal command to merge. What is being un-escaped during the merge?
An example would be useful.
Thanks

Original comment by yves.sav...@gmail.com on 16 Oct 2013 at 5:41

GoogleCodeExporter commented 9 years ago
I suspect my issue didn't submit correctly with the examples as I got an error 
message...

So our source file we're translating has the following XLIFF markup:

this is a <bpt id="1">&lt;b&gt;</bpt> small house <ept id="1">&lt;/b&gt;</ept>

We convert this to Moses inline format and get our translation:

this is a <g id="1">small house </g>

das ist ein <g id="1"> kleines haus </g>

Finally, we run 'tikal -m' to put the original XLIFF from the source back into 
the translated target, and we get the following:

this is a <bpt id="1">&lt;b></bpt> small house <ept id="1">&lt;/b></ept>

The &gt; entity has been unescaped back to the > character. Now, it comes to my 
attention that this may be intentional, as we only need to escape the < in order 
to have valid XML...

I'd be grateful if you could elaborate

Original comment by johnt...@gmail.com on 16 Oct 2013 at 6:20

GoogleCodeExporter commented 9 years ago
First, you may be using the wrong command. 'Merging' translations from Moses is 
done by leveraging the file you have prepared when using the -xm command. The 
leveraging is done with -lm (not -m).

See 
http://www.opentag.com/okapi/wiki/index.php?title=Tikal_-_Extraction_Commands#Merge_Files 
for more info.

In any case, for any XLIFF document:

"<bpt id="1">&lt;b&gt;</bpt> small house <ept id="1">&lt;/b&gt;</ept>" and "<bpt 
id="1">&lt;b></bpt> small house <ept id="1">&lt;/b></ept>" are identical from the XML 
parser viewpoint. As you noted, there is no need to escape the character '>'.
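
One way to check this is a minimal sketch using Python's standard-library XML parser (not Okapi or Tikal). The <seg> wrapper element and the helper name inline_codes are assumptions added only so the fragment parses as a complete document:

```python
import xml.etree.ElementTree as ET

# Hypothetical <seg> wrapper: only there to give the fragment a single root element.
escaped = ('<seg>this is a <bpt id="1">&lt;b&gt;</bpt> small house '
           '<ept id="1">&lt;/b&gt;</ept></seg>')
unescaped = ('<seg>this is a <bpt id="1">&lt;b></bpt> small house '
             '<ept id="1">&lt;/b></ept></seg>')


def inline_codes(xml_text):
    """Return (tag, id, parsed text) for each inline code element."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.get("id"), el.text) for el in root]


codes_escaped = inline_codes(escaped)
codes_unescaped = inline_codes(unescaped)

# Both forms yield the same parsed content for the inline codes.
print(codes_escaped)
print(codes_escaped == codes_unescaped)
```

Both strings parse to the same inline-code content, so a consumer that reads the file through an XML parser should not see a difference; only a tool that compares the raw bytes would.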

cheers,
-ys

Original comment by yves.sav...@gmail.com on 16 Oct 2013 at 6:55

GoogleCodeExporter commented 9 years ago
I tried the -lm command to the same effect, but not -xm. I'll try this next.

Anyway, maybe we can get away with not having to worry about this case
(will have to check with end users).

Thanks for your help
John

Original comment by johnt...@gmail.com on 16 Oct 2013 at 7:16