computerline1z / okapi

Automatically exported from code.google.com/p/okapi

Entity reference in XML Stream filter #394

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The XML Stream filter seems to escape the ampersand of any entity reference.

<?xml version="1.0" ?>
<root>
<p>test &abcdef; text</p>
</root>

becomes:

<?xml version="1.0" ?>
<root>
<p>test &amp;abcdef; text</p>
</root>

When the entity is declared, it also duplicates the content of the declaration:

<?xml version="1.0" ?>
<!DOCTYPE root [
  <!ENTITY abcdef "ABCDEF">
]>
<root>
<p>test &abcdef; text</p>
</root>

becomes:

<?xml version="1.0" ?>
<!DOCTYPE root [
  <!ENTITY abcdef "ABCDEF">
]><!ENTITY abcdef "ABCDEF">
<root>
<p>test &amp;abcdef; text</p>
</root>

Original issue reported on code.google.com by yves.sav...@gmail.com on 5 Mar 2014 at 5:43

GoogleCodeExporter commented 9 years ago
Further, when I pre-process XML files with the okf_xmlstream filter, some characters 
("&", "<" and so on) are escaped to their entity names.
When I post-process the translated version, the reverse happens: these entity 
names (including those in the original XML files) are replaced by the characters 
themselves.
For more details about these characters, please refer to: 
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML

I use CodeFinder in the filter:
useCodeFinder: true
codeFinderRules: |-
  #v1
  count.i=1
  rule0=(&[^;]+?;|&|<|>|'|")
But the bug is still there.
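
For reference, here is a minimal sketch of what the rule0 pattern above matches, written in Python rather than with Okapi itself. It assumes Python's `re` module treats this particular pattern the same way as the Java regex engine the filter uses. The helper name `find_codes` is hypothetical and is not part of Okapi.

```python
import re

# The rule0 pattern from the configuration above, copied verbatim.
RULE0 = re.compile(r"""(&[^;]+?;|&|<|>|'|")""")


def find_codes(text):
    """Return the substrings the pattern would mark as inline codes.

    Hypothetical helper for illustration only; not an Okapi API.
    """
    return [m.group(0) for m in RULE0.finditer(text)]


if __name__ == "__main__":
    # An entity reference is captured as a single unit.
    print(find_codes("test &abcdef; text"))
    # Bare special characters are captured individually.
    print(find_codes("a & b < c"))
    # Caveat: a bare ampersand followed later by a semicolon is
    # swallowed together with the text in between.
    print(find_codes("a & b; c"))
```

Note that the first alternative, `&[^;]+?;`, accepts any run of characters up to the next semicolon, so a bare "&" followed later by ";" is matched as one span. A stricter name pattern would avoid this, but that is a separate matter from the escaping behavior reported here.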

Original comment by rhong...@gmail.com on 6 Mar 2014 at 1:19

GoogleCodeExporter commented 9 years ago

Original comment by yves.sav...@gmail.com on 7 Mar 2014 at 5:12

GoogleCodeExporter commented 9 years ago
For the first case (&abcdef;), there's a question of what the correct behavior 
should be.  I think there's a good argument to be made that the entity should 
be exposed for translation as a placeholder, since we probably don't know what 
it represents (the entity declaration may not even always be available).  Most 
of the time, entities are used for content parameterization (product name, 
etc), in which case you don't want the translators to be able to mess with 
them.  So protecting the entity by converting it automatically to a code 
seems like the best behavior to me.

Do others agree?
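
To make the proposal concrete, here is a rough sketch in Python of protecting entity references as placeholders and restoring them afterwards. This is not how Okapi implements inline codes. The names `protect` and `restore`, the `{n}` placeholder syntax, and the simplified ASCII-only entity name pattern are all assumptions made for illustration.

```python
import re

# Simplified pattern for a general entity reference such as &abcdef;
# (assumption: ASCII names only; real XML names allow more characters).
ENTITY_REF = re.compile(r"&[A-Za-z_][A-Za-z0-9._-]*;")
PLACEHOLDER = re.compile(r"\{(\d+)\}")


def protect(text):
    """Replace each entity reference with a numbered placeholder.

    Returns the protected text and the list of original references.
    Hypothetical helper; not an Okapi API.
    """
    codes = []

    def repl(match):
        codes.append(match.group(0))
        return "{%d}" % len(codes)

    return ENTITY_REF.sub(repl, text), codes


def restore(text, codes):
    """Put the original entity references back in place of placeholders."""
    return PLACEHOLDER.sub(lambda m: codes[int(m.group(1)) - 1], text)


if __name__ == "__main__":
    protected, codes = protect("test &abcdef; text")
    print(protected)  # the translator sees the placeholder, not the entity
    print(restore("texte {1} test", codes))
```

The translator can move the placeholder but cannot alter the reference, and the reference is written back unescaped. The sketch ignores the case where the source text already contains a literal "{1}"; a real implementation would need a collision-free marker.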

Original comment by tingley on 7 Mar 2014 at 7:19

GoogleCodeExporter commented 9 years ago
It makes sense to treat it as an inline code. We should have a type for 
this added to our global types: "entity".

J

Original comment by jhargrav...@gmail.com on 7 Mar 2014 at 7:24

GoogleCodeExporter commented 9 years ago
+1. The XML Filter does that by default.
It has an option to expand the entities otherwise, but that should be very 
rarely used.

Original comment by yves.sav...@gmail.com on 7 Mar 2014 at 7:56

GoogleCodeExporter commented 9 years ago
Hi, tingley. I think processing such entities the way you described would be a 
good approach.

Original comment by rhong...@gmail.com on 8 Mar 2014 at 12:48

GoogleCodeExporter commented 9 years ago
I fixed this issue:

<?xml version="1.0" ?>
<!DOCTYPE root [
  <!ENTITY abcdef "ABCDEF">
]><!ENTITY abcdef "ABCDEF">
<root>
<p>test &abcdef; text</p>
</root>

Original comment by jhargrav...@gmail.com on 12 Mar 2014 at 8:48

GoogleCodeExporter commented 9 years ago
Jim, can we resolve this?

Original comment by tingley on 7 Apr 2014 at 11:53

GoogleCodeExporter commented 9 years ago
You mean the second problem of escaping the entity reference? I won't be able 
to get to this for a while. Maybe in a week or so.

Original comment by jhargrav...@gmail.com on 8 Apr 2014 at 1:23

GoogleCodeExporter commented 9 years ago
Oh, my mistake. I meant: could we close the bug? I didn't realize there were two 
separate issues.

Original comment by tingley on 8 Apr 2014 at 2:02