computerline1z / okapi

Automatically exported from code.google.com/p/okapi

HTML markup exposed for translation #335

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
See the attached test case.  The expected behavior when extracting text with 
okf_html is to get a single segment that says "Translate this."  Instead, you 
get two segments, one of which is nothing but HTML markup:

<source xml:lang="en">&lt;table style="font-family: Arial; 
font-size:13px;padding-left:23px;" width="240px"; height="230px"; border=0; 
padding=0; margin=0; cellspacing=0; cellpadding=0; bgcolor="#cc0044"></source>

The reason for this is that the HTML is malformed (the semicolons after the 
attribute values seem to be the culprits).  However, it's only mildly malformed, 
and the page still renders fine in modern browsers.  Jericho keeps an internal 
error counter that tracks how many errors it has encountered in a single tag.  
If the count passes a certain threshold, Jericho gives up on parsing the tag and 
emits it as text.  That's what's happening here.

The threshold in Jericho is configurable, and a simple improvement might be to 
set that threshold to something higher, so that it doesn't give up so easily on 
questionable tags.
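
To make the thresholding idea concrete, here is a minimal, self-contained sketch. It is not Jericho's parser and does not use Jericho's API: the class `TagErrorSketch` and its methods are hypothetical, and the only "error" it recognises is a stray semicolon directly after a closing double quote, which is my assumption about what trips the counter in the test case.

```java
// Illustrative model only: NOT Jericho's parser, just the per-tag error threshold idea.
final class TagErrorSketch {

    /** Counts stray semicolons that directly follow a closing double quote. */
    static int countErrors(String tag) {
        int errors = 0;
        boolean inQuotes = false;
        for (int i = 0; i < tag.length(); i++) {
            char c = tag.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // entering or leaving a quoted attribute value
            } else if (c == ';' && !inQuotes && i > 0 && tag.charAt(i - 1) == '"') {
                errors++; // semicolon right after a quoted value, e.g. width="240px";
            }
        }
        return errors;
    }

    /** True if the tag stays markup; false if it would be given up on and emitted as text. */
    static boolean keptAsTag(String tag, int maxErrors) {
        return countErrors(tag) <= maxErrors;
    }

    public static void main(String[] args) {
        String tag = "<table style=\"font-family: Arial; font-size:13px;padding-left:23px;\" "
                + "width=\"240px\"; height=\"230px\"; border=0; padding=0; margin=0; "
                + "cellspacing=0; cellpadding=0; bgcolor=\"#cc0044\">";
        System.out.println("errors counted: " + countErrors(tag));
        System.out.println("kept as tag with threshold 1: " + keptAsTag(tag, 1));
        System.out.println("kept as tag with threshold 5: " + keptAsTag(tag, 5));
    }
}
```

Under this toy model the tag from the test case counts 2 errors, so it is rejected with a threshold of 1 and kept with a threshold of 5. The real counts and the real default threshold in Jericho may differ. If I remember the Jericho Javadoc correctly, the actual setting is the static `Attributes.setDefaultMaxErrorCount(int)`, but that should be checked against the Jericho version Okapi ships with before relying on it.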

Original issue reported on code.google.com by tingley on 8 May 2013 at 5:34
