See attached testcase. The expected behavior when extracting text with
okf_html is to get a single segment that says "Translate this." However, you
actually get two segments, one of which is just a bunch of HTML markup:
<source xml:lang="en"><table style="font-family: Arial;
font-size:13px;padding-left:23px;" width="240px"; height="230px"; border=0;
padding=0; margin=0; cellspacing=0; cellpadding=0; bgcolor="#cc0044"></source>
The reason for this is that the HTML is malformed (the stray semicolons after
the attribute values seem to be the culprit). However, it's not badly malformed, and
the page still renders fine in modern browsers. Jericho has an internal error
counter that tracks how many errors have been encountered in a single tag. If
the error count passes a certain threshold, Jericho gives up and emits the tag
as text. That's what's happening here.
The threshold in Jericho is configurable, and a simple improvement might be to
set that threshold to something higher, so that it doesn't give up so easily on
questionable tags.
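
To make the threshold behaviour concrete, here is a minimal, self-contained sketch of the idea. It is a toy model, not Jericho's code: the class `ErrorThresholdSketch`, its methods, and the rule for what counts as an "error" are all hypothetical and chosen only for illustration. In Jericho itself the relevant setting is, as far as I recall, the static `Attributes.setDefaultMaxErrorCount(int)`; treat that name as an assumption to check against the Jericho javadoc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Toy model of a per-tag error counter with a configurable threshold.
// This is NOT Jericho's implementation; all names and rules are illustrative.
public class ErrorThresholdSketch {

    // Attribute area of the <table> tag from this report, joined onto one line.
    static final String SAMPLE =
            "style=\"font-family: Arial; font-size:13px;padding-left:23px;\" "
            + "width=\"240px\"; height=\"230px\"; border=0; padding=0; margin=0; "
            + "cellspacing=0; cellpadding=0; bgcolor=\"#cc0044\"";

    // Well-formed in this toy model: name, name=value, name="value" or name='value'.
    private static final Pattern WELL_FORMED = Pattern.compile(
            "[A-Za-z][A-Za-z0-9:-]*(=(\"[^\"]*\"|'[^']*'|[^\\s\"';=]+))?");

    // Split on whitespace, but keep whitespace that appears inside quotes.
    static List<String> splitAttributes(String attributeText) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0; // 0 means "not inside a quoted value"
        for (char c : attributeText.toCharArray()) {
            if (quote != 0) {
                current.append(c);
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
                current.append(c);
            } else if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    // Every token that is not a well-formed attribute counts as one error.
    static int countErrors(String attributeText) {
        int errors = 0;
        for (String token : splitAttributes(attributeText)) {
            if (!WELL_FORMED.matcher(token).matches()) errors++;
        }
        return errors;
    }

    // The tag is kept as a tag only while the error count stays within the threshold.
    static boolean acceptedAsTag(String attributeText, int maxErrorCount) {
        return countErrors(attributeText) <= maxErrorCount;
    }

    public static void main(String[] args) {
        System.out.println("errors: " + countErrors(SAMPLE));
        System.out.println("accepted with threshold 2: " + acceptedAsTag(SAMPLE, 2));
        System.out.println("accepted with threshold 10: " + acceptedAsTag(SAMPLE, 10));
    }
}
```

In this toy model each attribute followed by a stray semicolon counts as one error, so the sample tag accumulates seven of them: it is rejected with a low threshold and accepted with a higher one. Jericho's own counting rules may well differ, so the right value for the real threshold would have to be found by testing against the attached testcase.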
Original issue reported on code.google.com by tingley on 8 May 2013 at 5:34