apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

HTMLStripCharFilter += HTML5 [LUCENE-5763] #6825

Open asfimport opened 10 years ago

asfimport commented 10 years ago

HTMLStripCharFilter knows some specific things about HTML4 (like named character entities, which are converted to the corresponding characters), but not about HTML5.

HTML5 has way more named character entities: 2,231 vs 259 by my count.

There's probably other stuff to do, e.g. there are new tags.


Migrated from LUCENE-5763 by Steven Rowe (@sarowe), updated Jun 19 2014

asfimport commented 10 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Apparently the HTML5 named character entity set is almost a superset of HTML4's, but not quite: &lang; and &rang; expand to different characters. I don't think this blocks switching; it's just something that needs to be documented. Some background here: https://www.w3.org/Bugs/Public/show_bug.cgi?id=14429
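To make the mismatch concrete, here is a minimal sketch of the two entities whose expansion differs between the two sets. The class `AngleEntityTables`, its two maps, and the `decode` helper are hypothetical names used for illustration only; they are not HTMLStripCharFilter's actual data structures, and leaving unknown entity names untouched is a choice of this sketch, not a statement about Lucene's behavior.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical lookup tables covering only the two entities that differ. */
final class AngleEntityTables {
    // HTML4 expansions: the angle brackets in the Miscellaneous Technical block.
    static final Map<String, Integer> HTML4 = Map.of("lang", 0x2329, "rang", 0x232A);
    // HTML5 expansions: the mathematical angle brackets.
    static final Map<String, Integer> HTML5 = Map.of("lang", 0x27E8, "rang", 0x27E9);

    // Matches a named entity such as "&lang;" and captures the bare name.
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces entities found in the table; names not in the table are left as-is. */
    static String decode(Map<String, Integer> table, String text) {
        return ENTITY.matcher(text).replaceAll(m -> {
            Integer cp = table.get(m.group(1));
            String out = (cp == null) ? m.group() : new String(Character.toChars(cp));
            return Matcher.quoteReplacement(out);
        });
    }

    public static void main(String[] args) {
        for (String name : new String[] {"lang", "rang"}) {
            System.out.printf("&%s; HTML4=U+%04X HTML5=U+%04X%n",
                    name, HTML4.get(name), HTML5.get(name));
        }
    }
}
```

The same input therefore decodes to different code points depending on which table is used, which is the behavior change that would need documenting.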

asfimport commented 10 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

On the &lang; and &rang; difference: Will a filter like ICUFoldingFilter reduce these to the plain ASCII < and > regardless of which entity interpretation is used? If so, it won't affect me ... and perhaps it might be something to mention in the HTMLStripCharFilter javadocs.

Would it be useful at all to have a config option for the HTML version?

asfimport commented 10 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

> On the &lang; and &rang; difference: Will a filter like ICUFoldingFilter reduce these to the plain ascii < and > regardless of which entity interpretation is used?

No. ICUFoldingFilter leaves the HTML5 &lang;/&rang; characters intact (left: U+27E8; right: U+27E9), but folds the HTML4 ones (left: U+2329; right: U+232A) to the full-width CJK angle brackets U+3008 and U+3009, respectively. This 2007 WHATWG email mentions that earlier drafts of HTML5 mapped &lang;/&rang; to these full-width CJK characters.

And ASCIIFoldingFilter doesn't cover either of the blocks in question, so it wouldn't fold any of these characters.

For text search, typically punctuation like this is stripped as part of the tokenization process, so I don't see the folding filters' deficits here as terribly problematic.
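As a rough illustration of why the choice of bracket rarely matters for matching, the sketch below splits text on anything that is not a letter or digit. It is a deliberately simplified stand-in, not Lucene's StandardTokenizer or any other real tokenizer; the class name `PunctuationDroppingTokenizer` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a tokenizer: keeps runs of letters and digits only. */
final class PunctuationDroppingTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any non-alphanumeric code point ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        });
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // HTML4-style and HTML5-style brackets around the same words.
        System.out.println(tokenize("\u2329foo bar\u232A"));
        System.out.println(tokenize("\u27E8foo bar\u27E9"));
    }
}
```

Under this simplification both inputs yield the tokens `foo` and `bar`, so neither bracket variant reaches the index.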

asfimport commented 10 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

> Would it be useful at all to have a config option for the HTML version?

I don't think so. This filter is generally used on HTML you don't control (hence the ability to handle non-well-formed content), so it seems very unlikely that people will know which HTML version they should target. And I don't think we should have a mode where we output the HTML4 versions (left: U+2329; right: U+232A), because these characters are described in the Unicode specification as deprecated. From http://www.unicode.org/charts/PDF/U2300.pdf:

Deprecated angle brackets

These characters are deprecated and are strongly discouraged for mathematical use because of their canonical equivalence to CJK punctuation.

2329 〈 LEFT-POINTING ANGLE BRACKET
[...]
232A 〉 RIGHT-POINTING ANGLE BRACKET
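The canonical equivalence mentioned in the chart can be checked with the JDK's own `java.text.Normalizer`, without any Lucene or ICU dependency. This is only a sketch of the Unicode behavior; it says nothing about how ICUFoldingFilter is implemented, and the class name `DeprecatedAngleBrackets` is hypothetical.

```java
import java.text.Normalizer;

/** Shows how NFC normalization treats the HTML4 and HTML5 angle brackets. */
final class DeprecatedAngleBrackets {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Formats the first code point of s as U+XXXX.
    static String hex(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // First two: HTML4 expansions. Last two: HTML5 expansions.
        String[] samples = {"\u2329", "\u232A", "\u27E8", "\u27E9"};
        for (String s : samples) {
            System.out.println(hex(s) + " -> " + hex(nfc(s)));
        }
    }
}
```

Normalization rewrites U+2329 and U+232A to U+3008 and U+3009, while U+27E8 and U+27E9 come back unchanged, so text containing the HTML4 expansions does not survive a normalization step intact.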