Steven Rowe (@sarowe) (migrated from JIRA)
Patch implementing the idea.

This patch just creates a new class named `JFlexHTMLStripCharFilter`. All of the `HTMLStripCharFilterTest` tests are copied over to `JFlexHTMLStripCharFilterTest`; Robert's random test is un-`@Ignore`'d, and several more tests are added.
Steven Rowe (@sarowe) (migrated from JIRA)
Left to do:

- Rename `HTMLStripCharFilter` to `ClassicHTMLStripCharFilter`.
- Rename `JFlexHTMLStripCharFilter` to `HTMLStripCharFilterImpl`.
- Add a `Version`-aware `HTMLStripCharFilter` (like `StandardTokenizer`) that will use `ClassicHTMLStripCharFilter` for `Version.LUCENE_35` and earlier, and `HTMLStripCharFilterImpl` otherwise.

This would be the first back-compat enabled `CharFilter`. Should the existing `HTMLStripCharFilter` instead vanish off the face of the earth?
Steven Rowe (@sarowe) (migrated from JIRA)
This JFlex-based `HTMLStripCharFilter` replacement will fix #3284.
Steven Rowe (@sarowe) (migrated from JIRA)
One other thing left to do: I added javadocs to `BaseCharFilter.addOffCorrectMap()`, but I think they should be moved to the package javadocs for `o.a.l.analysis.charfilter`, and maybe add some examples. (This package has basically zero documentation currently.)
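For those package docs, the core idea behind `addOffCorrectMap()` could be illustrated with a small sketch like the following. This is an illustrative stand-in, not Lucene's actual implementation: the class name and the `TreeMap` representation are assumptions, but the cumulative-difference mapping (original offset = output offset + correction) is the concept the javadocs describe.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of the cumulative-offset idea behind BaseCharFilter.addOffCorrectMap():
// each entry says "from this position in the *output* onward, add this many chars
// to recover the position in the *original* input".
public class OffsetCorrectionSketch {
    // output position -> cumulative correction (original = output + correction)
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    void addOffCorrectMap(int outputOffset, int cumulativeDiff) {
        corrections.put(outputOffset, cumulativeDiff);
    }

    int correctOffset(int outputOffset) {
        // The last correction registered at or before this output offset applies.
        Map.Entry<Integer, Integer> e = corrections.floorEntry(outputOffset);
        return outputOffset + (e == null ? 0 : e.getValue());
    }

    public static void main(String[] args) {
        // Stripping "<b>" (3 chars) from "foo<b>bar" produces "foobar";
        // output positions 3 and beyond map back to original positions 6 and beyond.
        OffsetCorrectionSketch s = new OffsetCorrectionSketch();
        s.addOffCorrectMap(3, 3);
        System.out.println(s.correctOffset(0)); // 0 ('f' is unchanged)
        System.out.println(s.correctOffset(3)); // 6 ('b' of "bar" in the original)
    }
}
```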
Steven Rowe (@sarowe) (migrated from JIRA)
Added a `<![CDATA[...]]>` section test, and fixed the CDATA section handling.
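For illustration, the intended CDATA behavior (drop the delimiters, keep the enclosed text verbatim, even if it looks like markup) can be sketched with plain string operations. This is a naive stand-in for illustration only, not the JFlex scanner's actual logic, and it ignores nesting and other edge cases:

```java
// Naive sketch of expected CDATA handling: strip the <![CDATA[ ... ]]> delimiters
// but keep the enclosed content verbatim (CDATA content is literal text).
public class CdataStripSketch {
    static String stripCdata(String input) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            int start = input.indexOf("<![CDATA[", pos);
            if (start < 0) { out.append(input, pos, input.length()); break; }
            out.append(input, pos, start);          // text before the section
            int end = input.indexOf("]]>", start + 9);
            if (end < 0) { out.append(input, start, input.length()); break; }
            out.append(input, start + 9, end);      // keep the CDATA content
            pos = end + 3;                          // skip past "]]>"
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripCdata("one<![CDATA[<b>raw</b>]]>two"));
        // one<b>raw</b>two
    }
}
```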
Brian Meidell (@brianmeidell) (migrated from JIRA)
Any idea how quickly this will be rolled into a new version?
Robert Muir (@rmuir) (migrated from JIRA)
This would be the first back-compat enabled CharFilter. Should the existing HTMLStripCharFilter instead vanish off the face of the earth?
What would be the motivation? Are there some features of the previous one that don't make sense in this implementation? I don't think fixing offsets bugs like #3284 counts as breaking index backwards compat, because it won't change search results. It will just prevent highlighters from throwing exceptions.
(LUCENE-3642 and #4742 fixed lots of offset bugs already, we didn't spend time on any back compat).
Robert Muir (@rmuir) (migrated from JIRA)
Also I don't know how `Version` etc. would work here, since the old `HtmlStripCharFilter` was never part of Lucene. From Lucene's perspective, it's a new feature.
Steven Rowe (@sarowe) (migrated from JIRA)
Here's another to-do, from comment-12625835 in SOLR-42: handle MS-Word-generated broken processing instructions (`<? ... />` instead of `<? ... ?>`).
Steven Rowe (@sarowe) (migrated from JIRA)
This would be the first back-compat enabled CharFilter.
What would be the motivation?

There are differences in the behavior, but I guess all of these could be characterized as bug fixes:

- No space is substituted for inline tags (e.g. `<b>`, `<i>`, `<span>`). The old version substitutes spaces for all tags; people have complained, e.g. on SOLR-1343.
- MS-Word-generated broken processing instructions (`<? ... />`) will be handled.

Are there some features of the previous one that don't make sense in this implementation?

No, not as far as I can tell. I think all features of the previous one are included.
Also I don't know how `Version` etc. would work here, since the old `HtmlStripCharFilter` was never part of Lucene. From Lucene's perspective, it's a new feature.
Good point. Should I make it a new Lucene feature on 3.X? That is, should I remove Solr's HTMLStripCharFilter and have it refer to a new Lucene HTMLStripCharFilter?
Steven Rowe (@sarowe) (migrated from JIRA)
AFAICT, SOLR-2891 will be fixed by this implementation.
Robert Muir (@rmuir) (migrated from JIRA)
There are differences in the behavior, but I guess all of these could be characterized as bug fixes:
This sounds awesome!
Good point. Should I make it a new Lucene feature on 3.X? That is, should I remove Solr's HTMLStripCharFilter and have it refer to a new Lucene HTMLStripCharFilter?
I think so.
If you want to, you can make it package-private + deprecated inside o.a.solr.analysis. The Solr factory could respect the luceneMatchVersion parameter for backwards compatibility, and instantiate the old version in that case.
So you would then move the old charfilter from modules/analysis to the same place in trunk, too... it's purely a Solr back-compat issue. For Lucene users it's just a new feature and they don't see any of this.
This is what we did with Synonyms.
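A minimal sketch of that `luceneMatchVersion` toggle, with stand-in names — the `Version` enum, `implFor` helper, and class-name strings here are simplified placeholders for illustration, not the real Lucene/Solr API:

```java
// Sketch: pick the char filter implementation from luceneMatchVersion,
// as the Solr factory would. All names here are illustrative stand-ins.
public class VersionToggleSketch {
    enum Version { LUCENE_35, LUCENE_36 }

    static String implFor(Version luceneMatchVersion) {
        // 3.5 and earlier keep the old (deprecated, package-private) implementation;
        // later versions get the new JFlex-based filter.
        return luceneMatchVersion.compareTo(Version.LUCENE_36) < 0
            ? "ClassicHTMLStripCharFilter"
            : "HTMLStripCharFilter";
    }

    public static void main(String[] args) {
        System.out.println(implFor(Version.LUCENE_35)); // ClassicHTMLStripCharFilter
        System.out.println(implFor(Version.LUCENE_36)); // HTMLStripCharFilter
    }
}
```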
Steven Rowe (@sarowe) (migrated from JIRA)
This patch contains a feature-complete version. Changes from the previous patch:

- Changed the `[:XID_Start:]` and `[:XID_Continue:]` Unicode properties to the more relaxed `[:ID_Start:]` and `[:ID_Continue:]` properties, in order to broaden the range of recognizable input. (The improved security afforded by the `XID_*` properties is irrelevant to what a `CharFilter` does.)
- MS-Word-generated broken processing instructions (`<? ... />` instead of `<? ... ?>`) are now accepted.

Left to do:

- Move the javadocs from `BaseCharFilter.addOffCorrectMap()` to an `o.a.l.analysis.charfilter` package level javadoc file.
- Rename `HTMLStripCharFilter` to `ClassicHTMLStripCharFilter`; move it to Solr's `o.a.s.analysis` package; deprecate it; and make it package private.
- Rename `JFlexHTMLStripCharFilter` to `HTMLStripCharFilter`.
- Provide back compatibility (even though `HTMLStripCharFilter` has never been released as part of Lucene) by making `HTMLStripCharFilterFactory` instantiate `ClassicHTMLStripCharFilter` if the `luceneMatchVersion` parameter is `LUCENE_35` or earlier, and otherwise instantiate the new `HTMLStripCharFilter`.

Steven Rowe (@sarowe) (migrated from JIRA)
Forgot to mention two more changes in the latest patch from the previous version:

- The scanner now tells the `read()` method whether the next character should come directly from the scanner – `OutputSource.DIRECT` – or from another location – `OutputSource.INDIRECT` – the `read()` method would then have to go fetch the next character from the given source.
- Renamed `yylex()` to `nextChar()`.

Chris M. Hostetter (@hossman) (migrated from JIRA)
I don't think fixing offsets bugs like #3284 counts as breaking index backwards compat, because it won't change search results.
It will just prevent highlighters from throwing exceptions.
FWIW: If i understand the issue correctly, then the one risk i can imagine here is that people don't reindex, and get the new behavior for new docs, so they'll get diff behavior at query time depending on when the doc is re-indexed. that seems significant enough to definitely warrant the luceneMatchVersion toggle sarowe has on his todo list – which seems fairly straight forward.
The only concern i really have is...
A JFlex-based HTMLStripCharFilter replacement would be more performant...
..before deprecating "ClassicHTMLStripCharFilter" we should actually test that the JFlex version is faster ... because if it winds up being noticeably slower in some cases, some people may prefer the "classic" mode to the JFlex mode if the "warts" of the existing one don't affect them – in which case i might almost suggest actually using multiple factories in solr instead of making it versionMatch dependent.
(fingers crossed it's a non-issue)
Yonik Seeley (@yonik) (migrated from JIRA)
Let's be conservative and keep around HTMLStripCharFilter, and name this one something else? The original was meant to handle all sorts of bad, partial, and weird input. Or are we really confident that this implementation handles everything the current one does?
Robert Muir (@rmuir) (migrated from JIRA)
Let's be conservative and keep around HTMLStripCharFilter, and name this one something else? The original was meant to handle all sorts of bad, partial, and weird input.
I call bullshit. It fails on all kinds of bad/partial/and weird input.
In fact I'll go remove my `@Ignore` now. Let's see if anyone can fix the bugs in it. That should settle this.
Steven Rowe (@sarowe) (migrated from JIRA)
are we really confident that this implementation handles everything the current one does?
Yes, this implementation's bad/partial/weird HTML handling capabilities are a superset of the current implementation. The tests for the new implementation are a superset of the old implementation's tests.
I welcome more examples of junk HTML to add to the tests :)
Steven Rowe (@sarowe) (migrated from JIRA)
The only concern i really have is...
A JFlex-based HTMLStripCharFilter replacement would be more performant...
..before deprecating "ClassicHTMLStripCharFilter" we should actually test that the Jlex version is faster ...
+1. I'll do this before committing anything.
Dawid Weiss (@dweiss) (migrated from JIRA)
I welcome more examples of junk HTML to add to the tests
Julien Nioche (Nutch) may have access to realistic large HTML crawls... there's nothing wilder and weirder than real-life HTML :)
Chris M. Hostetter (@hossman) (migrated from JIRA)
+1. I'll do this before committing anything.
i wouldn't be shy about committing the new impl + tests, i would just wait to change the solr factory default behavior until we prove the perf is as good as the existing one in some common cases, and if it's not, then re-evaluate the names of the classes.
and by common cases i'm thinking...

- some test docs using typical wellformed html markup
- some test docs that contain almost no HTML at all
- some test docs using malformed markup that require backtracking
Yonik Seeley (@yonik) (migrated from JIRA)
The tests for the new implementation are a superset of the old implementation's tests.
Unfortunately I'm not sure how much of a story the tests tell (and yes, that would be my fault ;-) My memory is rusty, but back in '05 when I coded this thing, I threw a lot of stuff we had lying around CNET at it, and also a lot of stuff downloaded from the web (which I couldn't just copy-n-paste into a unit test obviously). I had a heck of a time handling all the weird stuff that could appear inside script tags, for example, and I don't think I see much of a test for that (again... my fault.)
I welcome more examples of junk HTML to add to the tests
Not saying the new one isn't great (and matching a lot of crap from the old one is quite an achievement). One can be sure that the current implementation doesn't always do the right thing, but unfortunately "right" isn't well defined here considering the domain. The cost to keeping around the current version for a little while seems minimal.
Steven Rowe (@sarowe) (migrated from JIRA)
I had a heck of a time handling all the weird stuff that could appear inside script tags, for example, and I don't think I see much of a test for that (again... my fault.)
I added tests with the following snippets:

```
one<script><!-- <!--#config comment="<!-- \"comment\"-->"--> --></script>two
   => 'one\ntwo'
one<script attr= bare><!-- action('<!-- comment -->', "\"-->\""); --></script>two
   => 'one\ntwo'
hello<script><!-- f('<!--internal--></script>'); --></script>
   => 'hello\n'
```
One can be sure that the current implementation doesn't always do the right thing, but unfortunately "right" isn't well defined here considering the domain.
I agree - the "right" thing IMHO is: get as much content from as varied a range of sources as possible, and never ever allow the input to bork processing.
The cost to keeping around the current version for a little while seems minimal.
My proposal would do this, though under a different name, and requiring a luceneMatchVersion of 3.5 or earlier for the factory to use it. Do you object to this?
I have two issues with not switching over now:
Steven Rowe (@sarowe) (migrated from JIRA)
From #lucene-dev IRC:
```
yonik_: sarowe: if we change the name of the current html strip stuff, it seems like doing anything with luceneMatchVersion isn't needed (or overkill?)
sarowe: but then people will need to take action to use the (non-broken) new impl
sarowe: me no likey
yonik_: but if the name is changed, they won't use it by accident
sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation
yonik_: yeah
sarowe: via a new/different name
sarowe: I'm definitely okay with that
```
Yonik Seeley (@yonik) (migrated from JIRA)
In fact I'll go remove my `@Ignore` now. Let's see if anyone can fix the bugs in it. That should settle this.
I see you've done this. Purposely breaking the build really isn't necessary to make a point about a known bug.
Chris M. Hostetter (@hossman) (migrated from JIRA)
sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation
right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealer's choice...
Steven Rowe (@sarowe) (migrated from JIRA)
some test docs using malformed markup that require backtracking
I set up a quick test for this in both of the test classes using output from the synthetic broken HTML generator o.a.l.util._TestUtil.randomHtmlishString()
and ran 100K iterations of it - here's the HTMLStripCharFilterTest version:
```java
public void testRandomBrokenHTML() throws Exception {
  int maxNumElements = 10000;
  String text = _TestUtil.randomHtmlishString(random, maxNumElements);
  Reader reader
      = new HTMLStripCharFilter(CharReader.get(new StringReader(text)));
  while (reader.read() != -1);
}
```
Best/worst of 5 (as reported by Ant for the individual test, rather than for the entire Ant invocation):
- `HTMLStripCharFilter`: best: 73.4 sec, worst: 76.5 sec
- `JFlexHTMLStripCharFilter`: best: 73.5 sec, worst: 76.0 sec

I'm going to increase the evilness of `_TestUtil.randomHtmlishString()` and re-run to see if the numbers shift.
Steven Rowe (@sarowe) (migrated from JIRA)
I'm going to increase the evilness of `_TestUtil.randomHtmlishString()` and re-run to see if the numbers shift.
I did this, and it uncovered a bug in handling of Server Side Includes in `JFlexHTMLStripCharFilter` - hooray for evil tests.
The timings, this time for 10K iterations:
- `HTMLStripCharFilter`: best: 48.6 sec, worst: 49.7 sec
- `JFlexHTMLStripCharFilter`: best: 15.5 sec, worst: 17.3 sec

Steven Rowe (@sarowe) (migrated from JIRA)
some test docs that contain almost no HTML at all
I used the following to test the zero-markup case in both test classes (the `JFlexHTMLStripCharFilter` version is given below):
```java
public void testRandomText() throws Exception {
  StringBuilder text = new StringBuilder();
  int minNumWords = 10;
  int maxNumWords = 10000;
  int minWordLength = 3;
  int maxWordLength = 20;
  int numWords = _TestUtil.nextInt(random, minNumWords, maxNumWords);
  switch (_TestUtil.nextInt(random, 0, 4)) {
    case 0: {
      for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
        text.append(_TestUtil.randomUnicodeString(random, maxWordLength));
        text.append(' ');
      }
      break;
    }
    case 1: {
      for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
        text.append(_TestUtil.randomRealisticUnicodeString
            (random, minWordLength, maxWordLength));
        text.append(' ');
      }
      break;
    }
    default: { // ASCII 50% of the time
      for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
        text.append(_TestUtil.randomSimpleString(random));
        text.append(' ');
      }
    }
  }
  Reader reader = new JFlexHTMLStripCharFilter
      (CharReader.get(new StringReader(text.toString())));
  while (reader.read() != -1);
}
```
The results for 10K iterations (best/worst of 5):
- `HTMLStripCharFilter`: best: 23.7 sec, worst: 24.8 sec
- `JFlexHTMLStripCharFilter`: best: 22.0 sec, worst: 24.7 sec

Steven Rowe (@sarowe) (migrated from JIRA)
some test docs using typical wellformed html markup
I have access to ClueWeb09. For performance testing I used the first WARC file for the English and Chinese languages (`en0000/00.warc.gz` and `zh0000/00.warc.gz`), each of which when uncompressed contains about 1GB of text (including a small amount of non-HTML metadata: WARC information and HTTP headers). The English WARC contains about 35,000 documents from about 2,100 unique domains. The Chinese WARC contains about 33,000 documents from about 550 unique domains.
I compared `JFlexHTMLStripCharFilter`'s output with that of `HTMLStripCharFilter` for several hundred documents. In the course of this comparison, I found several problems with the JFlex implementation (e.g. no `<STYLE>` tag handling; no MS conditional tag handling, e.g. `<![if ! IE]>`; and some problems handling creative attribute values), which the attached patch fixes. I re-ran the text-only and malformed HTML performance tests on the final implementation, and the numbers aren't significantly different from those prior to these fixes. The new patch also contains the more-evil `_TestUtil.randomHtmlishString()`; shifts the `CharFilter` javadocs from `BaseCharFilter.addOffCorrectMapping()` to `package.html`; and adds several more tests to `JFlexHTMLStripCharFilterTest.java`.
I have attached the three classes I used to test performance over the ClueWeb09 subset. `BaselineWarcTest.java` uses the WarcRecord class supplied with the ClueWeb09 collection to read the compressed WARC files; looks for a declared charset first in each document's content in the Content-Type `<meta>` tag, and then in the HTTP header; feeds this charset, if any, to the ICU4J charset detector, which instantiates a Reader using the detected charset; and then `read()`'s all content. The other two classes add the respective CharFilter on top of `BaselineWarcTest`'s functionality.
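The declared-charset lookup order described above can be sketched as follows. The class name, regex, and the UTF-8 fallback are illustrative assumptions, and the ICU4J `CharsetDetector` step (which the real harness uses to verify the declared name) is elided here in favor of the JDK's `Charset.forName`:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the declared-charset lookup order: 1) Content-Type <meta> tag in the
// document body, 2) HTTP Content-Type header, 3) fall back to a default.
public class CharsetLookupSketch {
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    static Charset declaredCharset(String htmlBody, String httpContentType) {
        for (String source : new String[] { htmlBody, httpContentType }) {
            if (source == null) continue;
            Matcher m = CHARSET.matcher(source);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (Exception ignored) { /* bogus declared charset: keep looking */ }
            }
        }
        return StandardCharsets.UTF_8; // arbitrary fallback for this sketch
    }

    public static void main(String[] args) {
        String body = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=GB2312\">";
        // The in-document declaration wins over the HTTP header.
        System.out.println(declaredCharset(body, "text/html; charset=ISO-8859-1"));
    }
}
```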
The performance numbers (best of 5 trials):
| Language | Baseline | Classic | JFlex |
|---|---|---|---|
| English | 156s | 179s | 171s |
| Chinese | 155s | 180s | 172s |
Excluding charset detection and I/O (measured by `BaselineWarcTest`), `JFlexHTMLStripCharFilter` appears to improve on `HTMLStripCharFilter`'s throughput by about 50% in both languages.
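As a sanity check on the ~50% figure, subtract the baseline (charset detection + I/O) from each total and compare the filter-only times, using the numbers from the table above:

```java
// Filter-only speedup = (classic - baseline) / (jflex - baseline).
public class ThroughputCheck {
    static double speedup(double baseline, double classic, double jflex) {
        return (classic - baseline) / (jflex - baseline);
    }

    public static void main(String[] args) {
        System.out.printf("English: %.2fx%n", speedup(156, 179, 171)); // 23s vs 15s ≈ 1.53x
        System.out.printf("Chinese: %.2fx%n", speedup(155, 180, 172)); // 25s vs 17s ≈ 1.47x
    }
}
```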
I found a few problems with `HTMLStripCharFilter`:

- It can throw:

```
java.io.IOException: Mark invalid
	at java.io.BufferedReader.reset(BufferedReader.java:485)
	at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69)
	at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171)
	at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
	at HTMLStripCharFilterWarcTest.main(HTMLStripCharFilterWarcTest.java:86)
```

- `&apos;` is not decoded.

Left to do:

- Rename `HTMLStripCharFilter` to `ClassicHTMLStripCharFilter`; move it to Solr's `o.a.s.analysis` package; deprecate it; and create a new Solr Factory for it.

Chris M. Hostetter (@hossman) (migrated from JIRA)
+1 ... to everything.
Steven Rowe (@sarowe) (migrated from JIRA)
AFAICT, SOLR-2891 will be fixed by this implementation.
I misspoke, having misread that issue - despite the reference to `HTMLStripCharFilter` in the most recent comment on the issue, SOLR-2891 is not about `HTMLStripCharFilter`.
Steven Rowe (@sarowe) (migrated from JIRA)
Here is the final patch.
sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation

right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealer's choice...

https://svn.apache.org/viewvc?view=revision&revision=658003
I took the same approach - here are the changes from the previous version of the patch:

- The current `HTMLStripCharFilter` implementation is moved to Solr, renamed to `LegacyHTMLStripCharFilter`, and deprecated, and a Factory is added for it.
- `JFlexHTMLStripCharFilter` is renamed to `HTMLStripCharFilter`.
- `HTMLStripCharFilter`'s "escapedTags" functionality is added to `HTMLStripCharFilterFactory`.
- Added `TestHTMLStripCharFilterFactory`.
- `CHANGES.txt` entries are added.

Run the following svn copy script before applying the patch:

```
svn cp modules/analysis/common/src/java/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilter.java
svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/htmlStripReaderTest.html solr/core/src/test/org/apache/solr/analysis/
svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterTest.java solr/core/src/test/org/apache/solr/analysis/LegacyHTMLStripCharFilterTest.java
svn cp solr/core/src/java/org/apache/solr/analysis/HTMLStripCharFilterFactory.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilterFactory.java
```
I plan to commit to trunk shortly, then backport and commit to branch_3x.
Robert Muir (@rmuir) (migrated from JIRA)
Here's a standalone testcase for the fail from jenkins
Robert Muir (@rmuir) (migrated from JIRA)
By the way: "expected" in that test is wrong. It's just the same input string, to trigger the assert in MockTokenizer...
Steven Rowe (@sarowe) (migrated from JIRA)
Patch (excluding the re-generated .java scanner) that addresses the unpaired surrogate numeric character entity failures uncovered by random testing, by outputting REPLACEMENT CHARACTER U+FFFD, and adds the ability to interpret properly paired UTF-16 surrogates as an above-BMP codepoint. Added tests to cover all four combinations of hex & decimal surrogate numeric character entities in surrogate pairs.
Also added `@SuppressWarnings("fallthrough")` to the JFlex-generated scanner class, so that the 40+ warnings about switch case fall-throughs don't clutter the output.
Edit: committing to trunk shortly.
Steven Rowe (@sarowe) (migrated from JIRA)
Committed the fixed UTF-16 numeric character reference surrogates handling to trunk in r1234687.
Steven Rowe (@sarowe) (migrated from JIRA)
Thanks Robert for your help diagnosing and fixing the surrogates problem!
Steven Rowe (@sarowe) (migrated from JIRA)
Backported to branch_3x.
A JFlex-based HTMLStripCharFilter replacement would be more performant and easier to understand and maintain.
Migrated from LUCENE-3690 by Steven Rowe (@sarowe), 3 votes, resolved Jan 24 2012

Attachments: BaselineWarcTest.java, HTMLStripCharFilterWarcTest.java, jenkins_test.patch, JFlexHTMLStripCharFilterWarcTest.java, LUCENE-3690.patch (versions: 5), LUCENE-3690-handle-utf16-surrogates.patch

Linked issues: #3284