JFlex-based HTMLStripCharFilter replacement [LUCENE-3690]

asfimport commented 12 years ago

A JFlex-based HTMLStripCharFilter replacement would be more performant and easier to understand and maintain.

Migrated from LUCENE-3690 by Steven Rowe (@sarowe), 3 votes, resolved Jan 24 2012 Attachments: BaselineWarcTest.java, HTMLStripCharFilterWarcTest.java, jenkins_test.patch, JFlexHTMLStripCharFilterWarcTest.java, LUCENE-3690.patch (versions: 5), LUCENE-3690-handle-utf16-surrogates.patch Linked issues:

SOLR-882
- 3284
- SOLR-42

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Patch implementing the idea.

This patch just creates a new class named JFlexHTMLStripCharFilter. All of the HTMLStripCharFilterTest tests are copied over to JFlexHTMLStripCharFilterTest; Robert's random test is un-@Ignore'd, and several more tests are added.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Left to do:

Accept supplementary characters in HTML tag names.
Recognize uppercase character entity variants for "quot", "copy", "gt", "lt", "reg", and "amp" (from Dawid Weiss's SOLR-882 patch)
Rename current HTMLStripCharFilter to ClassicHTMLStripCharFilter
Rename JFlexStripCharFilter to HTMLStripCharFilterImpl
Make a back-compat-enabling wrapper class (like StandardTokenizer) that will use ClassicHTMLStripCharFilter for Version.LUCENE_35 and earlier, and HTMLStripCharFilterImpl otherwise.

This would be the first back-compat enabled CharFilter. Should the existing HTMLStripCharFilter instead vanish off the face of the earth?

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

This JFlex-based HTMLStripCharFilter replacement will fix #3284.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

One other thing left to do: I added javadocs to BaseCharFilter.addOffCorrectMap(), but I think they should be moved to the package javadocs for o.a.l.analysis.charfilter, and maybe add some examples. (This package has basically zero documentation currently.)

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Added a <!\[CDATA\[...\]\]> section test, and fixed the CDATA section handling.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Robert's random test is un-@Ignore'd

+1!

asfimport commented 12 years ago

Brian Meidell (@brianmeidell) (migrated from JIRA)

Any idea how quickly this will be rolled into a new version?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This would be the first back-compat enabled CharFilter. Should the existing HTMLStripCharFilter instead vanish off the face of the earth?

What would be the motivation? Are there some features of the previous one that don't make sense in this implementation? I don't think fixing offsets bugs like #3284 counts as breaking index backwards compat, because it won't change search results. It will just prevent highlighters from throwing exceptions.

(LUCENE-3642 and #4742 fixed lots of offset bugs already, we didn't spend time on any back compat).

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also I don't know how Version etc would work here, since the old HtmlStripCharFilter was never part of lucene.

from lucene's perspective, its a new feature.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Here's another to-do, from comment-12625835 in SOLR-42: handle MS-Word-generated broken processing instructions <? ... /> (instead of <? ... ?>.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

This would be the first back-compat enabled CharFilter.

What would be the motivation?

There are differences in the behavior, but I guess all of these could be characterized as bug fixes:

Supplementary characters in tags will be recognized. The old version doesn't do this.
CDATA sections are recognized. The old version doesn't; people have requested this, e.g. http://www.lucidimagination.com/search/document/48fcd906e39764ec#48fcd906e39764ec)
No space is substituted for inline tags (e.g. <b>, <i>, <span>). The old version substitutes spaces for all tags; people have complained e.g. on SOLR-1343
Broken MS-Word-generated processing instructions <? ... /> will be handled.
Uppercase character entities "quot", "copy", "gt", "lt", "reg", and "amp" will be recognized (from Dawid Weiss's SOLR-882 patch); the old version doesn't do this.

Are there some features of the previous one that don't make sense in this implementation?

No, not as far as I can tell. I think all features of the previous one are included.

Also I don't know how Version etc would work here, since the old HtmlStripCharFilter was never part of lucene. from lucene's perspective, its a new feature.

Good point. Should I make it a new Lucene feature on 3.X? That is, should I remove Solr's HTMLStripCharFilter and have it refer to a new Lucene HTMLStripCharFilter?

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

AFAICT, SOLR-2891 will be fixed by this implementation.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There are differences in the behavior, but I guess all of these could be characterized as bug fixes:

This sounds awesome!

Good point. Should I make it a new Lucene feature on 3.X? That is, should I remove Solr's HTMLStripCharFilter and have it refer to a new Lucene HTMLStripCharFilter?

I think so.

If you want to, you can make it package-private + deprecated inside o.a.solr.analysis. The Solr factory could respect the luceneMatchVersion parameter for backwards compatibility, and instantiate the old version in that case.

So you would then move the old charfilter from modules/analysis to the same place in trunk, too... its purely a solr back-compat issue. For lucene users its just a new feature and they don't see any of this.

This is what we did with Synonyms.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

This patch contains a feature-complete version. Changes from the previous patch:

Now substituting newlines instead of spaces for block-level elements; this corresponds more closely to on-screen layout, enables sentence segmentation, and doesn't change offsets.
Supplementary characters are now accepted in tags.
Switched accepted tag names from [:XID_Start:] and [:XID_Continue:] Unicode properties to the more relaxed [:ID_Start:] and [:ID_Continue:] properties, in order to broaden the range of recognizable input. (The improved security afforded by the XID_* properties is irrelevant to what a CharFilter does.)
Uppercase character entity variants for "quot", "copy", "gt", "lt", "reg", and "amp" (from Dawid Weiss's SOLR-882 patch) are now accepted.
MS-Word-generated broken processing instructions (<? ... /> instead of <? ... ?>) are now accepted.
Added several tests, including parsing a full MS-Word-2010-generated HTML file.

Left to do:

Move javadocs from BaseCharFilter.addOffCorrectMap() to o.a.l.analysis.charfilter package level javadoc file.
Rename the existing HTMLStripCharFilter to ClassicHTMLStripCharFilter; move it to Solr o.a.s.analysis package; deprecate it; and make it package private.
Rename JFlexHTMLStripCharFilter to HTMLStripCharFilter.
Enable Solr back-compat (but not Lucene back-compat, since HTMLStripCharFilter has never been released as part of Lucene) by making HTMLStripCharFilterFactory instantiate ClassicHTMLStripCharFilter if the luceneMatchVersion parameter is LUCENE_35 or earlier, and otherwise instantiate the new HTMLStripCharFilter.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Forgot to mention two more change in the latest patch from the previous version:

The generated scanner's parse method now returns the next available character if one is available; this simplifies/clarifies the processing flow. (Previously the parse method returned an enum indicating whether to copy a char directly from the input buffer – OutputSource.DIRECT – or from another location – OutputSource.INDIRECT – the read() method would then have to go fetch the next character from the given source.)
Renamed the generated scanner's parse method from the default yylex() to nextChar().

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I don't think fixing offsets bugs like #3284 counts as breaking index backwards compat, because it won't change search results.

bq. It will just prevent highlighters from throwing exceptions.

FWIW: If i understand the issue correctly, then the one risk i can imagine here is that people don't reindex, and get the new behavior for new docs, so they'll get diff behavior are query time depending on when the doc is re-indexed. that seems significant enough to definitely warrant the luceneMatchVersion toggle sarowe has on his todo list – which seems fairly straight forward.

The only concern i really have is...

A JFlex-based HTMLStripCharFilter replacement would be more performant...

..before deprecating "ClassicHTMLStripCharFilter" we should actually test that the Jlex version is faster ... because if it winds up being noticible slower in some cases, some people may prefer the the "classic" mode to the JFlex mode if the "warts" of the existing one don't affect them – in which case i might almost suggest actually using multiple factories in solr instead of making it versionMatch dependent.

(fingers crossed it's a non-issue)

asfimport commented 12 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Let's be conservative and keep around HTMLStripCharFilter, and name this one something else? The original was meant to handle all sorts of bad, partial, and weird input. Or are we really confident that this implementation handles everything the current one does?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Let's be conservative and keep around HTMLStripCharFilter, and name this one something else? The original was meant to handle all sorts of bad, partial, and weird input.

I call bullshit. It fails on all kinds of bad/partial/and weird input.

In fact ill go remove my @Ignore now. Lets see if anyone can fix the bugs in it. That should settle this.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

are we really confident that this implementation handles everything the current one does?

Yes, this implementation's bad/partial/weird HTML handling capabilities are a superset of the current implementation. The tests for the new implementation are a superset of the old implementation's tests.

I welcome more examples of junk HTML to add to the tests :)

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

The only concern i really have is...

A JFlex-based HTMLStripCharFilter replacement would be more performant...

..before deprecating "ClassicHTMLStripCharFilter" we should actually test that the Jlex version is faster ...

+1. I'll do this before committing anything.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I welcome more examples of junk HTML to add to the tests

Julien Nioche (Nutch) may have access to realistic large HTML crawls... there's nothing wilder and weirder than real-life HTML :)

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

+1. I'll do this before committing anything.

i wouldn't be shy about committing the new impl + tests, i would just wait to change the solr factory default behavior until we prove the perf is as good as the existing one in some common cases, and if it's not, then re-evaluate the names of the classes.

and by common cases i'm thinking...

some test docs using typical wellformed html markup
some test docs using malformed markup that require backtracking
some test docs that contain almost no HTML at all (this is the one that i have a hunch may be a big differentiator – i've seen lots of people who use the HTML stripper not becuase they expect HTML, but because they want to be sure it doesn't get indexed if some stray html encoding sneaks into their data)

asfimport commented 12 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

The tests for the new implementation are a superset of the old implementation's tests.

Unfortunately I'm not sure how much of a story the tests tell (and yes, that would be my fault ;-) My memory is rusty, but back in '05 when I coded this thing, I threw a lot stuff we had lying around CNET at it, and also a lot of stuff downloaded from the web (which I couldn't just copy-n-paste into a unit test obviously). I had a heck of a time handling all the weird stuff that could appear inside script tags, for example, and I don't think I see much of a test for that (again... my fault.)

I welcome more examples of junk HTML to add to the tests

Not saying the new one isn't great (and matching a lot of crap from the old one is quite an achievement). One can be sure that the current implementation doesn't always do the right thing, but unfortunately "right" isn't well defined here considering the domain. The cost to keeping around the current version for a little while seems minimal.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I had a heck of a time handling all the weird stuff that could appear inside script tags, for example, and I don't think I see much of a test for that (again... my fault.)

I added tests with the following snippets:

one<script><!-- <!--#config comment="<!-- \"comment\"-->"--> --></script>two

=> 'one
two'

one<script attr= bare><!-- action('<!-- comment -->', "\"-->\""); --></script>two

=> 'one
two'

hello<script><!-- f('<!--internal--></script>'); --></script>

=> 'hello
'

One can be sure that the current implementation doesn't always do the right thing, but unfortunately "right" isn't well defined here considering the domain.

I agree - the "right" thing IMHO is: get as much content from as varied a range of sources as possible, and never ever allow the input to bork processing.

The cost to keeping around the current version for a little while seems minimal.

My proposal would do this, though under a different name, and requiring a luceneMatchVersion of 3.5 or earlier for the factory to use it. Do you object to this?

I have two issues with not switching over now:

It's a chicken and egg problem: how will people know if there is a problem with the new implementation if they don't use it?
The current version has several long standing bugs that nobody has fixed. I personally wouldn't attempt it with the current implementation: it's difficult to understand. This is one of my main motivations for making this new version: when people find issues, fixing them should be much easier with the new implementation.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

From #lucene-dev IRC:

yonik_: sarowe: if we change the name of the current html strip stuff, it seems like doing anything with luceneMatchVersion isn't needed (or overkill?)

sarowe: but then people will need to take action to use the (non-broken) new impl me no likey

yonik_: but if the name is changed, they won't use it by accident

sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation

yonik_: yeah

sarowe: via a new/different name I'm definitely okay with that

asfimport commented 12 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

In fact ill go remove my @Ignore now. Lets see if anyone can fix the bugs in it. That should settle this.

I see you've done this. Purposely breaking the build really isn't necessary to make a point about a known bug.

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation

right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealers choice...

https://svn.apache.org/viewvc?view=revision&revision=658003

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

some test docs using malformed markup that require backtracking

I set up a quick test for this in both of the test classes using output from the synthetic broken HTML generator o.a.l.util._TestUtil.randomHtmlishString() and ran 100K iterations of it - here's the HTMLStripCharFilterTest version:

  public void testRandomBrokenHTML() throws Exception {
    int maxNumElements = 10000;
    String text = _TestUtil.randomHtmlishString(random, maxNumElements);
    Reader reader
        = new HTMLStripCharFilter(CharReader.get(new StringReader(text)));
    while (reader.read() != -1);
  }

Best/worst of 5 (as reported by Ant for the individual test, rather than for the entire Ant invocation):

HTMLStripCharFilter: best: 73.4 sec, worst: 76.5 sec
JFlexHTMLStripCharFilter: best: 73.5 sec, worst: 76.0 sec

I'm going to increase the evilness of _TestUtil.randomHtmlishString() and re-run to see if the numbers shift.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I'm going to increase the evilness of {{_TestUtil.randomHtmlishString()}} and re-run to see if the numbers shift.

I did this, and it uncovered a bug in handling of Server Side Includes in JFlexHTMLStripCharFilter - hooray for evil tests.

The timings, this time for 10K iterations:

HTMLStripCharFilter: best: 48.6 sec, worst: 49.7 sec
JFlexHTMLStripCharFilter: best: 15.5 sec, worst: 17.3 sec

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

some test docs that contain almost no HTML at all

I used the following to test the zero-markup case in both test classes (the JFlexHTMLStripCharFilter version is given below):

  public void testRandomText() throws Exception {
    StringBuilder text = new StringBuilder();
    int minNumWords = 10;
    int maxNumWords = 10000;
    int minWordLength = 3;
    int maxWordLength = 20;
    int numWords = _TestUtil.nextInt(random, minNumWords, maxNumWords);
    switch (_TestUtil.nextInt(random, 0, 4)) {
      case 0: {
        for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
          text.append(_TestUtil.randomUnicodeString(random, maxWordLength));
          text.append(' ');
        }
        break;
      }
      case 1: {
        for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
          text.append(_TestUtil.randomRealisticUnicodeString
              (random, minWordLength, maxWordLength));
          text.append(' ');
        }
        break;
      }
      default: { // ASCII 50% of the time
        for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
          text.append(_TestUtil.randomSimpleString(random));
          text.append(' ');
        }
      }
    }
    Reader reader = new JFlexHTMLStripCharFilter
        (CharReader.get(new StringReader(text.toString())));
    while (reader.read() != -1);
  }

The results for 10K iterations (best/worst of 5):

HTMLStripCharFilter: best: 23.7 sec, worst: 24.8 sec
JFlexHTMLStripCharFilter: best: 22.0 sec, worst: 24.7 sec

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

some test docs using typical wellformed html markup

I have access to ClueWeb09. For performance testing I used the first WARC file for the English and Chinese languages (en0000/00.warc.gz and zh0000/00.warc.gz), each of which when uncompressed contains about 1GB of text (including a small amount of non-HTML metadata: WARC information and HTTP headers). The English WARC contains about 35,000 documents from about 2,100 unique domains. The Chinese WARC contains about 33,000 documents from about 550 unique domains.

I compared JFlexHTMLStripCharFilter's output with that of HTMLStripCharFilter for several hundred documents. In the course of this comparison, I found several problems with the JFlex implementation (e.g. no <STYLE> tag handling; no MS conditional tag handling, e.g. <!\[if ! IE]>; and some problems handling creative attribute values), which the attached patch fixes. I re-ran the text-only and malformed HTML performance tests on the final implementation, and the numbers aren't significantly different from those prior to these fixes. The new patch also contains the more-evil _TestUtils.randomHtmlishString(); shifts the CharFilter javadocs from BaseCharFilter.addOffCorrectMapping() to package.html; and adds several more tests to JFlexHTMLStripCharFilterTest.java.

I have attached the three classes I used to test performance over the ClueWeb09 subset. BaselineWarcTest.java uses the WarcRecord class supplied with the ClueWeb09 collection to read the compressed WARC files; looks for a declared charset first in each document's content in the Content-Type <meta> tag, and then in the HTTP header; feeds this charset, if any, to the ICU4J charset detector, which instantiates a Reader using the detected charset; and then read()'s all content. The other two classes add the respective CharFilter on top of BaselineWarcTest's functionality.

The performance numbers (best of 5 trials):

Language	Baseline	Classic	JFlex
English	156s	179s	171s
Chinese	155s	180s	172s

Excluding charset detection and I/O (measured by BaselineWarcTest), JFlexHTMLStripCharFilter appears to improve on HTMLStripCharFilter's throughput by about 50% in both languages.

I found a few problems with HTMLStripCharFilter:

The following exception was thrown for six of the English documents:

java.io.IOException: Mark invalid
       at java.io.BufferedReader.reset(BufferedReader.java:485)
       at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69)
       at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171)
       at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
       at HTMLStripCharFilterWarcTest.main(HTMLStripCharFilterWarcTest.java:86)

&apos; is not decoded.
Content between some <script> tags is not stripped out.
Unbalanced quotation marks in opening tags cause the tag to not be stripped out.

Left to do:

Rename HTMLStripCharFilter to ClassicHTMLStripCharFilter; move it to Solr o.a.s.analysis package; deprecate it; and create a new Solr Factory for it.
Rename JFlexHTMLStripCharFilter to HTMLStripCharFilter.
Commit to trunk
Backport and commit to branch_3x.

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

+1 ... to everything.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

AFAICT, SOLR-2891 will be fixed by this implementation.

I misspoke, having misread that issue - despite the reference to HTMLStripCharFilter in the most recent comment on the issue, SOLR-2891 is not about HTMLStripCharFilter.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Here is the final patch.

sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation

right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealers choice...

https://svn\.apache\.org/viewvc?view=revision&revision=658003

I took the same approach - here are the changes from the previous version of the patch:

The previous HTMLStripCharFilter implementation is moved to Solr, renamed to LegacyHTMLStripCharFilter, and deprecated, and a Factory is added for it.
JFlexHTMLStripCharFilter is renamed to HTMLStripCharFilter.
Support for HTMLStripCharFilter's "escapedTags" functionality is added to HTMLStripCharFilterFactory.
Added TestHTMLStripCharFilterFactory.
Solr and Lucene CHANGES.txt entries are added.

Run the following svn copy script before applying the patch:

svn cp modules/analysis/common/src/java/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilter.java
svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/htmlStripReaderTest.html solr/core/src/test/org/apache/solr/analysis/
svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterTest.java solr/core/src/test/org/apache/solr/analysis/LegacyHTMLStripCharFilterTest.java
svn cp solr/core/src/java/org/apache/solr/analysis/HTMLStripCharFilterFactory.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilterFactory.java

I plan to commit to trunk shortly, then backport and commit to branch_3x.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's a standalone testcase for the fail from jenkins

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

By the way: "expected" in that test is wrong. its just the same input string to trigger the assert in MockTokenizer...

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Patch (excluding the re-generated .java scanner) that addresses the unpaired surrogate numeric character entity failures uncovered by random testing, by outputting REPLACEMENT CHARACTER U+FFFD, and adds the ability to interpret properly paired UTF-16 surrogates as an above-BMP codepoint. Added tests to cover all four combinations of hex & decimal surrogate numeric character entities in surrogate pairs.

Also added @SuppressWarnings("fallthrough") to the JFlex-generated scanner class, so that the 40+ warnings about switch case fall-throughs don't clutter the output.

Edit: committing to trunk shortly.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Committed the fixed UTF-16 numeric character reference surrogates handling to trunk in r1234687.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Thanks Robert for your help diagnosing and fixing the surrogates problem!

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Backported to branch_3x.

apache / lucene

JFlex-based HTMLStripCharFilter replacement [LUCENE-3690] #4764

3284