apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size [LUCENE-7525] #8576

Open asfimport opened 7 years ago

asfimport commented 7 years ago

The ASCIIFoldingFilter.foldToASCII method contains an enormous switch statement and is too large for the HotSpot compiler to compile, causing a performance problem.

The method is about 13KB compiled, versus HotSpot's 8KB limit, so splitting the method in half works around the problem.

In my tests splitting the method in half resulted in a 5X performance increase.

In the test code below you can see how slow the fold method is, even when it takes the shortcut for characters below 0x80, compared to an inline implementation of the same shortcut.

So a workaround is to split the method. I'm happy to provide a patch. It's a hack, of course. Perhaps using the MappingCharFilterFactory with an input file as per SOLR-2013 would be a better replacement for this method in this class?

```java
import org.junit.Assert;
import org.junit.Test;

public class ASCIIFoldingFilterPerformanceTest {

    private static final int ITERATIONS = 1_000_000;

    @Test
    public void testFoldShortString() {
        char[] input = "testing".toCharArray();
        char[] output = new char[input.length * 4];

        for (int i = 0; i < ITERATIONS; i++) {
            ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
        }
    }

    @Test
    public void testFoldShortAccentedString() {
        char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
        char[] output = new char[input.length * 4];

        for (int i = 0; i < ITERATIONS; i++) {
            ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
        }
    }

    @Test
    public void testManualFoldTinyString() {
        char[] input = "t".toCharArray();
        char[] output = new char[input.length * 4];

        for (int i = 0; i < ITERATIONS; i++) {
            int k = 0;
            for (int j = 0; j < input.length; ++j) {
                final char c = input[j];
                if (c < '\u0080') {
                    output[k++] = c;
                } else {
                    Assert.fail();
                }
            }
        }
    }
}
```

Migrated from LUCENE-7525 by Karl von Randow, 1 vote, updated Mar 26 2017 Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, LUCENE-7525.patch (versions: 2), TestASCIIFolding.java

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This is probably one of the most used token filters, so +1 to explore if/how we can make it faster. Maybe we should add an analyzer that has case folding enabled to http://people.apache.org/~mikemccand/lucenebench/analyzers.html.

asfimport commented 7 years ago

Michael Braun (@michaelbraun) (migrated from JIRA)

Was just profiling indexing yesterday on our side and noticed the exact same thing - would love to see the performance of this method improved!

asfimport commented 7 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I'd turn this huge switch statement into a different lookup structure, e.g. a HashMap or similar (possibly with Java 9 lambdas as values).
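A map-based lookup along these lines might look like the following sketch. This is a hypothetical illustration, not Lucene's actual code: only a handful of mappings are shown, and the `fold` helper is invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: replace the giant switch with a HashMap lookup.
// Only a few mappings are shown; the real filter has thousands.
public class HashMapFoldSketch {
    private static final Map<Character, String> FOLD = new HashMap<>();
    static {
        FOLD.put('é', "e");
        FOLD.put('ä', "a");
        FOLD.put('ø', "o");
        FOLD.put('ß', "ss");
        FOLD.put('ü', "u");
    }

    public static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < '\u0080') {          // ASCII shortcut, as in the filter
                out.append(c);
            } else {
                String repl = FOLD.get(c);
                out.append(repl != null ? repl : String.valueOf(c));
            }
        }
        return out.toString();
    }
}
```

A per-character `HashMap.get` does box the `char`, which is one reason the array-based alternatives discussed below the fold were also explored.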

asfimport commented 7 years ago

Karl von Randow (migrated from JIRA)

I have created a version that uses static arrays and a binary search. This performs similarly to splitting the existing code into two methods (dividing the switch cases between them and delegating to the second method from the first's default: block). That's not surprising, since a switch of that size (and breadth of values) compiles to a binary search anyway, just a native one.

I'll try to attach code examples of the two approaches. These are perhaps a simpler change than moving to a new lookup structure, and I think they will be more performant, since they amount to a binary search over an array.

The simplest change is the switch split. It will also be the most readable in the Git history (the method is literally split in two about half-way through the case statements), and it is the easiest to prove behaviourally identical to the current code.
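As a rough illustration of the switch-split idea (not the actual attached patch), each method stays under the JIT's size limit by delegating the remaining cases from the first method's default: branch:

```java
// Illustrative sketch of the switch-split workaround, not the real patch:
// the second half of the cases is reached through the first half's default.
public class SplitSwitchSketch {
    public static String fold(char c) {
        switch (c) {
            case 'é': return "e";
            case 'ä': return "a";
            default:
                return foldRest(c);   // delegate the remaining cases
        }
    }

    private static String foldRest(char c) {
        switch (c) {
            case 'ø': return "o";
            case 'ß': return "ss";
            default:
                return String.valueOf(c);   // unmapped chars pass through
        }
    }
}
```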

asfimport commented 7 years ago

Karl von Randow (migrated from JIRA)

An example of a binary search over an array data structure as an alternative for the large switch statement.
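The attached file isn't reproduced in this thread, but the general shape of such a structure might be as follows. The mapping data here is a tiny hypothetical subset, not the real table:

```java
import java.util.Arrays;

// Sketch of a binary-search lookup: a sorted array of source chars and a
// parallel array of replacements. The data is a tiny illustrative subset.
public class BinarySearchFoldSketch {
    // Must be sorted ascending by char value for Arrays.binarySearch.
    private static final char[] KEYS =
        {'\u00DF', '\u00E4', '\u00E9', '\u00F8', '\u00FC'}; // ß ä é ø ü
    private static final String[] REPLACEMENTS = {"ss", "a", "e", "o", "u"};

    public static String fold(char c) {
        int i = Arrays.binarySearch(KEYS, c);
        return i >= 0 ? REPLACEMENTS[i] : String.valueOf(c);
    }
}
```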

asfimport commented 7 years ago

Karl von Randow (migrated from JIRA)

Attached ASCIIFoldingFilter.java with an example of splitting the switch statement into two methods to come in under the compilation size limit.

asfimport commented 7 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Can the workings of ICUFoldingFilter give any insight here?

asfimport commented 7 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I'd suggest using the simple binary search approach, but without generated code: convert the large switch statement once into a simple text file and load it as a resource in a static initializer.

This would also allow extending the folding filter later, so people can use their own mappings by pointing it at an input stream!

asfimport commented 7 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe an FST, like MappingCharFilter?

asfimport commented 7 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I am not sure an FST wouldn't be using a sledgehammer to crack a nut :-) We just need a lookup for single ints in a for-loop, replacing those ints with a sequence of other ints.

I will check the ICU source code to see what they are doing.

asfimport commented 7 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think we can, for now, replace the large switch statement with a resource file. I'd have 2 ideas:

1. A UTF-8 encoded file with 2 columns: the first column is a single char, the second a series of replacements. I don't really like this approach, as it is very sensitive to corruption by editors and hard to commit correctly.
2. A simple file like `int => int,int,int // comment`. This is easy to parse and convert, but the downside is that the codepoints are harder to read (for that we have the comment).

The actual code that parses the file and converts it to the lookup table could be replaced easily afterwards. I'd start with a binary lookup as suggested (similar to the switch statement's internal implementation).
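A parser for the proposed `int => int,int,int // comment` format might look like the sketch below. The format details and class name are assumptions for illustration, not a committed design; in Lucene the contents would come from a classpath resource read in a static initializer, while here it simply parses a String.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser for a mapping resource in the proposed
// "int => int,int,int // comment" format.
public class MappingFileParserSketch {
    public static Map<Integer, int[]> parse(String contents) {
        Map<Integer, int[]> table = new HashMap<>();
        for (String line : contents.split("\n")) {
            int comment = line.indexOf("//");
            if (comment >= 0) line = line.substring(0, comment); // strip comment
            line = line.trim();
            if (line.isEmpty()) continue;                        // skip blanks
            String[] sides = line.split("=>");
            int key = Integer.parseInt(sides[0].trim());
            String[] parts = sides[1].trim().split(",");
            int[] replacement = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                replacement[i] = Integer.parseInt(parts[i].trim());
            }
            table.put(key, replacement);
        }
        return table;
    }
}
```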

asfimport commented 7 years ago

Alexandre Rafalovitch (@arafalov) (migrated from JIRA)

I thought the ASCIIFoldingFilter was basically a legacy filter and that it grew into MappingCharFilterFactory. Now we are talking about making the ASCII filter even more like MappingCharFilter, so why would we need both? I know one is a TokenFilter and the other a CharFilter, but I am not sure that distinction is sufficiently important for this discussion.

asfimport commented 7 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> I thought the ASCII Filter was basically a legacy filter and it grew into MappingCharFilter

I generally prefer ASCIIFoldingFilter because it gives you much more flexibility. The problem with MappingCharFilter is that you can only apply it before tokenization, and that is the major downside: if you have tokenizers that rely on language-specific behaviour (Asian languages like Japanese, Chinese, ...), it is a bad idea to remove accents before tokenization. The same goes for stemming: French stemming breaks if you remove accents first! So in most analysis chains, ASCIIFoldingFilter is the very last filter.

asfimport commented 7 years ago

Karl von Randow (migrated from JIRA)

It seemed to me that we could try to generalise the FST creation from MappingCharFilter and use it in both places, thereby retaining the existing differences between the two classes. I haven't tried that route, as the switch split was simple, minimal, and addressed the issue :-)

asfimport commented 7 years ago

Michael Braun (@michaelbraun) (migrated from JIRA)

We are still experiencing this - is there someone actively working on getting a solution committed?

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I tried to work on a minimal patch that addresses the performance issue. It reuses the existing slow method to build a conversion map at startup and then uses the conversion map at runtime. It seems to run an order of magnitude faster on my machine. I only see it as a short-term solution; I think Uwe's plan is better for the long term.
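The patch itself is attached to the issue; the general idea as described might be sketched like this, where `slowFold` is a tiny stand-in for the real switch-based method:

```java
// Sketch of the described approach: call the existing slow fold method once
// per character at class initialization to build a lookup table, then use
// only the table at runtime. slowFold stands in for the real switch method.
public class PrebuiltTableFoldSketch {
    // Sketch covers only the first 0x180 chars (Latin-1 + Latin Extended-A).
    private static final String[] TABLE = new String[0x180];
    static {
        for (char c = 0; c < TABLE.length; c++) {
            TABLE[c] = slowFold(c);
        }
    }

    // Stand-in for the huge switch-based foldToASCII.
    private static String slowFold(char c) {
        switch (c) {
            case 'é': return "e";
            case 'ß': return "ss";
            default: return String.valueOf(c);
        }
    }

    public static String fastFold(char c) {
        return c < TABLE.length ? TABLE[c] : slowFold(c);
    }
}
```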

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

> I think we can, for now, replace the large switch statement with a resource file. I'd have 2 ideas:
>
> 1. A UTF-8 encoded file with 2 columns: the first column is a single char, the second a series of replacements. I don't really like this approach, as it is very sensitive to corruption by editors and hard to commit correctly.
> 2. A simple file like `int => int,int,int // comment`. This is easy to parse and convert, but the downside is that the codepoints are harder to read (for that we have the comment).

I wrote a Perl script to create mapping-FoldToASCII.txt, which is usable with MappingCharFilter, from the ASCIIFoldingFilter code; the script is actually embedded in that file, which is included in several of Solr's example configsets, e.g. under solr/server/solr/configsets/sample_techproducts_configs/conf/. Maybe this file could be used directly? It's human-friendly, so it would allow for easy user customization.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Same patch, but with comments.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 @sarowe

asfimport commented 7 years ago

Michael Braun (@michaelbraun) (migrated from JIRA)

Should a new utility Char-To-Int map be created in lucene-core utils to store the char->int (combined chars) map?