dnrajugade / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0

CharMatcher inefficiencies #1223

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
There are some performance problems with `CharMatcher`.

1. Missing `setBits`.

The precomputation may take up to a third of a millisecond because the matcher
returned by `anyOf` lacks a `setBits` override. Adding the one-liner makes it
15+ times faster. The classes RangesMatcher, WHITESPACE, BREAKING_WHITESPACE,
ANY, and NONE could all easily implement `setBits` too, but this is less
important.
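
A sketch of the missing one-liner, assuming the `anyOf` matcher stores its
characters in a `char[]` field (the field name `chars` is illustrative):

```java
// Set exactly the matched chars instead of testing all 65536 char values,
// which is what the default setBits implementation does.
@Override
void setBits(java.util.BitSet table) {
  for (char c : chars) {
    table.set(c);
  }
}
```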

2. Problems of `SmallCharMatcher`.

2.1. Slow precomputation.

After fixing subissue 1, the precomputation of `SmallCharMatcher` is still
rather slow. For example, for GERMAN_59 (the set of all German letters; see the
attached benchmark) it tries every table size from 59 up to 128. This problem
is fairly inherent: due to the "birthday paradox", the search for a
collision-free table has negligible chances of success.
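
To see why, here is a back-of-the-envelope estimate under an idealized
uniform-hashing assumption (this is not Guava's actual hash function):

```java
// Probability that k chars land in k distinct slots of an m-slot table,
// assuming each char hashes to a uniformly random slot.
static double collisionFreeProbability(int k, int m) {
  double p = 1.0;
  for (int i = 0; i < k; i++) {
    p *= (double) (m - i) / m;
  }
  return p;
}
```

For k=59 and m=128 this is on the order of 1e-7, so the search practically
always runs all the way up to the full table size.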

2.2. Low success chances.

The attached `SmallMatcherDemo` shows that for more than 30 randomly chosen
chars, using `SmallCharMatcher` makes hardly any sense: you spend quite some
time in the precomputation just to most probably end up with `table.length=128`
and `reprobe=true`. It's even easy to create a contrived example where this
happens for only 13 chars (`reprobe=true` despite the resulting load factor of
0.1).

3. No need for `SmallCharMatcher.reprobe`.

When `SmallCharMatcher` fails to fit the chars without collisions, it falls
back to reprobing. But `SmallCharMatcher.from` should rather give up, as
`MediumCharMatcher` can do the job faster by using bitwise `&` instead of `%`
(with `reprobe=true` the modulus always equals 128).
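
Schematically, the difference between the two index computations (not Guava's
literal code):

```java
// Power-of-two table: the slot is one cheap AND (the MediumCharMatcher way).
int andSlot = hash & (table.length - 1);

// Reprobing path: an integer modulus, which is much more expensive on
// typical CPUs (and with reprobe=true the divisor is always 128 anyway).
int modSlot = hash % table.length;
```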

4. No savings from avoiding the `BitSetMatcher`.

Since Issue 1049, the old criterion for using the hashing matchers doesn't make 
much sense anymore. By using the hashing matchers you hope to trade speed for 
substantial memory savings, but the `BitSetMatcher` trims its table and thus 
often needs less memory than the others.

As an example, consider GERMAN_69 (the set of all German letters and digits). 
Here, totalChars=69 gets doubled and rounded up to the next power of two, thus 
the resulting `MediumCharMatcher` takes 256 chars, i.e., 512 bytes (plus some 
fixed overhead). As all the included characters lie below 256, the 
`BitSetMatcher` would trim its table to `256/64=4` longs, i.e., 32 bytes.
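
The arithmetic from the previous paragraph, spelled out (illustrative names;
this mirrors the description above, not Guava's internals):

```java
int totalChars = 69;
// Double and round up to the next power of two: 138 -> 256 slots.
int slots = Integer.highestOneBit(2 * totalChars - 1) << 1;   // 256
int hashTableBytes = slots * Character.BYTES;                 // 512 bytes

// All German letters and digits lie below 256, so 256 bits suffice.
int maxChar = 255;
int words = (maxChar >> 6) + 1;                               // 4 longs
int bitSetBytes = words * Long.BYTES;                         // 32 bytes
```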

So the created matcher is both over 20 times slower and (ignoring the fixed
overhead) 16 times bigger than necessary. This is a special case caused by
using Latin-1 only, but similar cases might be quite common, as Unicode was
designed so that the most common chars get the smallest code points (unless
you are dealing with a language exotic to us).

5. Bad performance of both hashing matchers.

Both hash-based matchers get slow due to a cause similar to Issue 1155:

> The problem is caused by 1. a big contiguous filled area of the table, and
> 2. the use of linear reprobing.
> Any lookup starting near the beginning of such a contiguous area takes a
> very long time, as it has to traverse to its end.

While such areas might be rare in case of a general `ImmutableSet`, with 
character ranges they might be quite common.

The solution used for `ImmutableSet` would work here too, but it'd be
relatively more costly (the fix for `ImmutableSet` was simply replacing
`Hashing.smear` by a comparably expensive function, whereas here it would mean
adding such a function). I hope to be able to offer a good solution soon.
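
As an illustration of "such a function", a well-known mixer of roughly the
required strength is the 32-bit finalizer of MurmurHash3 (an example of the
kind of function meant; the issue does not say which one was actually used for
`ImmutableSet`):

```java
// MurmurHash3 fmix32: a cheap avalanche step that breaks up the clustered
// hash codes which make linear reprobing degrade.
static int mix(int h) {
  h ^= h >>> 16;
  h *= 0x85ebca6b;
  h ^= h >>> 13;
  h *= 0xc2b2ae35;
  h ^= h >>> 16;
  return h;
}
```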

NOTES

The patch is meant to demonstrate the issue, which makes it much larger than
the proposed changes themselves need to be.

The benchmark has two parts: measuring the precomputation time, and measuring
the performance of `countIn(s)`, where `s` is a permutation of all 2**16
chars. The latter actually measures the speed of `matches`.
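
A minimal sketch of that second part (illustrative code, not the attached
benchmark itself):

```java
import java.util.Random;
import com.google.common.base.CharMatcher;

// Builds a string containing every char exactly once, in random order, and
// counts the matches; countIn calls matches(c) once per char.
static int countAllChars(CharMatcher matcher) {
  char[] all = new char[1 << 16];
  for (int i = 0; i < all.length; i++) {
    all[i] = (char) i;
  }
  Random rnd = new Random(42);                 // fixed seed, reproducible runs
  for (int i = all.length - 1; i > 0; i--) {   // Fisher-Yates shuffle
    int j = rnd.nextInt(i + 1);
    char tmp = all[i];
    all[i] = all[j];
    all[j] = tmp;
  }
  return matcher.countIn(new String(all));
}
```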

Original issue reported on code.google.com by Maaarti...@gmail.com on 3 Dec 2012 at 10:18


GoogleCodeExporter commented 9 years ago
I'm not convinced all of these are worth pursuing, largely because I'd tend to 
expect CharMatcher.precomputed() to be called at most once per CharMatcher.

I'll look into these, though.

Original comment by lowas...@google.com on 3 Dec 2012 at 10:21

GoogleCodeExporter commented 9 years ago
Only the first two subissues deal with the timing of the `precomputed()` 
method. I'm not claiming it's important, but the fix is shorter than this 
paragraph.

The other thing is that sometimes you *lose* performance after precomputation 
(e.g., for the set of all German letters you get a 5x slowdown).

Original comment by Maaarti...@gmail.com on 12 Dec 2012 at 8:01

GoogleCodeExporter commented 9 years ago
https://code.google.com/p/guava-libraries/source/detail?r=7218498c1e19a789a5064ae750db7e929e2e8287

Original comment by lowas...@google.com on 14 Dec 2012 at 6:19

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:13
