apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Unicode compression [LUCENE-1799] #2874

Open asfimport opened 15 years ago

asfimport commented 15 years ago

In LUCENE-1793, there is an off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer; the original supposition was that it provided a more compact index.

This led to the comment that a different or compressed encoding would be a generally useful feature.

BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provides its own implementation, a freely available, royalty-free license would need to be obtained.

SCSU is another Unicode compression algorithm that could be used.

An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the input) could be used.


Migrated from LUCENE-1799 by DM Smith, 2 votes, updated Nov 30 2013. Attachments: Benchmark.java (versions: 3), LUCENE-1779.patch, LUCENE-1799_big.patch, LUCENE-1799.patch (versions: 14)

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

I think right now this can be implemented as a delegating Directory.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Earwin, if implemented as a directory, we lose many of the advantages.

For example, if you are using BOCU-1, let's say with the Hindi language, see the stats here: http://unicode.org/notes/tn6/#Performance

Note: I have never measured bocu performance in practice.

I took a look at the flex indexing branch, and it appears this might be possible in the future through a codec...

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

> Earwin, if implemented as a directory, we lose many of the advantages.

Damn. I believed all strings pass through read/writeString() on IndexInput/Output. Naive. Well, one can patch UnicodeUtil, but the solution is no longer elegant. Waiting for flexible indexing, hoping it's gonna be flexible...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> Waiting for flexible indexing, hoping it's gonna be flexible...

It looked to me, at a glance, that some things would still be weird. Like TermVectors aren't "flexible" yet, so they wouldn't be BOCU-1. I do not know whether, in flexible indexing, it will be possible for a codec to change behavior like this... maybe someone knows if this is planned eventually or not?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Pretty simple though, isn't it? Just pull the vector reader/writer from the codec as well?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors.

Thanks Mike! As far as the encoding itself, BOCU-1 is available in the ICU library, so we do not need to implement it or deal with the conformance/patent stuff (to get the royalty-free patent license you must be "fully compliant", and they have already done this).

If this feature is desired, I think something like a Codec in contrib that encodes the index with BOCU-1 from ICU would be the best.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

By the way, here are even more details on BOCU, including more in-depth size and performance numbers, at least compared to the UTN: http://icu-project.org/repos/icu/icuhtml/trunk/design/conversion/bocu1/bocu1.html

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

> As far as the encoding itself, BOCU-1 is available in the ICU library

ICU's API requires the use of ByteBuffer and CharBuffer for input/output. And even if I missed some nice method, the encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> ICU's API requires the use of ByteBuffer and CharBuffer for input/output. And even if I missed some nice method, the encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.

Earwin, at least in ICU trunk you have the following (public) methods in the class com.ibm.icu.impl.BOCU:

public static int compress(String source, byte buffer[], int offset)
public static int getCompressionLength(String source) 
...

But I think this class only supports encoding, not decoding (it is only used by the Collation API for so-called BOCSU). For decoding, we might have to use the registered charset and ByteBuffer... unless there's another way.
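
For illustration only, here is a minimal sketch of both routes. It assumes icu4j (plus the icu4j-charset jar for the charset provider) on the classpath, and that the ICU build in use still exposes the internal methods quoted above; com.ibm.icu.impl.BOCU is an internal class whose output is the collation-oriented BOCSU variant, so it should not be expected to round-trip through the BOCU-1 charset decoder.

```java
import java.nio.charset.Charset;

import com.ibm.icu.impl.BOCU; // internal ICU class; encode-only, collation-oriented

public class BocuSketch {
  public static void main(String[] args) {
    String term = "\u0928\u092E\u0938\u094D\u0924\u0947"; // a Devanagari sample term

    // Route 1: the internal helper quoted above (encode only, BOCSU variant).
    byte[] buf = new byte[BOCU.getCompressionLength(term)];
    int len = BOCU.compress(term, buf, 0);
    System.out.println("BOCSU bytes: " + len);

    // Route 2: the registered charset, which supports both encode and decode.
    Charset bocu1 = Charset.forName("BOCU-1"); // needs the icu4j-charset provider
    byte[] encoded = term.getBytes(bocu1);
    String decoded = new String(encoded, bocu1);
    System.out.println("BOCU-1 bytes: " + encoded.length
        + ", round trip ok: " + decoded.equals(term));
  }
}
```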

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Earwin, I do not really like this implementation either.

It would of course be better to have something more suitable, similar to UnicodeUtil, plus you could ditch the library dependency. But then I guess we have to deal with this patent thing... I do not really know what is involved with that.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

> But then I guess we have to deal with this patent thing... I do not really know what is involved with that.

CPAN holds a BOCU-1 implementation, derived from the "Sample C code", with all the necessary copyrights and the patent mentioned, but there's no word of them formally obtaining a license. I'm not sure if this is okay or was just overlooked.

asfimport commented 14 years ago

DM Smith (migrated from JIRA)

The sample code is probably what is on this page: http://unicode.org/notes/tn6/#Sample_Code

From what I gather from reading the whole page: if we port the sample code and the test case and then demonstrate that all tests pass, we will be granted a license.

There's contact info at the bottom of the page for getting the license. Maybe contact them for clarification?

As the code is fairly small, I don't think it would be too hard to port. The trick is that the sample code appears to deal in 32-bit arrays, and we'd probably want a byte[].

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Attached is a simple prototype for encoding terms as BOCU-1.

While I don't think things like wildcards etc. will work due to the nature of BOCU-1, term and phrase queries should work fine; it maintains UTF-8 (code point) order, so sorting is fine, and range queries should work once we fix TermRangeQuery to use bytes.

The impl is probably a bit slow (it uses the charset API), as it's just for playing around.

Note: I didn't check the box because of the patent thing (not sure it even applies since I use the ICU impl here), but either way I don't want to involve myself with that.
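
As a side note on the order-preserving claim above, here is a hedged sketch (the class name and sample terms are made up; it relies on the icu4j-charset jar so that "BOCU-1" resolves as a java.nio Charset) that checks that the unsigned byte order of BOCU-1 encodings matches the code point order of the original terms:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class BocuOrderCheck {

  // Compare byte arrays as unsigned bytes, the way opaque term bytes are compared.
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    Charset bocu1 = Charset.forName("BOCU-1"); // provided by the icu4j-charset jar
    String[] terms = {"яблоко", "апельсин", "груша", "δοκιμή", "नमस्ते"};

    // For these BMP-only samples, natural String order equals code point order.
    Arrays.sort(terms);

    // Terms sorted by code point order should also be sorted by the unsigned
    // byte order of their BOCU-1 encodings.
    for (int i = 1; i < terms.length; i++) {
      byte[] prev = terms[i - 1].getBytes(bocu1);
      byte[] curr = terms[i].getBytes(bocu1);
      System.out.println(terms[i - 1] + " <= " + terms[i] + " : "
          + (compareUnsigned(prev, curr) <= 0));
    }
  }
}
```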

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

For correctness of the code: target.offset should be buffer.arrayOffset() + buffer.position(). In most cases position() will be 0, but assuming that is quite often an error: if you use limit() you also have to use position(), else it's inconsistent. arrayOffset() gives the offset corresponding to position = 0. And the length should be remaining() (for example, see the payloads contrib IdentityEncoder).

And the default factory could be a singleton...
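
A minimal sketch of that rule for the array-backed case; the target is Lucene's BytesRef (public bytes/offset/length fields), and the helper class/method names are made up for illustration:

```java
import java.nio.ByteBuffer;

import org.apache.lucene.util.BytesRef;

final class ByteBufferRefs {

  // Point a BytesRef at the encoded region of an array-backed ByteBuffer without
  // copying, following the arrayOffset() + position() / remaining() rule above.
  static void setFromArrayBackedBuffer(ByteBuffer buffer, BytesRef target) {
    assert buffer.hasArray();
    target.bytes = buffer.array();
    target.offset = buffer.arrayOffset() + buffer.position(); // not just arrayOffset()
    target.length = buffer.remaining();                       // not limit()
  }
}
```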

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, sure. If we were to implement this I wouldn't use NIO anyway; like I said, I don't plan on committing anything (unless something is figured out about the patent), but it might be useful to someone.

I tested this on some Hindi text:

encoding   .tii (terms index) size   .tis (terms dict) size
UTF-8      60,205                    3,740,329
BOCU-1     28,431                    2,168,407

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

And one thing more, in the non-array case: buffer.get(target.bytes, target.offset, limit); the target's offset should be set to 0 on all write operations to the BytesRef (see UnicodeUtil.UTF16toUTF8WithHash()). Otherwise the grow() before it does not resize correctly!
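
And a matching sketch for the non-array case, again with made-up helper names: the bytes are copied to offset 0 of a freshly grown array, so the grow() is sized consistently with what gets written.

```java
import java.nio.ByteBuffer;

import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.BytesRef;

final class ByteBufferCopy {

  // Copy the remaining bytes of a (possibly direct) ByteBuffer into the BytesRef,
  // writing at offset 0 so that growing target.bytes to remaining() is sufficient.
  static void copyFromBuffer(ByteBuffer buffer, BytesRef target) {
    int length = buffer.remaining();
    target.bytes = ArrayUtil.grow(target.bytes, length);
    target.offset = 0; // reset before writing, per the comment above
    target.length = length;
    buffer.get(target.bytes, 0, length);
  }
}
```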

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here is the policed one :-)

In my opinion something is better than nothing. The patents are not violated here, as we only use an abstract API and the string "BOCU-1". You can use the same code to encode in "ISO-8859-1".

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

One more violation. Now it's correct!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here is a heavily reusing variant.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The last one, which could be used with any charset.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here is a 100% legally valid implementation:

I added further improvements to the encoder itself:

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

A new patch that completely separates the BOCU factory from the implementation (which moves to common/miscellaneous). This has the following advantages:

TODO:

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is fabulous! And a great example of what's now possible w/ the cutover to opaque binary terms w/ flex – makes it easy to swap out how terms are encoded.

BOCU-1 is a much more compact encoding than UTF-8 for non-Latin languages.

This encoding would also naturally reduce the RAM required for the terms index and Terms/TermsIndex FieldCache (used when you sort by string field) as well, since Lucene just loads the [opaque] term bytes into RAM.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> You can use any Charset to encode your terms. The javadocs should only note that the byte[] order should be correct for range queries to work.

I don't think we should add support for any non-Unicode character sets.

> If you want your complete index e.g. in ISO-8859-1

I am 100% against doing this.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Is there any reason not to make BOCU-1 Lucene's default encoding?

UTF-8 penalizes non-English languages, and BOCU-1 does not, and it sounds like we expect little to no indexing or searching perf penalty (once we have a faster interface to BOCU-1, e.g. our own private impl, like UnicodeUtil).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> Is there any reason not to make BOCU-1 Lucene's default encoding?

In my opinion, just IBM :) But maybe we can make a strong implementation and they will approve it and grant us the royalty-free patent license:

http://unicode.org/notes/tn6/#Intellectual_Property

> UTF-8 penalizes non-English languages, and BOCU-1 does not, and it sounds like we expect little to no indexing or searching perf penalty (once we have a faster interface to BOCU-1, e.g. our own private impl, like UnicodeUtil).

I'd like to play with swapping it in as the default, just to see what problems (if any) there are, and to make sure all queries are supported, etc. I can upload a new patch that does it this way and we can play.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Is there any reason not to make BOCU-1 Lucene's default encoding?

> In my opinion, just IBM :)

But... ICU's license is compatible w/ ASL (I think), and it includes a working impl of BOCU-1, so aren't we in the clear here? I.e., we are free to take that impl, tweak it, add it to our sources, and include ICU's license in our LICENSE/NOTICE?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> But... ICU's license is compatible w/ ASL (I think), and it includes a working impl of BOCU-1, so aren't we in the clear here? I.e., we are free to take that impl, tweak it, add it to our sources, and include ICU's license in our LICENSE/NOTICE?

I don't know... personally I wouldn't feel comfortable committing something without getting guidance first. But we can explore the technical side with patches on this JIRA issue and not check the box, and I think this is all OK for now.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Attached is a really, really rough patch that sets BOCU-1 as the default encoding.

Beware: it's a work in progress, and a lot of the patch is auto-generated (Eclipse), so some things need to be reverted.

Most tests pass; the idea is to find bugs in tests etc. that abuse BytesRef or assume UTF-8 encoding, things like that.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

BTW, that patch is huge because I just sucked in the ICU charset stuff to have an implementation that works for testing...

It's not intended to ever stay that way, as we would just implement the stuff we need without this code, but it makes testing easier since you don't need any external jars or to muck with the build system at all.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Attached is a patch for the start of a "BOCUUtil" with UnicodeUtil-like methods.

For now I only implemented encode (and encodeWithHash):

I generated random strings with _TestUtil.randomRealisticUnicodeString and benchmarked, and the numbers are stable.

encoding   time to encode 20 million strings (ms)   number of encoded bytes
UTF-8      1,757                                    596,516,000
BOCU-1     1,968                                    250,202,000

So I think we get good compression, and good performance close to UTF-8 for encode. I'll work on decode now.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Slightly more optimized version of BOCU1 encode (but it's missing the hash variant).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Duh – that was some ancient wrong patch. This one should be right!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Just inlines the 2-byte diff case.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Inlines/unwinds the 3-byte cases. I think we can leave the 4-byte case as a for loop...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I ran tests; each one of Mike's optimizations speeds up the encode...

I think I agree with not unrolling the 4-byte case: the "diff" from the previous character has to be > 187659 [0x2DD0B], which is like pottery writings and oracle bone script... but the previous ones (2-byte, 3-byte) speed up CJK and other scripts and are very useful.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Removed some ifs for the positive unrolled cases.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I optimized the surrogate case here, moving it into the 'prev' calculation. Now we are faster than UTF-8 on average for encode.

encoding   time to encode 20 million strings (ms)   number of encoded bytes
UTF-8      1,756                                    596,516,000
BOCU-1     1,724                                    250,202,000

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Oops, forgot a check in the surrogate case.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here it is with a first stab at the decoder (it's correct against random ICU strings, but I didn't benchmark it yet).

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I took a stab at benchmarking encoding speed only, with some different languages. I encoded a word at a time (which is what happens at indexing time). I used some text from Wikipedia in different languages: English, German, French, Spanish, and Chinese. I used WhitespaceAnalyzer for the first four and StandardAnalyzer for the Chinese (but analysis speed is not measured).

                 English   German   French   Spanish   Chinese
UTF-8 size       1888      4550     4875     5123      4497
BOCU-1 size      1888      4610     4995     5249      4497
BOCU-1 slowdown  29%       39%      47%      61%       80%

I suspect that the StandardAnalyzer is spitting out individual CJK chars, hence the same size for BOCU-1 and UTF-8? I'll try to see if I can get SmartChineseAnalyzer working and re-run the Chinese test.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yonik, what were you benchmarking? I think you should benchmark overall indexing time, of which encode is just a blip (<1% of it).

And yes, since the start state is 0x40, the FIRST CJK char is a diff from 0x40, but any subsequent ones yield savings.

In general you won't get much compression for Chinese... I'd say max 25%. For Russian, Arabic, Hebrew, and Japanese it will do a lot better: max 40%. For Indian languages you tend to get about 50%.

I also don't know how you encoded a word at a time, because I get quite different results. I focused a lot on making 'single-byte diffs' fast (e.g. just a subtraction), and I think I do a lot better for English than the 160% described in http://unicode.org/notes/tn6/

Furthermore, UTF-8 is a complete no-op for English, so being a compression algorithm that is only 29% slower than a plain (byte) char cast is good in my book, but I don't measure 29% for English.

I don't think there is any problem in encode speed at all.

asfimport commented 14 years ago

Michael Busch (migrated from JIRA)

Yonik, can you give more details about how you ran your tests?

Was it an isolated string-encoding test, or does BOCU slow down overall indexing speed by 29%-80% (which would be hard to believe)?

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> Yonik, can you give more details about how you ran your tests?

I'm isolating encoding speed only (not analysis, not indexing, etc.) of tokens in different languages. So I took some text from Wikipedia, analyzed it to get a list of char[], then encoded each char[] in a loop. Only that last step is benchmarked, to isolate the encode performance. I'm certainly not claiming that indexing is n% slower.
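
A rough sketch of such an encode-only loop follows (the word list and iteration counts are placeholders; it goes through ICU's registered "BOCU-1" charset and allocates a new byte[] per call, which is exactly the overhead a dedicated BOCUUtil-style encoder avoids, so it will overstate the BOCU-1 cost):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class EncodeOnlyBench {

  // Encode every word `iters` times with the given charset; return the total byte
  // count so the JIT cannot eliminate the work.
  static long encodeAll(List<String> words, Charset cs, int iters) {
    long totalBytes = 0;
    for (int i = 0; i < iters; i++) {
      for (String w : words) {
        totalBytes += w.getBytes(cs).length;
      }
    }
    return totalBytes;
  }

  public static void main(String[] args) {
    // Placeholder tokens; in the real test these would come from analyzed Wikipedia text.
    List<String> words = List.of("поиск", "индекс", "терм", "запрос");
    Charset bocu1 = Charset.forName("BOCU-1"); // from the icu4j-charset jar
    int iters = 1_000_000;

    encodeAll(words, StandardCharsets.UTF_8, 10_000); // warm-up
    encodeAll(words, bocu1, 10_000);

    long t0 = System.nanoTime();
    long utf8Bytes = encodeAll(words, StandardCharsets.UTF_8, iters);
    long t1 = System.nanoTime();
    long bocuBytes = encodeAll(words, bocu1, iters);
    long t2 = System.nanoTime();

    System.out.printf("UTF-8 : %d ms, %d bytes%n", (t1 - t0) / 1_000_000, utf8Bytes);
    System.out.printf("BOCU-1: %d ms, %d bytes%n", (t2 - t1) / 1_000_000, bocuBytes);
  }
}
```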

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Attached is my benchmark for English text.

UTF-8: 15,530 ms; BOCU-1: 15,687 ms

Note: I use a Sun JVM 1.6.0_19 (64-bit).

Yonik, if you run this benchmark and find a problem with it, or it's slower on your machine, let me know your configuration, because I don't see the results you do.

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> In general you won't get much compression for Chinese... I'd say max 25%

Ah, OK. I just tried Russian, with a whitespace analyzer used to split, and did get a good size saving:

UTF-8 size = 11,056; BOCU-1 size = 6,810; BOCU-1 slowdown = 32%

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yonik, please see my issue.

The fact that we can encode 100 million terms in 15 seconds means any speed concern is meaningless (though I still insist something is wrong: either your benchmark, or it runs slower on your JDK or something, which we should try to improve).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The char[] -> byte[] encode time is a minuscule part of indexing time. And, in turn, indexing time is far less important than the impact on search performance. So... let's focus on search performance here.

Most queries are unaffected by the term encoding; it's only AutomatonQuery (= fuzzy, regexp, wildcard) that does a fair amount of decoding...

Net/net BOCU1 sounds like an awesome win over UTF8.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I just insist there is no real difference between this and UTF-8 for encoding English...