Open asfimport opened 15 years ago
Earwin Burrfoot (migrated from JIRA)
I think right now this can be implemented as a delegating Directory.
Robert Muir (@rmuir) (migrated from JIRA)
Earwin, if implemented as a directory, we lose many of the advantages.
For example, if you are using BOCU-1, lets say with Hindi language, then according to the stats here: http://unicode.org/notes/tn6/#Performance
Note: I have never measured bocu performance in practice.
I took a look at the flex indexing branch and this appears like this might be possible in the future thru a codec...
Earwin Burrfoot (migrated from JIRA)
> Earwin, if implemented as a directory, we lose many of the advantages. Damn. I believed all strings pass through read/writeString() on IndexInput/Output. Naive. Well, one can patch UnicodeUtil, but the solution is no longer elegant. Waiting for flexible indexing, hoping it's gonna be flexible..
Robert Muir (@rmuir) (migrated from JIRA)
Waiting for flexible indexing, hoping it's gonna be flexible..
it looked to me, at a glance that some things would still be wierd. like TermVectors aren't "flexible" yet, so wouldn't be BOCU-1. I do not know if in flexible indexing, it will be possible for a codec to change behavior like this... maybe someone knows if this is planned eventually or not?
Michael McCandless (@mikemccand) (migrated from JIRA)
The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors.
Mark Miller (@markrmiller) (migrated from JIRA)
pretty simple though, isnt it? Just pull the vector reader/writer from the codec as well?
Robert Muir (@rmuir) (migrated from JIRA)
The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors.
Thanks Mike! as far as the encoding itself, BOCU-1 is available in the ICU library, so we do not need to implement it and deal with the conformance/patent stuff (To get the royalty-free patent you must be "fully compliant", they have already done this).
If this feature is desired, I think something like a Codec in contrib that encodes the index with BOCU-1 from ICU would be the best.
Robert Muir (@rmuir) (migrated from JIRA)
by the way, here are even more details on BOCU, including more in-depth size and performance, at least compared to the UTN: http://icu-project.org/repos/icu/icuhtml/trunk/design/conversion/bocu1/bocu1.html
Earwin Burrfoot (migrated from JIRA)
as far as the encoding itself, BOCU-1 is available in the ICU library
ICU's API requires to use ByteBuffer and CharBuffer for input/output. And even if I missed some nice method, encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.
Robert Muir (@rmuir) (migrated from JIRA)
ICU's API requires to use ByteBuffer and CharBuffer for input/output. And even if I missed some nice method, encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.
Earwin, at least in ICU trunk you have the following (public class) in com.ibm.icu.impl.BOCU:
public static int compress(String source, byte buffer[], int offset)
public static int getCompressionLength(String source)
...
But I think this class only supports encoding, not decoding (only used by Collation API for so called BOCSU). For decoding, we might have to use registered charset and ByteBuffer... unless theres another way.
Robert Muir (@rmuir) (migrated from JIRA)
Earwin, i do not really like this implementation either.
So it would be of course better to have something more suitable similar to UnicodeUtil, plus you could ditch the lib dependency. but then i guess we have to deal with this patent thing... i do not really know what is involved with that.
Earwin Burrfoot (migrated from JIRA)
but then i guess we have to deal with this patent thing... i do not really know what is involved with that.
CPAN holds BOCU-1 implementation, derived from "Sample C code", with all necessary copyrights and patent mentioned, but there's no word of them formally obtaining a license. I'm not sure if this is okay, or just overlooked.
DM Smith (migrated from JIRA)
The sample code is probably what is on this page, here: http://unicode.org/notes/tn6/#Sample_Code
From what I gather reading the whole page: If we port the sample code and the test case and then provide demonstration that all test pass, then we will be granted a license.
There's contact info at the bottom of the page for getting the license. Maybe, contact them for clarification?
As the code is fairly small, I don't think it would be too hard to port. The trick is that the sample code appears to deal in 32-bit arrays and we'd probably want a byte[].
Robert Muir (@rmuir) (migrated from JIRA)
attached is a simple prototype for encoding terms as BOCU-1
So while I don't think things like wildcard, etc will work due to the nature of BOCU-1, term and phrase queries should work fine, and it maintains UTF-8 order so sorting is fine, and range queries should work once we fix TermRangeQuery to use byte.
the impl is probably a bit slow (uses charset api) as its just for playing around.
note: I didnt check the box because of the patent thing, (not sure it even applies since i use the icu impl here), but either way i dont want to involve myself with that.
Uwe Schindler (@uschindler) (migrated from JIRA)
For correctness of code: target.offset = buffer.arrayOffset() + buffer.position(); But for most cases position() will be 0, but this is quite often an error. If you use limit() you have to use position(), else its inconsistent. arrayOffset() gives the offset corresponding to position=0. And length should be remaining()(for example see payload contrib IdentityEncoder)
And the default factory could be a singleton...
Robert Muir (@rmuir) (migrated from JIRA)
Uwe, sure, if we were to implement this I wouldnt use NIO anyway though, like i said i dont plan on committing anything (unless somethign is figured out about the patent), but it might be useful to someone.
I tested this on some hindi text:
encoding | tii | tis |
---|---|---|
utf-8 | 60,205 | 3,740,329 |
bocu-1 | 28,431 | 2,168,407 |
Uwe Schindler (@uschindler) (migrated from JIRA)
And one thing more, in the non-array case: buffer.get(target.bytes, target.offset, limit); target's offset should be set to 0 on all write operations to bytesref (see UnicodeUtil.UTF16toUTF8WithHash()). Else the grow() before does not resize correct!
Uwe Schindler (@uschindler) (migrated from JIRA)
Here the policed one :-)
In my opinion something is better than nothing. The patents are not violated here, as we only use an abstract API and the string "BOCU-1". You can use the same code to encode in "ISO-8859-1".
Uwe Schindler (@uschindler) (migrated from JIRA)
One more violation. Now its correct!
Uwe Schindler (@uschindler) (migrated from JIRA)
Here a heavy reusing variant.
Uwe Schindler (@uschindler) (migrated from JIRA)
The last one that could be used with any charset
Uwe Schindler (@uschindler) (migrated from JIRA)
Here is a 100% legally valid implementation:
I added further improvements to the encoder ittself:
Uwe Schindler (@uschindler) (migrated from JIRA)
A new patch that completely separates the BOCU factory from the implementation (which moves to common/miscellaneous). This has the following advantages:
TODO:
Michael McCandless (@mikemccand) (migrated from JIRA)
This is fabulous! And a great example of what's now possible w/ the cutover to opaque binary terms w/ flex – makes it easy to swap out how terms are encoded.
BOCU-1 is a much more compact encoding than UTF-8 for non-Latin languages.
This encoding would also naturally reduce the RAM required for the terms index and Terms/TermsIndex FieldCache (used when you sort by string field) as well, since Lucene just loads the [opaque] term bytes into RAM.
Robert Muir (@rmuir) (migrated from JIRA)
You can use any Charset to encode your terms. The javadocs should only note, that the byte[] order should be correct for range queries to work
I don't think we should add support for any non-unicode character sets.
If you want your complete index e.g. in ISO-8859-1
I am 100% against doing this.
Michael McCandless (@mikemccand) (migrated from JIRA)
Is there any reason not to make BOCU-1 Lucene's default encoding?
UTF8 penalizes non-english languages, and BOCU-1 does not, and it sounds like we expect little to no indexing or searching perf penalty (once we have a faster interface to BOCU1, eg our own private impl, like UnicodeUtil).
Robert Muir (@rmuir) (migrated from JIRA)
Is there any reason not to make BOCU-1 Lucene's default encoding?
in my opinion, just IBM :) But maybe we can make a strong implementation and they will approve it and give us a patent:
http://unicode.org/notes/tn6/#Intellectual_Property
UTF8 penalizes non-english languages, and BOCU-1 does not, and it sounds like we expect little to no indexing or searching perf penalty (once we have a faster interface to BOCU1, eg our own private impl, like UnicodeUtil).
I'd like to play with swapping it in as the default, just to see what problems (if any) there are, and to make sure all queries are supported, etc. I can upload a new patch that does it this way and we can play.
Michael McCandless (@mikemccand) (migrated from JIRA)
> Is there any reason not to make BOCU-1 Lucene's default encoding?
in my opinion, just IBM :)
But... ICU's license is compatible w/ ASL (I think), and includes a working impl of BOCU-1, so aren't we in the clear here? Ie we are free to take that impl, tweak it, add to our sources, and include ICU's license in our LICENSE/NOTICE?
Robert Muir (@rmuir) (migrated from JIRA)
But... ICU's license is compatible w/ ASL (I think), and includes a working impl of BOCU-1, so aren't we in the clear here? Ie we are free to take that impl, tweak it, add to our sources, and include ICU's license in our LICENSE/NOTICE?
I dont know... personally i wouldnt feel comfortable committing something without getting guidance first. but we can explore the technicals with patches on this jira issue and not check the box and i think this is all ok for now.
Robert Muir (@rmuir) (migrated from JIRA)
attached is a really really rough patch that sets bocu-1 as the default encoding.
Beware: its a work in progress and a lot of the patch is auto-generated (eclipse) so some things need to be reverted.
Most tests pass, the idea is to find bugs in tests etc that abuse bytesref/assume utf-8 encoding, things like that.
Robert Muir (@rmuir) (migrated from JIRA)
btw that patch is huge because i just sucked in the icu charset stuff to have an implementation that works for testing...
its not intended to ever be that way as we would just implement the stuff we need without this code, but it makes it easier to test since you dont need any external jars or muck with the build system at all.
Robert Muir (@rmuir) (migrated from JIRA)
attached is a patch for the start of a "BOCUUtil' with unicodeutil like methods.
For now i only implemented encode (and encodeWithHash):
I generated random strings with _TestUtil.randomRealisticUnicodeString and benchmarked, and the numbers are stable.
encoding | time to encode 20 million strings (ms) | number of encoded bytes |
---|---|---|
UTF-8 | 1,757 | 596,516,000 |
BOCU-1 | 1,968 | 250,202,000 |
So I think we get good compression, and good performance close to UTF-8 for encode. I'll work on decode now.
Michael McCandless (@mikemccand) (migrated from JIRA)
Slightly more optimized version of BOCU1 encode (but it's missing the hash variant).
Michael McCandless (@mikemccand) (migrated from JIRA)
Duh – that was some ancient wrong patch. This one should be right!
Michael McCandless (@mikemccand) (migrated from JIRA)
Just inlines the 2-byte diff case.
Michael McCandless (@mikemccand) (migrated from JIRA)
Inlines/unwinds the 3-byte cases. I think we can leave the 4 byte case as a for loop...
Robert Muir (@rmuir) (migrated from JIRA)
I ran tests, each one of Mike's optimizations speed up the encode...
I think I agree with not unrolling the 4-byte, the "diff" from the previous character has to be > 187659 [0x2dd0b] this is like pottery writings and oracle bone script... but the previous ones (2x, 3x) speed up CJK and other scripts and are very useful.
Robert Muir (@rmuir) (migrated from JIRA)
removed some ifs for the positive unrolled cases.
Robert Muir (@rmuir) (migrated from JIRA)
i optimized the surrogate case here, moving it into the 'prev' calculation. now we are faster than utf-8 on average for encode.
encoding | time to encode 20 million strings (ms) | number of encoded bytes |
---|---|---|
UTF-8 | 1,756 | 596,516,000 |
BOCU-1 | 1,724 | 250,202,000 |
Robert Muir (@rmuir) (migrated from JIRA)
oops, forgot a check in the surrogate case.
Robert Muir (@rmuir) (migrated from JIRA)
here it is with first stab at decoder (its correct against random icu strings, but i didnt benchmark yet)
Yonik Seeley (@yonik) (migrated from JIRA)
I took a stab at benchmarking encoding speed only with some different languages. I encoded a word at a time (which happens at indexing time). I used some text from wikipedia in different languages: english, german, french, spanish, and chinese. I used WhitespaceAnalyzer for the first 4 and StandardAnalyzer for the chinese (but analysis speed is not measured.)
encoding | english | german | french | spanish | chinese |
---|---|---|---|---|---|
UTF-8 size | 1888 | 4550 | 4875 | 5123 | 4497 |
BOCU-1 size | 1888 | 4610 | 4995 | 5249 | 4497 |
BOCU slowdown | 29% | 39% | 47% | 61% | 80% |
I suspect that the StandardAnalyzer is spitting out individual CJK chars, and hence the same size of BOCU-1 and UTF-8? I'll try and see if I can get SmartChineseAnalyzer working and re-run the chinese test.
Robert Muir (@rmuir) (migrated from JIRA)
yonik, what were you benchmarking? I think you should benchmark overall indexing time, of which encode is just a blip (<1% of).
and yes, since the start state is 0x40 the FIRST cjk char is a diff from 0x40, but any subsequent ones yield savings.
in general you wont get much compression for chinese.. id say max 25% for russian, arabic, hebrew, japanese it will do a lot better: max 40% for indian languages you tend to get about 50%.
I also dont know how you encoded word at a time, because i get quite different results. I focused a lot on 'single-byte diffs' to be fast (e.g. just subtraction) and I think i do a lot better for english than the 160% described in http://unicode.org/notes/tn6/
Furthermore, utf-8 is a complete no-op for english, so being a compression algorithm that is only 29% slower than (byte) char is good in my book, but i dont measure 29% for english.
I don't think there is any problem in encode speed at all.
Michael Busch (migrated from JIRA)
Yonik can you give more details about how you ran your tests?
Was it an isolated string encoding test or does BOCU slow down overall indexing speed by 29%-80% (which would be hard to believe).
Yonik Seeley (@yonik) (migrated from JIRA)
Yonik can you give more details about how you ran your tests?
I'm isolating encoding speed only (not analysis, not indexing, etc) of tokens in different languages. So I took some text from wikipedia, analyze it to get a list of char[], then encode each char[] in a loop. It's only the last step that is benchmarked to isolate the encode performance. I'm certainly not claiming that indexing is n% slower.
Robert Muir (@rmuir) (migrated from JIRA)
attached is my benchmark for english text.
UTF-8: 15530ms BOCU-1: 15687ms
Note, i use a Sun JVM 1.6.0_19 (64bit)
Yonik if you run this benchmark and find a problem with it / or its slower on your machine, let me know your configuration, because i dont see the results you do.
Yonik Seeley (@yonik) (migrated from JIRA)
in general you wont get much compression for chinese.. id say max 25%
Ah, OK. I just tried russian w/ whitespace analyzer used to split and did get a good size savings:
UTF8_size=11056 BOCU-1_size=6810 BOCU-1_slowdown=32%
Robert Muir (@rmuir) (migrated from JIRA)
Yonik, please see my issue.
the fact we can encode 100 million terms in 15 seconds, means any speed stuff is meaningless (though i still insist, something is wrong: either your benchmark, or it runs slower on your JDK or something (which we should try to improve)
Michael McCandless (@mikemccand) (migrated from JIRA)
The char[] -> byte[] encode time is a miniscule part of indexing time. And, in turn, indexing time is far less important than impact on search performance. So... let's focus on the search performance here.
Most queries are unaffected by the term encoding; it's only AutomatonQuery (= fuzzy, regexp, wildcard) that do a fair amount of decoding...
Net/net BOCU1 sounds like an awesome win over UTF8.
Robert Muir (@rmuir) (migrated from JIRA)
I just insist there is no real difference between this and UTF-8 for encoding english...
In lucene-1793, there is the off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer. The original supposition was that it provided a more compact index.
This led to the comment that a different or compressed encoding would be a generally useful feature.
BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provide it's own implementation a freely avIlable, royalty-free license would need to be obtained.
SCSU is another Unicode compression algorithm that could be used.
An advantage of these methods is that they work on the whole of Unicode. If that is not needed an encoding such as iso8859-1 (or whatever covers the input) could be used.
Migrated from LUCENE-1799 by DM Smith, 2 votes, updated Nov 30 2013 Attachments: Benchmark.java (versions: 3), LUCENE-1779.patch, LUCENE-1799_big.patch, LUCENE-1799.patch (versions: 14)