apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Improve LZ4 Compression performance with direct primitive read/writes [LUCENE-10112] #11149

Closed. asfimport closed this issue 3 years ago.

asfimport commented 3 years ago

Summary

Java 9 introduced VarHandles as a tool to quickly read and write primitive types directly to byte arrays without redundant bounds checks. The LZ4 compressor class repeatedly reads ints from a byte array to analyze matches. Performance can be improved by reading these ints through a VarHandle.
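
A minimal sketch of the idea, assuming a VarHandle view over byte[] (the class and names here are illustrative, not the patch itself):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

class ReadIntSketch {
  // One VarHandle, created once, views a byte[] as ints in native byte order.
  private static final VarHandle VH_NATIVE_INT =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.nativeOrder());

  // Reads four bytes starting at offset i as a single int: one range check
  // and typically one CPU load instead of four masked byte reads.
  static int readInt(byte[] buf, int i) {
    return (int) VH_NATIVE_INT.get(buf, i);
  }
}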

Additionally, the LZ4 compressor/decompressor methods currently read/write the bytes of LE shorts individually. Lucene's DataOutput/DataInput abstractions already have dedicated methods for reading/writing LE shorts. These methods are optimized in certain implementations and provide better performance than individual byte reads.

Concerns

The DataOutput/DataInput readShort() and writeShort() methods do not call out that they are LE. It just looks to me that DataOutput/DataInput are LE? Since this particular change does not appear to provide significant performance wins, maybe the patch is better off keeping the explicit individual byte reads?

Additionally, this patch changes the int reads to use the platform's native byte order, which should be fine since the ints are just used for matching bytes. But I can change it to read in the same order the previous version did.

Benchmarks

I created JMH benchmarks that compress 1MB of highly compressible JSON observability data, 64KB at a time. To simulate the "short" changes, I used a forked version of ByteArrayDataOutput that writes shorts using a VarHandle (to simulate the fast writes that the ByteBuffer versions would get). I also ran a benchmark without the short changes, with only the ints read using a VarHandle.

Benchmark                                          Mode  Cnt    Score   Error  Units
MyBenchmark.testCompressLuceneLZ4                 thrpt    9  712.430 ± 3.616  ops/s
MyBenchmark.testCompressLuceneLZ4Forked           thrpt    9  945.380 ± 4.776  ops/s
MyBenchmark.testCompressLuceneLZ4ForkedNoShort    thrpt    9  940.812 ± 3.868  ops/s
MyBenchmark.testCompressLuceneLZ4HC               thrpt    9  147.432 ± 4.730  ops/s
MyBenchmark.testCompressLuceneLZ4HCForked         thrpt    9  183.954 ± 2.534  ops/s
MyBenchmark.testCompressLuceneLZ4HCForkedNoShort  thrpt    9  188.065 ± 0.727  ops/s

Migrated from LUCENE-10112 by Tim Brooks, resolved Sep 20 2021.
Attachments: LUCENE-10112.patch
Pull requests: https://github.com/apache/lucene/pull/310

asfimport commented 3 years ago

Tim Brooks (migrated from JIRA)

After researching more, it appears to me that #10089 is where the guarantee that the shorts are LE was introduced. However, since the directory format is different from the LZ4 format, I am leaning towards avoiding a dependency between the two. Let me know what you all think. Since the majority of the performance improvement here seems to come from the int VarHandle, I am leaning towards the patch only including that change, with the LZ4 class continuing to explicitly read/write LE shorts byte by byte.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Are you sure the patch is correct? The Lucene decoder in readInt() is BIG_ENDIAN and you use native order when creating the VarHandle, so it looks like this won't work correctly on all platforms. Does it pass tests on little endian (Intel) at all?

   private static int readInt(byte[] buf, int i) {
-    return ((buf[i] & 0xFF) << 24)
-        | ((buf[i + 1] & 0xFF) << 16)
-        | ((buf[i + 2] & 0xFF) << 8)
-        | (buf[i + 3] & 0xFF);
+    return (int) intPlatformNative.get(buf, i);
   }

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> After researching more, it appears to me that #10089 is where the guarantee that the shorts are LE was introduced.

This was changed from big endian in Lucene 1 through 8 to little endian in Lucene 9+. It was always defined with a specific endianness since early times.

LZ4 is different, as it does not use directory APIs in all cases. Also, it is not clear which version of Lucene your patch is made for.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Ah, you said the readInt is just reading bytes to compare them with other bytes, so byte order does not matter? In that case native order is better, yes.

I am not familiar with the code of LZ4 and where readInt is used.

Maybe provide a pull request to make review easier.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Unrelated to your current patch regarding LZ4: The ByteArrayDataInput and ByteArrayDataOutput classes could use the same trick with VarHandles for all data types (short, int, long), but there with ByteOrder.LITTLE_ENDIAN (in Lucene 9+). This may be an interesting change, too.

With LZ4, I have no idea how the readInt() method is used internally and if byte order makes a difference or not.

Regarding readShort: The LZ4 code is older than Lucene 9's change to little endian for file formats. So we may change it to use writeShort(). In combination with a possible ByteArrayDataOutput improvement this could improve performance, but with the current code your observation is mostly correct: no change from writeShort().
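
A rough sketch of what that ByteArrayDataInput/ByteArrayDataOutput idea could look like, assuming Lucene 9's little-endian formats (class and names are illustrative):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

class LittleEndianByteArrayReads {
  static final VarHandle VH_LE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
  static final VarHandle VH_LE_INT =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
  static final VarHandle VH_LE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  // Each read replaces 2/4/8 masked byte loads with a single VarHandle access.
  static short readShort(byte[] b, int pos) { return (short) VH_LE_SHORT.get(b, pos); }
  static int readInt(byte[] b, int pos) { return (int) VH_LE_INT.get(b, pos); }
  static long readLong(byte[] b, int pos) { return (long) VH_LE_LONG.get(b, pos); }
}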

asfimport commented 3 years ago

Tim Brooks (migrated from JIRA)

>  Maybe provide a pull request to make review easier.

I will look into doing this soon.

>  Unrelated to your current patch regarding LZ4: The ByteArrayDataInput and ByteArrayDataOutput classes could use the same trick with VarHandles for all data types (short, int, long), but there with ByteOrder.LITTLE_ENDIAN (in Lucene 9+). This may be an interesting change, too.

Yeah. I have prototyped a patch related to this too that I will submit in a few days. I just wanted to start with LZ4 as it is an area I have been focused on recently in relation to Elasticsearch. And reading a bunch of ints without bounds checks was a significant performance improvement I had noticed with the lz4-java library.

I'm not sure how commonly ByteArrayDataInput is used in Lucene. A lot of the flame graphs I look at in Elasticsearch are indexing into the ByteBuffer variants (and byte buffers internally already have some of these optimizations). But since Lucene is on Java 9+ there is no reason not to use VarHandles for byte arrays.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> Yeah. I have prototyped a patch related to this too that I will submit in a few days.

I opened #11150. It is all done and the PR is underway, including a missing test for the class.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

About the styling in your patch: static finals should always be uppercased. A special case in the JDK/Java community are MethodHandles, which should start with "MH" and can then use lowercase characters to refer to a "real method name". The same goes for VarHandles: they should start with "VH". If they refer to fields in a class, those names can then be lowercased, but here I'd use all-uppercase, because these VarHandles are not linked to field names. See the PR on the related issue.

I would take this issue and apply it, but I'd like to clarify if the native endianness makes a difference for encoded/decoded LZ4 streams. If they behave differently in the algorithm, I'd stick with BE for now. Maybe @jpountz can clarify.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> The DataOutput/DataInput readShort() and writeShort() methods do not call out that they are LE. It just looks to me that DataOutput/DataInput are LE?

Could you open an issue to clarify the javadocs of DataInput and DataOutput to state the byte order of primitive reads? In the main branch (9.0) those are little endian, but in fact it is written down nowhere. The documentation update should maybe also be backported, with LE replaced by BE.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

> I would take this issue and apply it, but I'd like to clarify if the native endianness makes a difference for encoded/decoded LZ4 streams. If they behave differently in the algorithm, I'd stick with BE for now. Maybe Adrien Grand can clarify.

The only difference that the byte order makes there is that you could get different hash collisions, so a BE platform could miss some duplicate strings in the input that a LE platform would find, or vice-versa. Other than that, data that gets compressed on a BE platform can still be decompressed on a LE platform and vice-versa.

Using the platform's byte order is the same approach as the C implementation of LZ4 takes. I wonder if we should enforce LE all the time instead, to get the performance benefits on the most common platforms without the downside of making the same input documents generate different bytes on disk depending on the byte order.
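
A small illustration of the point (a hypothetical demo, not Lucene code): the same four bytes yield different int values, and therefore different hash-table slots, depending on byte order, yet the emitted stream stays decodable either way.

class ByteOrderDemo {
  public static void main(String[] args) {
    byte[] buf = {1, 2, 3, 4};
    // Big-endian view of the four bytes.
    int be = ((buf[0] & 0xFF) << 24) | ((buf[1] & 0xFF) << 16)
        | ((buf[2] & 0xFF) << 8) | (buf[3] & 0xFF);
    // Little-endian view of the same bytes.
    int le = Integer.reverseBytes(be);
    System.out.printf("BE=0x%08X LE=0x%08X%n", be, le); // BE=0x01020304 LE=0x04030201
    // Different values hash into different slots, so match finding may differ,
    // but any valid LZ4 stream decompresses correctly on any platform.
  }
}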

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi,

after reading the LZ4 Java library on GitHub I came to the same conclusion; see the README at https://github.com/lz4/lz4-java#compatibility-notes:

> Compressors might not generate the same compressed streams on all platforms, especially if CPU endianness differs, but the compressed streams can be safely decompressed by any decompressor implementation on any platform.

So for speed reasons I'd stay with the patch (but we can use DataOutput#writeShort() as suggested, because it is now defined to be little endian).

Alternative: If we want to change to LE in Lucene 9, then I would simply delay this issue until #11150 is done and use the BitUtil#VH_LE_INT static field to access the reusable VarHandle for this.

So, short question: be as fast as possible and allow different LZ4 output on different platforms, or change to LE?

I would agree to the first one, because we have no requirement to produce the same index format for every platform endianness. We already have some markers in data files at other places (Robert found those in Lucene90PostingsWriter).

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The byte order mark in Lucene90PostingsWriter is a relic. It just writes this marker byte, but it's not used anywhere. We should remove the code!

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

See #11151 about this anomaly. :-)

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

just to clarify, AFAIK using these VarHandle methods doesn't do magic "without bounds checks", just maybe fewer of them? e.g., if I want to read a long out of a byte[], there really needs to be only one range check, instead of eight.

so using VH may not speed up all usages. If the hotspot compiler is already eliminating the bounds checks for the existing code, I expect no speedup. But in many cases this is probably not happening, and I like the clarity/consistency of using JDK conversion methods versus "hand-rolled" stuff.
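
For example, the "hand-rolled" BE read of a long looks like this (an illustration, not code from the patch); each of the eight array accesses is a candidate for its own bounds check unless hotspot proves them redundant:

static long readLongHandRolled(byte[] b, int i) {
  // Eight indexed accesses; a VarHandle view needs a single range check.
  return ((long) (b[i] & 0xFF) << 56)
      | ((long) (b[i + 1] & 0xFF) << 48)
      | ((long) (b[i + 2] & 0xFF) << 40)
      | ((long) (b[i + 3] & 0xFF) << 32)
      | ((long) (b[i + 4] & 0xFF) << 24)
      | ((long) (b[i + 5] & 0xFF) << 16)
      | ((long) (b[i + 6] & 0xFF) << 8)
      | ((long) (b[i + 7] & 0xFF));
}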

+1 to Adrien's idea to just use the LE converter here. I really think ByteOrder.nativeOrder() should be in forbidden APIs myself. The problem is that no Jenkins is testing Lucene on big endian platforms: by using native order (even if we think it is safe), it introduces risks. So personally I would rather see explicit endianness.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> just to clarify, AFAIK using these VarHandle methods doesn't do magic "without bounds checks", just maybe fewer of them? e.g., if I want to read a long out of a byte[], there really needs to be only one range check, instead of eight.

There is another change: it uses one CPU instruction to load the 32-bit integer, and the load is "atomic" (if concurrency were important; not relevant here). With the new code, not only are bounds checks removed: we explicitly use one CPU instruction and do not rely on hotspot to combine the byte loads.

> I really think ByteOrder.nativeOrder() should be in forbidden APIs myself.

That would be a bad idea regarding Project Panama, when you want to use native APIs :) I just say this :)

> +1 to Adrien's idea to just use the LE converter here. The problem is that no Jenkins is testing Lucene on big endian platforms: by using native order (even if we think it is safe), it introduces risks. So personally I would rather see explicit endianness.

Actually the current code uses BE, so it is not buggy. To me the risk was only whether we can read a BE-compressed LZ4 stream when using an LE platform, which was confirmed above. In short: I'd like to see a performance comparison of purely the LZ4 decoder with LE and BE on an x86 platform.

Theoretically we can test both encodings (this also makes sure that we can read old BE-encoded indexes) and at runtime use the platform's endianness.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> That would be a bad idea regarding Project Panama, when you want to use native APIs :) I just say this :)

I'm not worried about this, it is unclear they will ever actually release it.

So for NOW, I'd like to see ByteOrder.nativeOrder not used here, or anywhere. I really don't give a shit if a JMH benchmark is faster: bswap costs nothing, bugs are expensive. And nobody is testing on big endian platforms. Please, please, please, do not use it, or the patch will have my vote against it.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

No problem, you convinced me!

I'd suggest we let #11150 go into main and then adapt the patch here to use the LE VarHandle for ints. The change to writeShort/readShort can also be applied, or alternatively we use the VarHandle for shorts, too.

I will set this issue as "requires LUCENE-10113".

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We already have testing for the LE/BE difference: TestBackwardsCompatibility will load an 8.x index, which used BE for encoding the LZ4 stored fields. So we are safe that it is interoperable.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

TestBackwardsCompatibility does not address my concerns about using ByteOrder.nativeOrder().

Sorry, I'm gonna stick with -1 against ByteOrder.nativeOrder().

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> TestBackwardsCompatibility does not address my concerns about using ByteOrder.nativeOrder().

Robert, calm down and maybe drink a coffee first! You just misunderstood me! In my first sentence I said that I agree.

My concern was just the following: in Lucene 8.x all LZ4 encoding uses hashes based on BE integers (look at the "old" code). Our requirement is: an index saved on disk and created with those previous versions must be readable in 9.0, regardless of the endianness we use. And that's asserted by TestBackwardsCompatibility: it checks that BE-based hashes from 8.x are still readable and decode correctly in LE Lucene 9.

Again, in very short: TestBackwardsCompatibility ensures that an index written with 8.x is readable and its stored fields decompress correctly, although in 8.x we used BE and now we use hardcoded LE (as YOU suggested). This statement had nothing to do with nativeOrder().

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Basically the discussion here is the same as for the other open issues we announced for #11150: e.g., IDPostingsFormat should also change to LE instead of BE. We should open an issue and discuss whether it is an index format change or if it just works (TM). Same for the facet module.

So this issue is a good example. What I learned for LZ4: the encoding of the hashes used to find the similarities does not matter, so we can change from BE to LE in 9.x and this is not an index format change.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

#11150 was merged; it is now time to adapt the patch here. I will open a new PR.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Tim Brooks, here is your patch as a pull request; I also adapted it to use the little-endian VarHandle introduced by #11150: https://github.com/apache/lucene/pull/310

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I had to revert the read/write of shorts, because this seems to cause errors. It looks like Lucene 9 sometimes uses an EndiannessReverserDataInput, which causes test failures when reading 8.x indexes (which have different endianness): this endianness reversal causes the LZ4 stream to contain wrong shorts.

So we can't rely on the endianness of the DataInput passed to our LZ4 method and should hardcode it when writing/reading the LZ4 format. So the PR only has the central readInt method for hashing.
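
The core of the merged change then reduces to something like this sketch (BitUtil#VH_LE_INT being the shared little-endian VarHandle introduced by #11150):

// The int is only used for hashing, so a fixed LE read is safe on every
// platform and fast on the common little-endian ones.
private static int readInt(byte[] buf, int i) {
  return (int) BitUtil.VH_LE_INT.get(buf, i);
}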

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

FWIW this speedup is especially relevant for Lucene now that we use dictionaries, since initializing hash tables from a dictionary is even more intensive on reading ints from byte[] than actually compressing the data.

asfimport commented 3 years ago

Tim Brooks (migrated from JIRA)

>  So we can't rely on endianness of DataInput passed to our LZ4 method and should hardcode it when writing/reading the LZ4 format. So the PR only has the central readInt method for hashing.

Sounds good. As I mentioned, in my testing, the short optimizations were not that important. 

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

@jpountz: I don't fully understand how the dictionaries work. As they are somehow provided externally, do those work well with our change from BE to LE? Sorry for my ignorance, but I have no idea what the dictionary does and where it comes from (is it fed from previous compressions of the same index segment?).

If you can confirm that LE/BE does not matter when the dictionary is added externally or maybe merged from other segments with the Lucene 8 codec, I will merge the PR and we will see what Mike's nightly indexing benchmarks say!

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> Sounds good. As I mentioned, in my testing, the short optimizations were not that important.

In Lucene 10 we can add them back, because no endianness reverser will kick in for backwards codecs :-)

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@uschindler By default, the only thing that LZ4 does is replace strings with a previous instance of the same string that occurred in the last 64kB (this is where the short comes from). Providing a dictionary helps improve compression ratios by also being able to replace a string with an instance of the same string from the dictionary. This is especially useful when compressing small inputs.

Dictionaries could be external, e.g. if you're compressing Java source code, you could use a static dictionary that contains the most frequent keywords like public, static... For stored fields, we use the data itself as a dictionary: when we need to compress a block of stored fields of ~84kB, we compress the first 4kB independently and then use these 4kB as a dictionary to compress 10 sub-blocks of 8kB.
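
In other words, the split described above looks roughly like this (sizes from the comment; the constant names are made up):

// Layout of one ~84kB stored-fields block:
//   [ 4kB preset ] compressed independently, then reused as the dictionary
//   [ 8kB x 10 ]   each sub-block compressed against that 4kB dictionary
static final int DICT_LENGTH = 4 * 1024;
static final int SUB_BLOCK_LENGTH = 8 * 1024;
static final int NUM_SUB_BLOCKS = 10;
// 4kB + 10 * 8kB = 84kB total
static final int BLOCK_LENGTH = DICT_LENGTH + NUM_SUB_BLOCKS * SUB_BLOCK_LENGTH;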

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

OK, thanks for the clarification. To me the code looked something like this; I just wanted to be sure.

So I will merge the PR now.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 5871ea797244fd395e428576b6698cee09a5e2f1 in lucene's branch refs/heads/main from Uwe Schindler https://gitbox.apache.org/repos/asf?p=lucene.git;h=5871ea7

LUCENE-10112: Improve LZ4 Compression performance with direct primitive read/writes (#310)

Co-authored-by: Tim Brooks <tim@timbrooks.org>

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks Tim Brooks.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

> I don't fully understand how the dictionaries work. As they are somehow provided externally, do those work well with our change from BE to LE?

I realized that I didn't specifically answer your question about whether dictionaries would raise issues with a change of endianness; the answer is no.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release