apache / datasketches-memory

High performance native memory access for Java.
https://datasketches.apache.org
Apache License 2.0

utf8 PR #14

Closed. leerho closed this 6 years ago.

leerho commented 6 years ago

@leventov

I made some style and cosmetic changes to your PR, which I pulled into the utf8 branch here. I had to add finals, make static some methods that could be static, etc.; tools/SketchesCheckstyle should have caught those.

Question 1: You did not create a parallel get/put utf8 for buffer. Why?

Question 2: Have you done any testing that shows that adding this utf8 capability into Memory improves performance over what could have been done external to Memory?

leerho commented 6 years ago

@leventov

Also, if we were to add this to Buffer, we would need to put the DecodeUtil class in Util as package private.

BTW: This is beautiful work!

leventov commented 6 years ago

1) Because I don't need it yet.

I want to rewrite Druid's GenericIndexed and ObjectStrategy interface to use Memory instead of ByteBuffers. Currently I think the interface should look like this:

interface ObjectStrategy<ObjectType, ReusableStateType>
{
  ...
  ReusableStateType makeReusableState();
  ObjectType read(Memory mem, long offset, int numBytes, ReusableStateType reusableState);
  ObjectType readNewObject(
      Memory mem,
      long offset,
      int numBytes,
      ReusableStateType reusableState
  );
}

And STRING_STRATEGY could look like this:

class StringStrategy implements ObjectStrategy<String, StringBuilder>
{
  ...
  StringBuilder makeReusableState() { return new StringBuilder(); }

  String read(Memory mem, long offset, int numBytes, StringBuilder reusableStringBuilder)
  {
    reusableStringBuilder.setLength(0);                    // reuse the builder; no per-read buffer allocation
    mem.getUtf8(offset, reusableStringBuilder, numBytes);  // decode directly into the builder
    return reusableStringBuilder.toString();
  }
}

It's not Region-based, because Regions are mutable, so we couldn't pass the same Region when deserializing all objects. On the other hand, we don't want to create a new Region for each new object deserialized.

2) I didn't compare. I wouldn't be surprised if the JDK's UTF-8 encoding/decoding is faster than this, because it's heavily optimized, probably even intrinsified. But the primary goal is to avoid creating garbage. The new interface of ObjectStrategy, which uses getUtf8/putUtf8, allows generating the absolute minimum garbage possible as long as Druid's code centers around Strings, and it could be made garbage-free if Druid used CharSequences instead of Strings.

And for complex object columns, it would already allow garbage-free ser/de, which I need for https://github.com/druid-io/druid/issues/4622. See also https://github.com/druid-io/druid/pull/5172

FYI @cheddar @gianm

leventov commented 6 years ago

P.S. It's not my work; I just copied some code from the https://github.com/google/protobuf repo.

leventov commented 6 years ago

@leerho FYI I pushed another small commit to this branch: https://github.com/DataSketches/memory/commit/ca731bff34d7464b178fe03241f00b5209166f56

leerho commented 6 years ago

@leventov

The method region() returns an immutable view; writableRegion() returns a mutable one.

Nonetheless, I get your point about not wanting to create another object.

I would like to do some characterization tests just to see how severe a hit we take with this compared to native Java, which I think would require an extra copy that would only be in Eden.

I think we may be OK, but I want to check with some folks to make sure we don't have a licensing issue.

leventov commented 6 years ago

Licensing issue with protobuf, you mean? There shouldn't be, because the protobuf code is BSD-licensed, which is compatible with Apache.

leerho commented 6 years ago

@leventov

Yes. Although it doesn't specifically name the license, I did confirm that it is exactly the 3-clause "BSD License 2.0".

BTW, you might be interested in this thread.
It is a little old now, but it discusses several performance factors, such as typical string length and whether the strings are typically mostly ASCII. These factors appear to affect whether the simpler String.getBytes() or the direct encoding/decoding in protobuf performs better.

Most of the Java String.getBytes() implementation issues they mention are still in the Java 8 code (e.g., the 4x oversize allocation of the byte[] on encode).

I also had the thought (like David Dabbs) of directly accessing the String value array via reflection, read-only. It would eliminate a copy on encode. Kenton Varda does not like the idea, but we are already down that path with ByteBuffer and Unsafe. Protobuf already uses Unsafe, so I'm not sure Kenton's objection is still valid.

I would suggest we implement some temporary methods, getUtf8Alt() and putUtf8Alt(), which would use the much simpler String.getBytes() and String constructors; then run them on some actual Druid data and measure the encode/decode times and GC times, and then run the same tests with the methods you have proposed. Is this something you could do?

leventov commented 6 years ago

I don't see the point of doing this. There are already methods to get and put byte arrays from/to Memory; they could serve as the intermediate step for JDK-based encoding if somebody wants to do that. There is no need to implement special methods in the Memory library.
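For concreteness, a minimal sketch of that intermediate-step path (mem, wmem, offsetBytes, and utf8Length are placeholders; getByteArray/putByteArray are the existing bulk-copy methods, and the temporary byte[] is exactly the garbage this approach accepts):

import java.nio.charset.StandardCharsets;

// Decode: Memory -> byte[] -> String
byte[] tmp = new byte[utf8Length];
mem.getByteArray(offsetBytes, tmp, 0, utf8Length);
String s = new String(tmp, StandardCharsets.UTF_8);

// Encode: String -> byte[] -> Memory
byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
wmem.putByteArray(offsetBytes, encoded, 0, encoded.length);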

The methods I added are garbage-free; that is the reason for their existence. If somebody values speed over garbage production, they can use the JDK as described above. BTW, there is no guarantee that the JDK is faster; we have just assumed that. But it doesn't matter.

leventov commented 6 years ago

@leerho what is the point of disagreement here?

leerho commented 6 years ago

Not a major disagreement, but we are missing two things:

Because this library is used for applications beyond what you and I might envision, we need to be as clear as possible to users about why we are adding these Utf8 methods to the library and when these methods should be used.

Part of the delay has been my work on getting the characterization framework put together to run our characterization tests (some of which can run for many hours!). Now that that is pretty much done, I have started designing some characterization tests that measure the Utf8 encode and decode speed and compare them with the String methods. I also want to instrument it so we can demonstrate differences in GC performance as well.

Once we have adequate user guidance and the data to back it up, I will feel comfortable merging.

leventov commented 6 years ago

Thanks for this work.

Actually, there is a way to achieve a low garbage rate with the JDK as well, but it requires first copying bytes from Memory to a ByteBuffer, then invoking a CharsetDecoder with a cached CharBuffer, and then creating a String from the CharBuffer. But since this path is not as optimized as new String(byte[], "UTF-8"), I expect the specialized reading method in Memory to be faster.
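Roughly, that cached-buffer path could look like the following sketch (Memory's getByteArray bulk copy is assumed; the buffer sizes are illustrative, and CoderResult/overflow handling is elided):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class CachedUtf8Reader {
  // Cached and reused across calls to avoid per-read allocation.
  private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
  private final ByteBuffer byteBuf = ByteBuffer.allocate(4096);
  private final CharBuffer charBuf = CharBuffer.allocate(4096);

  String read(Memory mem, long offsetBytes, int utf8Length) {
    mem.getByteArray(offsetBytes, byteBuf.array(), 0, utf8Length); // Memory -> cached byte[]
    byteBuf.clear();
    byteBuf.limit(utf8Length);
    charBuf.clear();
    decoder.reset();
    decoder.decode(byteBuf, charBuf, true); // cached byte[] -> cached char[]
    charBuf.flip();
    return charBuf.toString(); // the String is the only per-call allocation
  }
}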

leerho commented 6 years ago

@leventov

The first data is in, and it is quite a surprise!

[Plot: utf8speedbycp_t16cp12 — Memory vs. Java UTF-8 encode/decode speed]

These plots are normalized as nanoseconds per code point (CP). The X-axis is the number of CP per trial. The number of trials at each plot point is 2^16 (LgT=16), and there are 4 plot points per octave (PPO).

The CP are randomly selected from the entire valid CP range; thus nearly all of the CP will be from the supplementary planes (planes 1 to 16), not the BMP, and will require two surrogate characters in String form and 4 bytes in UTF-8. I might try another test using only CP from the BMP, as the encoding and decoding should be faster.
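As a concrete example of that relationship between code points, surrogates, and UTF-8 bytes:

// U+1F600, a code point outside the BMP:
String s = new String(Character.toChars(0x1F600));
assert s.length() == 2;  // stored in the String as a surrogate pair: \uD83D \uDE00
assert s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length == 4;  // F0 9F 98 80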

Nonetheless, I was really surprised to see the Memory encoding be faster than the Java encoding by 25%! The Memory decoding is slower than the Java decoding, but only by about 15%. Not bad!!

The next test will be to measure GC. I did have to insert System.gc() prior to each Trial Set to eliminate GC spikes in 3 of the plots (JavaEnc, JavaDec, and MemDec). The MemEnc did not have GC spikes!

This means that there is still some GC activity in the MemDec test, but it may be caused by something I am doing in the test rather than by the MemDec function itself.

All this code will be checked in shortly. Please take a look and run it if you are curious.

leerho commented 6 years ago

@leventov

To run this test: run com.yahoo.sketches.characterization.Job and provide it a single string argument of "src/main/resources/memory/HeapUtf8SpeedJob.txt". The dependencies are sketches-core-0.10.3.jar, TestNG (which requires JCommander), and the current memory utf8 branch.

leerho commented 6 years ago

@leventov

The current test setup is not free of garbage generation during the trial iterations, and I could make it a lot cleaner. So I cannot conclude that the MemDec test is garbage-free; it may have been a coincidence.

leerho commented 6 years ago

@leventov

I want to discuss a different topic: the design of the method APIs.

GET: void getUtf8(long offsetBytes, Appendable dst, int utf8Length)

PUT: long putUtf8(long offsetBytes, CharSequence src)

Please give me some code snippets of how you would use these in practice, especially in the case where you are storing multiple strings in a single Memory.
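(For illustration, here is one hypothetical shape such a snippet could take, with strings stored back-to-back as {int utf8Length, utf8 bytes} pairs — a layout invented for this example, using the signatures above; strings and numStrings are placeholders.)

// Put: record each string's encoded length just before its bytes.
WritableMemory wmem = WritableMemory.allocate(1 << 20);
long pos = 0;
for (CharSequence s : strings) {
  long bytesStart = pos + Integer.BYTES;
  long next = wmem.putUtf8(bytesStart, s);     // returns the offset of the next byte
  wmem.putInt(pos, (int) (next - bytesStart)); // the encoded utf8 length
  pos = next;
}

// Get: read the length, then decode that many bytes into a reused builder.
StringBuilder sb = new StringBuilder();
long rpos = 0;
for (int i = 0; i < numStrings; i++) {
  int utf8Len = wmem.getInt(rpos);
  sb.setLength(0);
  wmem.getUtf8(rpos + Integer.BYTES, sb, utf8Len);
  rpos += Integer.BYTES + utf8Len;
  // ... use sb ...
}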

leerho commented 6 years ago

@leventov

Curious why getUtf8() throws Utf8CodingException (extends RuntimeException) for both a bounds problem and an illegal byte sequence.

Meanwhile, putUtf8() throws an IllegalArgumentException for a bounds problem and an UnpairedSurrogateException (extends IAE) for what is essentially an illegal-character-sequence problem.

I think we could use the same Utf8CodingException class for all 4 cases and distinguish them with separate and clearer messages.

I have started down this path, but wanted to understand your reasoning here.


Also, I noted that asserting state.isValid() was missing on both get and put. (I already added it.)

I'm also adding more code comments as I go through the code. :)


Also, in Utf8.putUtf8(), the last "else":

If I have grokked the logic correctly, at the last "else" we know we have a valid surrogate character in hand (we don't know which kind) and we have less than 4 bytes of room to place the 4-byte encoding of what should be a surrogate pair. So it is possible to have both an unpaired-surrogate problem and a bounds problem.

In the preceding "else" we don't make a distinction: either a bounds problem or a surrogate pair problem throws the same exception.

I think we can separate these problems in the exception logic, so that it doesn't add any cycles into the path. Unless I'm missing something, I will go ahead and clarify this.

leerho commented 6 years ago

@leventov

In Utf8.putUtf8(), line 112: after dispatching an ASCII sequence, you need to subtract the cumBaseOffset! Fixed!

This illustrates that we need some tests that actually put and get multiple sequential strings. Could you contribute that?

leventov commented 6 years ago

> The problem I am having with the method name is that it is not analogous with getLongArray(). One is not getting Utf8 as a result. We could name it getCharArrayFromUtf8(), getCharsFromUtf8(), or decodeUtf8Bytes().

I agree; of those options, I like getCharsFromUtf8().

> How do we expect the user to know what the utf8Length is? Suppose the Memory has a set of strings stored in Utf8. How does one know the length of each group of bytes? If the internal structure of the Memory were {int, byte[], int, byte[], ...}, then it would be possible. This, in effect, would make a specialized positional interface, which might make more sense as part of Buffer.

I didn't understand the last part of this, but in Druid (in GenericIndexed), the sizes of objects are stored in a completely different area of memory: there is an "array of lengths" and then an "array of contents".

> I am also thinking about providing more flexibility, similar to getPrimitiveArray(), where we allow specification of a destination offset and length. But the relationship between Utf8 bytes and chars is complex. I'm not sure how this would be used for characters outside the BMP.

I also didn't understand this. What is BMP?

getUtf8 (or getCharsFromUtf8, if we rename) uses Appendable as output; that's flexible enough, I think. Even CharBuffer implements Appendable: you can set the CharBuffer's position prior to calling this method to achieve output at an offset (see the sketch below). When a StringBuilder is used as the argument, I don't see cases for specifying an output offset different from "after the last char in the StringBuilder" (you call StringBuilder.setLength() to leave the needed prefix, particularly setLength(0) if you don't need a prefix).
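For example, a sketch of that CharBuffer trick (mem, offsetBytes, utf8Length, and prefixLength are placeholders; capacity management is elided):

import java.nio.CharBuffer;

CharBuffer cb = CharBuffer.allocate(1024);
cb.clear();
cb.position(prefixLength);                // decoded chars are appended starting here
mem.getUtf8(offsetBytes, cb, utf8Length); // the CharBuffer is the Appendable destination
cb.flip();                                // [0, limit) now holds prefix + decoded chars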

> I have a similar suggestion about the putUtf8() method name: long putCharsToUtf8(), etc.

Yes, I agree with putCharsToUtf8(), for symmetry.

> The way this method is specified, it assumes that the entire CharSequence is to be encoded. Would additional parameters to allow an offset and length make sense? The complications of planes outside the BMP remain.

I don't see cases for this. And a CharBuffer's "char sequence" is between the CharBuffer's position and limit; you could use that to emulate offsets. Again, what is BMP?

> Please give me some code snippets of how you would use these in practice, especially in the case where you are storing multiple strings in a single Memory.

Pretty much the only (but important) use case for this pair of methods in Druid is here: https://github.com/DataSketches/memory/issues/14#issuecomment-353591816; it doesn't store multiple consecutive strings.

leventov commented 6 years ago

> In Utf8.putUtf8(), line 112: after dispatching an ASCII sequence, you need to subtract the cumBaseOffset! Fixed!
>
> This illustrates that we need some tests that actually put and get multiple sequential strings. Could you contribute that?

Thanks. I didn't add multi-string tests (I don't really see how they would help), but I improved test coverage so that it would have caught this bug.

leventov commented 6 years ago

I renamed getUtf8 to getCharsAsUtf8 and putUtf8 to putCharsAsUtf8.

leerho commented 6 years ago

BMP = Basic Multilingual Plane

leerho commented 6 years ago

@leventov

CharSequence.subSequence(int start, int end) already allows for an offset into the sequence, so why wouldn't we use that (instead of manually having to set a position)?

leerho commented 6 years ago

Also, Appendable.append(CharSequence csq, int start, int end) has the same concept.

long putCharsToUtf8(long offsetBytes, CharSequence src) returns the address of the next byte. Why would you need this unless you were going to put something in the Memory after the sequence of Utf8 bytes? Presumably, the Memory is already big enough and you already know its capacity?

leerho commented 6 years ago

What I gather you are saying is that the Memory would always be used (in your environment) as a standalone container for a single Utf8 sequence of bytes. If that is the case, why do we even need utf8Length in getCharsFromUtf8(), especially if it is always set to Memory.getCapacity()?

leerho commented 6 years ago

Another idea would be to have the offsets and lengths specified in code points (CP) rather than characters. This would be easy to do, and even Java does not do it very well. Once one understands the concept of CP, they are much less ambiguous than characters outside the BMP and are equivalent to characters inside the BMP.

I'd have to think some more, but I think if we used CP, it would be possible to insert multiple CP sequences into a Memory without separators. You would keep a separate list of CP lengths, like you do now, but it would likely reduce errors caused by byte and character misalignments.

leventov commented 6 years ago

> CharSequence.subSequence(int start, int end) already allows for an offset into the sequence, so why wouldn't we use that (instead of manually having to set a position)?

It produces garbage.

> long putCharsToUtf8(long offsetBytes, CharSequence src) returns the address of the next byte. Why would you need this unless you were going to put something in the Memory after the sequence of Utf8 bytes? Presumably, the Memory is already big enough and you already know its capacity?

It returns the offset so that the utf8 length doesn't need to be computed again. It's needed even in Druid's case, i.e., when Memory is not used for serial puts of different things.

> What I gather you are saying is that the Memory would always be used (in your environment) as a standalone container for a single Utf8 sequence of bytes. If that is the case, why do we even need utf8Length in getCharsFromUtf8(), especially if it is always set to Memory.getCapacity()?

No, one giant Memory is used for many strings. See the GenericIndexed class in Druid.

> Another idea would be to have the offsets and lengths specified in code points (CP) rather than characters. This would be easy to do, and even Java does not do it very well. Once one understands the concept of CP, they are much less ambiguous than characters outside the BMP and are equivalent to characters inside the BMP.
>
> I'd have to think some more, but I think if we used CP, it would be possible to insert multiple CP sequences into a Memory without separators. You would keep a separate list of CP lengths, like you do now, but it would likely reduce errors caused by byte and character misalignments.

Basically, all the Java code and APIs that I have seen are centered around chars (Strings, StringBuilders, CharSequences, Jackson, regex parsing, etc.), including Druid, so I don't see the value in being "smarter" here in Memory.

leerho commented 6 years ago

@leventov

getCharsAsUtf8() is just not as clear as getCharsFromUtf8(); in fact, it doesn't really make sense. "As" has many uses (and abuses), and you can research the etymology yourself, but whether it plays the role of an adverb, conjunction, pronoun, or preposition, the word just doesn't work in this context.

Here it is trying to be an adverb, "get ... as ...", and associates a quality, "Utf8", with the noun "characters". But characters are not Utf8, nor do they ever have that quality. One could say "get characters as Utf16" or "get characters as ASCII", because these describe a quality or feature of how the sequence of characters is encoded, and characters clearly can have those qualities.

This same difficulty also extends to "putCharsAsUtf8()", although it is more subtle. Here we are effectively saying "put characters that have the quality Utf8". But, again, characters never have that quality. "putCharsToUtf8" is much, much clearer.

If you mentally substitute "bytes" for "Utf8", it may help. Utf8 has the quality of bytes, and bytes can have the quality Utf8. getCharsFromBytes() and putCharsToBytes() is what we are doing. These are not just any kind of bytes; they are, in fact, Utf8 bytes.

leerho commented 6 years ago

@leventov

I have been looking at GenericIndexed. I presume that this is the data file that contains the metadata offsets and lengths of the stored objects and contains the stored values as well. If this is the case, then GI is effectively a structure with a preamble, followed by numElements integer offsets, followed by numElements pairs of {int length, byte[] value}. (Am I close?)

Clearly the offsets and lengths are computed and inserted as the values are discovered and stored, so it all works. And, once constructed, the proposed getCharsFromUtf8() and putCharsToUtf8() work in this environment. Nonetheless, it is an assumption that the user must know, and have ready access to, the number of Utf8 bytes before decoding them.

I could well imagine an environment where the user only knows the number of characters or code points that need to be decoded and not the number of bytes. I realize that these variants could be added later. I just want to make the assumptions clear, which means the Javadocs should make this clear to the user as well.

Also, I would recommend a different ordering of the arguments: getCharsFromUtf8(long offsetBytes, int utf8LengthBytes, Appendable dst).

This is because utf8LengthBytes is closely associated with offsetBytes and has nothing to do with the destination. If you look at getPrimitiveArray(long offsetBytes, prim[] dstArray, int dstOffset, int dstLength), the final 3 arguments are all relevant to the destination type, which may not be bytes. I don't want users to be confused by these different argument layouts.

leerho commented 6 years ago

TODO:

leventov commented 6 years ago

About performance: put performance is not relevant for GenericIndexed, only get. I don't trust non-JMH benchmarks, so I created my own, benchmarks of get without producing garbage: https://github.com/leventov/memory-jmh-benchmarks

It turned out that appending to a StringBuilder is much worse than to a CharBuffer, because of StringBuilder's ensureCapacity() calls on each char append. The Hotspot JIT was also not able to optimize the CharBuffer use itself, so manual copy-paste in the Utf8 class in Memory was needed, along with some other optimizations.

Now performance numbers that I see are:

Benchmark                                  (ascii)  (len)  Mode  Cnt   Score   Error  Units
Utf8ReadBenchmark.byteBufferDecode            true     10  avgt   10   1.734 ± 0.047  us/op
Utf8ReadBenchmark.byteBufferDecode            true     30  avgt   10   2.477 ± 0.077  us/op
Utf8ReadBenchmark.byteBufferDecode            true    100  avgt   10   5.632 ± 0.241  us/op
Utf8ReadBenchmark.byteBufferDecode           false     10  avgt   10   3.343 ± 0.090  us/op
Utf8ReadBenchmark.byteBufferDecode           false     30  avgt   10   7.093 ± 0.178  us/op
Utf8ReadBenchmark.byteBufferDecode           false    100  avgt   10  18.768 ± 0.435  us/op
Utf8ReadBenchmark.memoryGetBytesAndDecode     true     10  avgt   10   2.191 ± 0.060  us/op
Utf8ReadBenchmark.memoryGetBytesAndDecode     true     30  avgt   10   3.179 ± 0.052  us/op
Utf8ReadBenchmark.memoryGetBytesAndDecode     true    100  avgt   10   6.176 ± 0.140  us/op
Utf8ReadBenchmark.memoryGetBytesAndDecode    false     10  avgt   10   3.928 ± 0.142  us/op
Utf8ReadBenchmark.memoryGetBytesAndDecode    false     30  avgt   10   7.827 ± 0.258  us/op
Utf8ReadBenchmark.memoryGetBytesAndDecode    false    100  avgt   10  19.612 ± 0.375  us/op
Utf8ReadBenchmark.memoryGetCharsAsUtf8        true     10  avgt   10   1.438 ± 0.038  us/op
Utf8ReadBenchmark.memoryGetCharsAsUtf8        true     30  avgt   10   2.624 ± 0.054  us/op
Utf8ReadBenchmark.memoryGetCharsAsUtf8        true    100  avgt   10   6.746 ± 0.116  us/op
Utf8ReadBenchmark.memoryGetCharsAsUtf8       false     10  avgt   10   2.735 ± 0.063  us/op
Utf8ReadBenchmark.memoryGetCharsAsUtf8       false     30  avgt   10   6.693 ± 0.287  us/op
Utf8ReadBenchmark.memoryGetCharsAsUtf8       false    100  avgt   10  19.946 ± 0.650  us/op

Decoding of long ASCII-only Strings is still faster with the JDK, because Hotspot optimizes byte copying from byte[] to char[] really well, unlike Unsafe access to char[]. It's understandable.

leventov commented 6 years ago

Also see https://github.com/DataSketches/memory/commit/dd0638f2df2a6a4fb44011e54463d5c337f86259

I think that, in general, all methods dealing with arrays or series of bytes, not just the single-primitive get/put of 1/2/4/8 bytes, should always check bounds, not assert bounds. The idea of assert is that in ser/de code like

putInt(field1);
putLong(field2);
...

bounds are still checked, not asserted, at the beginning of the procedure, because the procedure knows how many bytes in total it is going to put or get. The asserted bounds inside each method are left "just in case", but they don't duplicate the "global" check when assertions are off.
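A sketch of that pattern (the field sizes and the explicit check are illustrative):

// One explicit, always-on bounds check up front for the whole procedure;
// the individual puts then rely only on their internal asserted bounds.
void serialize(WritableMemory wmem, long offsetBytes, int field1, long field2) {
  long totalBytes = Integer.BYTES + Long.BYTES;
  if (offsetBytes < 0 || offsetBytes + totalBytes > wmem.getCapacity()) {
    throw new IllegalArgumentException("bounds violation");
  }
  wmem.putInt(offsetBytes, field1);
  wmem.putLong(offsetBytes + Integer.BYTES, field2);
}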

But with methods putting/getting arrays or sequences of bytes, including putChars/getChars, we shouldn't expect users to make such extra checks beforehand, because those would just duplicate the checks made inside the methods. Also, the cost of this extra check should most often be negligible compared to the cost of the "body" of the bulk operation.

leerho commented 6 years ago

Thanks for running the jmh tests.

Nonetheless, I'm having difficulty relating your measurements to mine. You are measuring microseconds per op for an entire benchmark; I was computing the effective time per code point. In terms of the work performed, I can correlate my numbers pretty closely with wall-clock time.

I would not expect the JIT to kick in with only 5 warmups and 10 measured iterations, so I haven't found JMH numbers to be very useful. And with only 10 iterations, the +/- error could be misleading. I have found the JMH overhead per benchmark to be quite large, which precludes running many thousands to millions of trials. As a result, I haven't developed any trust in JMH numbers :( What am I missing?


WRT bounds checking. You have a good point.

leventov commented 6 years ago

> Nonetheless, I'm having difficulty relating your measurements to mine. You are measuring microseconds per op for an entire benchmark; I was computing the effective time per code point. In terms of the work performed, I can correlate my numbers pretty closely with wall-clock time.

The time in my benchmark is just "bananas", only useful within the benchmark.

> I would not expect the JIT to kick in with only 5 warmups and 10 measured iterations.

It can actually be seen in real time when you run this benchmark. It depends on the benchmark (for some, your note is true), but in this particular one the final C2 compilation happens after 2, and more rarely after 3 or 4, warmup iterations.

> And with only 10 iterations, the +/- error could be misleading. I have found the JMH overhead per benchmark to be quite large, which precludes running many thousands to millions of trials.

It doesn't run just 10 invocations; it runs for 10 seconds, i.e., in this bench (at say 10 us/op) it runs 100k ops per second, or 1 million in total. Running more doesn't improve precision, because the machine itself is not deterministic, many things happen in the background, etc. I find the current precision acceptable.

Also, without JMH, you cannot easily look at the ASM of the hot loop and the distribution of time spent at each instruction, which is essential for verifying that you didn't make a mistake in the bench, that you are measuring the real work, and that you don't have obvious bottlenecks or suboptimalities, and for determining how you could speed your method up.

leventov commented 6 years ago

Anyway, do you think the utf8 branch could be merged now?

leventov commented 6 years ago

"Iteration" in JMH is not literally one run of a benchmarked method, the latter is called "invocation" in JMH terminology. "Iteration" is specified by @Measurement class level annotation, by default the duration of each "iteration" is 1 second, and as many as possible "invocations" are done during that one "iteration". Time of each "invocation" is not measured, only the number of "invocations" is counted, which are able to fit in 1 second. Then 1 second divided by num of invocations fitted in is considered the "avg time" of "invocation" in this "iteration", that is presented everywhere as the time of "iteration", that is probably confusing.

leerho commented 6 years ago

Thanks for your brief explanation. I am used to doing probabilistic error measurements of accuracy in stochastic algorithms, where I have to have good control over the number of trials in order to compute accurate error distributions. For those kinds of measurements, JMH is just out of the question.

But the ability to examine the assembly instructions would be a big win. I will have to study up on JMH for this.


I think we are pretty close. I have reworked the exception handling and improved the messages based on the different types of errors. I am working on code coverage now.

leerho commented 6 years ago

@leventov

Our last edits crossed :( so it took me a bit to clean things up, but I think it is nearly ready to merge.

Please take a close look at it. You should now see the much more detailed exception handling. I had to pass cumBaseOffset into the new methods you had created, for the error messages; absolute memory addresses are useless in an error message.

I also redesigned the last section of putCharsToUtf8() to streamline the code and make it easier to understand, and added lots of comments as well.

My CheckStyle suddenly stopped working, so I need to get it working again and then run it and FindBugs against the code.

leerho commented 6 years ago

The number of uncovered elements is pretty high. We have only 86% coverage. We are mostly missing coverage of the CharBuffer methods.

leerho commented 6 years ago

Coverage is now improved, but there are some parts of Utf8 that are really hard to hit :(

FindBugs and checkstyle pass. Merged to master.

leventov commented 6 years ago

@leerho thank you