densh / scala-offheap

Experimental type-safe off-heap memory for Scala.
BSD 3-Clause "New" or "Revised" License
532 stars, 38 forks

UTF8 string serialization #15

Open. velvia opened this issue 9 years ago

velvia commented 9 years ago

Strings form a large portion of many objects. Just storing a pointer to the on-heap String object is not a practical way to reduce GC pressure. Instead, how about having a UTF-8-based string wrapper class that offers support for basic operations:

equals()
startsWith()
maybe contains()

Other, more complex methods can be delegated to the native Java/Scala String class by materializing an on-heap String on demand, but the above would offer enough support for simple things like HTTP or JSON parsing.

The goal is to allow basic, fast string operations without the expensive conversion and object allocation needed to decode UTF-8 bytes into Java's native UTF-16 String representation.
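A minimal sketch of what such a wrapper could look like, for illustration only. `Utf8String` is a hypothetical name and not part of scala-offheap, and a direct `java.nio.ByteBuffer` stands in for off-heap memory. The three operations compare raw UTF-8 bytes; an on-heap String is only created when `toString` is called.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical wrapper, not part of scala-offheap.
// `buf` is expected to have position 0 and limit == byte length.
final class Utf8String(private val buf: ByteBuffer) {
  // Length in bytes, not in characters.
  def length: Int = buf.limit()

  // Byte-wise equality; no String or array is allocated.
  override def equals(other: Any): Boolean = other match {
    case that: Utf8String => length == that.length && regionMatches(0, that)
    case _                => false
  }

  // ByteBuffer.hashCode depends only on the remaining bytes,
  // so it is consistent with the byte-wise equals above.
  override def hashCode: Int = buf.hashCode()

  def startsWith(prefix: Utf8String): Boolean =
    prefix.length <= length && regionMatches(0, prefix)

  // Naive O(n * m) scan, enough for short needles.
  def contains(needle: Utf8String): Boolean = {
    val last = length - needle.length
    var i = 0
    while (i <= last) {
      if (regionMatches(i, needle)) return true
      i += 1
    }
    false
  }

  // Anything more complex can go through an on-heap String, created on demand.
  override def toString: String = {
    val bytes = new Array[Byte](length)
    buf.duplicate().get(bytes)
    new String(bytes, UTF_8)
  }

  // True if the bytes of `that` appear in this string starting at `offset`.
  private def regionMatches(offset: Int, that: Utf8String): Boolean = {
    var i = 0
    while (i < that.length) {
      if (buf.get(offset + i) != that.buf.get(i)) return false
      i += 1
    }
    true
  }
}

object Utf8String {
  // Encode an on-heap String into a direct (off-heap) buffer.
  def apply(s: String): Utf8String = {
    val bytes = s.getBytes(UTF_8)
    val buf = ByteBuffer.allocateDirect(bytes.length)
    buf.put(bytes)
    buf.flip()
    new Utf8String(buf)
  }
}
```

For example, `Utf8String("GET /index.html HTTP/1.1").startsWith(Utf8String("GET "))` checks an HTTP method without decoding the request line. Byte-wise comparison treats two strings as equal only if their encodings are identical, so it does not perform Unicode normalization.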

densh commented 9 years ago

I think that having support for offheap strings in the API is a great idea. I'm not sure about details of the implementation yet, but I'll update the issue once I have some more concrete thoughts on the topic.

andresilva commented 9 years ago

I agree that the conversion to String is indeed expensive and incurs an unnecessary object allocation. Still, you'll have a hard time beating the performance of String#equals() since the JVM has an intrinsic method that uses SSE4.2 instructions to do the comparison. You might be able to use Arrays.equals (which is also intrinsic) but then you'd incur an allocation since you need to create a byte array from off-heap memory. I'm curious to see what you come up with. :smile:
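To make the trade-off concrete, here is a small sketch of the `Arrays.equals` route. The object and method names (`CopyCompare`, `toHeap`, `equalsViaCopy`, `offHeap`) are hypothetical, and a direct `ByteBuffer` is assumed as the off-heap storage. The comparison itself is delegated to `java.util.Arrays.equals`, but each call first copies both operands into temporary on-heap arrays.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

// Hypothetical helpers; a direct ByteBuffer stands in for off-heap memory.
object CopyCompare {
  // Copy the buffer's remaining bytes into a fresh on-heap array.
  def toHeap(buf: ByteBuffer): Array[Byte] = {
    val out = new Array[Byte](buf.remaining())
    buf.duplicate().get(out) // duplicate so the caller's position is untouched
    out
  }

  // Delegates to Arrays.equals, at the cost of two temporary arrays per call.
  def equalsViaCopy(a: ByteBuffer, b: ByteBuffer): Boolean =
    Arrays.equals(toHeap(a), toHeap(b))

  // Encode a String as UTF-8 into a direct buffer.
  def offHeap(s: String): ByteBuffer = {
    val bytes = s.getBytes(UTF_8)
    val buf = ByteBuffer.allocateDirect(bytes.length)
    buf.put(bytes)
    buf.flip()
    buf
  }
}
```

Whether the faster comparison outweighs the copies and the extra garbage would have to be measured; this sketch only shows where the allocation comes from.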

velvia commented 9 years ago

Unsafe has a memcpy equivalent (copyMemory); too bad it doesn't have a memcmp equivalent... :(


(Sent by email in reply to André Silva's comment of Mar 20, 2015, 4:47 PM, quoted above.)
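Without a built-in memory compare, one possible workaround is a hand-rolled loop over raw addresses. The sketch below is an illustration, not something the thread settled on: `OffHeapCompare`, `memEquals`, `copyOffHeap` and `free` are hypothetical names, and it assumes a JDK on which `sun.misc.Unsafe` memory access is still permitted and a platform that tolerates `getLong` at the addresses used.

```scala
import sun.misc.Unsafe

// Hypothetical helpers built on sun.misc.Unsafe.
object OffHeapCompare {
  // Obtain the Unsafe singleton reflectively via its internal "theUnsafe" field.
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  // True if the `len` bytes at addresses `a` and `b` are identical.
  def memEquals(a: Long, b: Long, len: Long): Boolean = {
    var i = 0L
    // Compare 8 bytes at a time while a full word remains.
    while (i + 8 <= len) {
      if (unsafe.getLong(a + i) != unsafe.getLong(b + i)) return false
      i += 8
    }
    // Compare the remaining 0 to 7 bytes one at a time.
    while (i < len) {
      if (unsafe.getByte(a + i) != unsafe.getByte(b + i)) return false
      i += 1
    }
    true
  }

  // Copy a byte array into newly allocated off-heap memory.
  // The caller is responsible for calling `free` on the returned address.
  def copyOffHeap(bytes: Array[Byte]): Long = {
    val addr = unsafe.allocateMemory(math.max(bytes.length, 1).toLong)
    unsafe.copyMemory(bytes, Unsafe.ARRAY_BYTE_BASE_OFFSET.toLong, null, addr, bytes.length.toLong)
    addr
  }

  def free(addr: Long): Unit = unsafe.freeMemory(addr)
}
```

This only answers equality. An ordered, memcmp-style result would additionally need to find the first differing byte and compare it as an unsigned value.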

densh commented 9 years ago

JNI might be the answer here. Since we don't need to copy any data over (the data is already effectively allocated on the C heap), we wouldn't have much performance overhead. Of course, we need to benchmark to validate this.

arosenberger commented 9 years ago

Hi Denys,

With the jemalloc JNI binding, we can also add utility functions to expose low-level operations or, potentially, SIMD instructions. For the latter, I think we might have to be careful about the chipset family of the target platforms. I can dig into some of the HotSpot code from OpenJDK and check their implementation. For now, I can put this work into a parallel branch while we flesh out the jemalloc binding, and plan to include it in the JNI library that houses jemalloc.

densh commented 9 years ago

@arosenberger Please don't use GPL code bases as a reference. We use the Scala license (a 3-clause BSD derivative) for our code and can only borrow implementation ideas from software with a compatible license. Otherwise we might get into legal trouble some day even if we don't borrow any code. (Note to self: this really needs to be documented somewhere.)

densh commented 9 years ago

@arosenberger I think that we need to concentrate on getting 0.1 out before we proceed with this. I'm afraid there are lots of corner cases in string support and it will take a while to get it right.

arosenberger commented 9 years ago

Thanks for the heads up on the GPL. I'll focus on finishing up jemalloc and adding the ArrayOps methods from the other issues. We can revisit this one down the road.