eishay / jvm-serializers

Benchmark comparing serialization libraries on the JVM
http://groups.google.com/group/java-serialization-benchmarking

Add DataKernel serializer (with ASCII optimization) #55

Closed vmykh closed 8 years ago

vmykh commented 8 years ago

Hi, I want to add the DataKernel serializer.

It's an extremely fast and space-efficient serializer, built using bytecode engineering.

As you can see from the benchmark, it is ~1.5 times faster than the closest competitor (counting total serialization + deserialization time).

Here you can find more info and examples. The source code can be found here. The DataKernel serializer is also available on Maven Central.

To create the jar file, the Maven Shade Plugin was used; it bundles all dependencies (such as ObjectWeb ASM) into the jar and avoids conflicts when the same dependency is used elsewhere at a different version. Such a jar is called an uber JAR. More details.
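For reference, here is a minimal usage sketch. The class and method names (`SerializerBuilder`, `BinarySerializer`, `@Serialize`) follow the current ActiveJ successor of DataKernel and are assumptions here; the 2015 DataKernel API may differ:

```java
import io.activej.serializer.BinarySerializer;
import io.activej.serializer.SerializerBuilder;
import io.activej.serializer.annotations.Serialize;

public final class Example {
    public static final class Person {
        @Serialize(order = 0)
        public String name;
        @Serialize(order = 1)
        public int age;
    }

    public static void main(String[] args) {
        // Generates a serializer class at runtime via ASM bytecode engineering.
        BinarySerializer<Person> serializer =
                SerializerBuilder.create().build(Person.class);

        Person p = new Person();
        p.name = "Alice";
        p.age = 30;

        byte[] buf = new byte[256];
        int end = serializer.encode(buf, 0, p);    // offset after the last byte written
        Person copy = serializer.decode(buf, 0);   // reads the object back from offset 0
    }
}
```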

myshzzx commented 8 years ago

The DataKernel serializer is quite limited and can be used in only a few situations.

That's all I know from its tutorial. I haven't tested its performance, but being fast is far from enough; there is a long way to go before it becomes a production-ready component. Tell me if anything here is wrong.

vmykh commented 8 years ago

myshzzx, thank you for the feedback. Here are some reasons why this approach was chosen:

myshzzx commented 8 years ago

That's a big sacrifice for extreme speed. It fills a gap in JVM serialization, as far as I know.

cowtowncoder commented 8 years ago

I think it is fine to discuss the various trade-offs of codecs on the mailing list, but I do not think there are any specific criteria for inclusion, regarding performance, implementation, or even limitations. That is, I don't think anyone objects to inclusion, as long as the test case is fair and the codec works well enough to pass verification and so forth.

vmykh commented 8 years ago

Well, actually, the DataKernel serializer supports UTF-8 (by default) as well as ASCII and UTF-16 strings (via annotations), letting you pick the encoding best suited to the data being serialized. According to the jvm-serializers wiki, there is a subcategory of manually optimized serializers, so it is allowed to use parameters that give the best result.

The ASCII optimization is quite similar to the ‘Nullable’ optimization, which is common in other serializers.

Nevertheless, running the benchmarks without the ASCII annotation gives slightly slower, but still the fastest, ‘total’ times for the DataKernel serializer compared to the other serializers, according to our measurements.
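For illustration, a hedged sketch of per-field string formats. The `@SerializeStringFormat` annotation and `StringFormat` constants below are taken from the ActiveJ successor of DataKernel and are assumptions here; the 2015 ASCII annotation may have been named differently:

```java
import io.activej.serializer.StringFormat;
import io.activej.serializer.annotations.Serialize;
import io.activej.serializer.annotations.SerializeStringFormat;

public final class MediaRecord {
    @Serialize(order = 0)
    public String title;   // default UTF-8: full Unicode allowed

    @Serialize(order = 1)
    @SerializeStringFormat(StringFormat.ISO_8859_1)  // single-byte fast path
    public String uri;     // schema guarantees ASCII-only content

    @Serialize(order = 2)
    @SerializeStringFormat(StringFormat.UTF16)       // two bytes per char
    public String comment;
}
```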

cakoose commented 8 years ago

A tangent regarding the ASCII optimization: basically, I think we shouldn't allow that optimization in the benchmark.

I think it's helpful to differentiate between the schema definition and the actual test value. Even though the test values happen to use only ASCII, the schemas are meant (in my opinion) to allow Unicode fields.

I think this reflects a common real-world scenario: most string fields are defined to be Unicode, even if the data is usually ASCII-only. So while it's fair to have a fast path for ASCII data (many serializers do), I think it's unfair if the serializer fails on Unicode.

Similarly, we only allow the non-null optimization if the schema specifies that a field is non-null. I don't think it's fair to apply the optimization to a nullable field just because the test value happens to never be null.
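To make the distinction concrete, here is a minimal sketch (not any particular serializer's actual code) of an ASCII fast path that still handles arbitrary Unicode: since ASCII bytes are themselves valid UTF-8, the writer can emit one byte per char and fall back to a full UTF-8 encode only when a non-ASCII character appears (length prefixing omitted for brevity):

```java
import java.nio.charset.StandardCharsets;

final class StringWriter {
    /** Writes s into buf starting at pos; returns the offset after the last byte. */
    static int writeString(byte[] buf, int pos, String s) {
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c >= 0x80) {
                // Slow path: non-ASCII found, re-encode the whole string as UTF-8,
                // overwriting the ASCII prefix already written.
                byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
                System.arraycopy(utf8, 0, buf, pos, utf8.length);
                return pos + utf8.length;
            }
            buf[pos + i] = (byte) c;  // Fast path: ASCII is one byte per char.
        }
        return pos + n;
    }
}
```

The reader can always decode the result as UTF-8, because the fast path produces a strict subset of UTF-8 output.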


cowtowncoder commented 8 years ago

I fully agree with @cakoose. I don't see a problem in a codec optimizing for likely cases (of, say, ASCII), but I think it must be capable of handling non-ASCII. This is one reason why perhaps we should force the use of a couple of non-ASCII characters -- I know there are 4 different test files, some of which have them. But maybe all of them should have a few sprinkled around, to make sure codecs do not take unrealistic shortcuts; a codec that was ASCII-only would be of quite limited value in my opinion. Even if the majority of content were ASCII, there's always "that one document" where this is not true and things break.
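A hedged sketch of the kind of round-trip check this would imply; `Codec` is a stand-in interface invented here, not the project's actual test harness API:

```java
// Codec is a stand-in interface for illustration, not the project's test API.
interface Codec {
    byte[] serialize(String value) throws Exception;
    String deserialize(byte[] bytes) throws Exception;
}

final class UnicodeRoundTripCheck {
    static void verify(Codec codec) throws Exception {
        // Deliberately mixes accented Latin, CJK, and a non-BMP character.
        String probe = "média 日本語 \uD83D\uDE00";
        String back = codec.deserialize(codec.serialize(probe));
        if (!probe.equals(back)) {
            throw new AssertionError("codec mangles non-ASCII content: " + back);
        }
    }
}
```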

vmykh commented 8 years ago

I agree that Unicode support is usually needed (and the DataKernel serializer still gives great results with Unicode, so we are fine with UTF-8).

But I think in many cases the ASCII optimization can be really useful:

Maybe the ASCII optimization should be allowed for certain fields like "uri" and "format", and disallowed for the rest, which could potentially contain Unicode? Wouldn't that give a more balanced and realistic benchmark?

cowtowncoder commented 8 years ago

@vmykh If the schema (whatever its source: external, or based on a Java class definition) indicates that the datatype can NOT contain non-ASCII content, sure. I have no specific objection to denoting that for the fields you mention, and leaving others like "title" and "description" as full Unicode.

cakoose commented 8 years ago

I think the ASCII optimization is a useful one to make. Many users will see real-world benefits from it. But the goal of this benchmark, I believe, is to be as useful as possible to as many people as possible, and I'm not sure if incorporating the ASCII-only optimization will aid in that goal.

Long story (again, sorry):

Having been around this project for a long time has taught me how difficult it is to make good benchmarks in general and, specifically, how misleading ours might be. One of the biggest issues is that we only have a single test value. It's something Eishay Smith came up with in 2008 when he wanted to do a quick comparison of Thrift vs Protobuf.

We've since made changes where we thought the benefit was unambiguous. For example, the test strings used to be "a" and "b" and the test numbers were 1 and 2 -- we switched to longer strings and bigger numbers.

But the fact remains that our test value is still very biased. For example, there are real world use cases that involve serializing tons of numbers and those use cases aren't represented in our benchmark. Changing the test value to some other reasonable value could end up changing the results by maybe 25%. Given that, it seems silly to me that we report results with the number of significant figures that we do! (I always worry about people taking our results too seriously.)

One way to improve this project is to have multiple test values that try to cover different use cases (ASCII-only, lots of numbers, short strings, long strings, etc.) [1]. This would let a user say "hmm, my values have lots of ASCII and lots of numbers" and pay more attention to those particular result sets.

But as it stands now, we're only using a single test schema/value. To me, it's not clear whether changing that schema/value to have ASCII-only strings will help towards the end goal of making the results more useful for more people.
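As a hypothetical sketch of that idea (the `DataProfile` enum and `testValueFor` factory are made up for illustration, not part of the project):

```java
import java.util.EnumMap;
import java.util.Map;

// Each profile represents a different data shape; results would be
// reported per profile instead of as a single aggregate number.
enum DataProfile { ASCII_ONLY, MIXED_UNICODE, NUMERIC_HEAVY, SHORT_STRINGS, LONG_STRINGS }

final class ProfiledTestValues {
    static Map<DataProfile, Object> build() {
        Map<DataProfile, Object> values = new EnumMap<>(DataProfile.class);
        for (DataProfile profile : DataProfile.values()) {
            values.put(profile, testValueFor(profile));
        }
        return values;
    }

    static Object testValueFor(DataProfile profile) {
        switch (profile) {
            case ASCII_ONLY:    return new String[] { "http://example.com/video", "video/mpg4" };
            case MIXED_UNICODE: return new String[] { "média", "日本語のタイトル" };
            case NUMERIC_HEAVY: return new long[] { 1L, 1234567890123L, -42L };
            case SHORT_STRINGS: return new String[] { "a", "b" };
            case LONG_STRINGS:  return new String[] { "x".repeat(1024) };
            default:            throw new AssertionError();
        }
    }
}
```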


vmykh commented 8 years ago

Okay, I've updated this pull request; the ASCII optimization is now used only for the "uri" and "format" fields. Alternatively, I've also created another pull request where the serializer doesn't use the ASCII optimization at all. You can decide which one better conforms to the requirements and merge that one.

pascaldekloe commented 8 years ago

Favoured #56 over ASCII optimizations.