fabienrenaud / java-json-benchmark

Performance testing of serialization and deserialization of Java JSON libraries
MIT License

Weird setup leading to surprising results? #69

Open rmannibucau opened 1 year ago

rmannibucau commented 1 year ago

Hi,

Not sure I misread the setup, but there are generally two main categories of implementations out there: byte-based ones (working on byte[] / streams) and char-based ones (working on char[], String, or readers/writers).

Indeed it depends on several things, and even if byte-based ones are supposed to be faster, char-based ones are generally relevant when the chain uses char-based objects (readers/writers in the servlet API, for example) and delegates the char->byte handling to some backbone like the servlet container, so you don't have to worry about the impl details; the same applies in the case of files.

The issue with the current setup is that, depending on the provider, it uses either byte-based inputs/outputs or some suboptimal converters, including StringBuilder or plain String method shortcuts. So overall we can't compare the same chain between providers, nor easily pick the numbers aligned with our app's use case (the switch can more than double or halve the perf).

So overall it can be worth testing the different flavors, and even using the impl shortcuts when possible, which allow providing a char[] directly to the JSON parser (or equivalent) when relevant.
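For illustration, a rough sketch of what I mean by the different input flavors, using Jackson just as an example (the Person POJO and sample values are made up); which entry point the benchmark uses changes what is actually measured:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

public class InputFlavors {

    // Hypothetical payload type, just for the example.
    static class Person { public String name; public int age; }

    public static void main(String[] args) throws IOException {
        String json = "{\"name\":\"a\",\"age\":1}";
        ObjectMapper mapper = new ObjectMapper();

        // Byte-based chain: the library does the UTF-8 decoding itself.
        Person fromBytes = mapper.readValue(json.getBytes(StandardCharsets.UTF_8), Person.class);

        // Char-based chain: decoding already happened upstream (e.g. a servlet Reader).
        Reader reader = new StringReader(json);
        Person fromReader = mapper.readValue(reader, Person.class);

        // char[] shortcut: hand characters to the parser directly, no Reader wrapper.
        try (JsonParser parser = new JsonFactory().createParser(json.toCharArray())) {
            Person fromChars = mapper.readValue(parser, Person.class);
            System.out.println(fromBytes.name + " " + fromReader.name + " " + fromChars.name);
        }
    }
}
```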

Romain

rbygrave commented 1 year ago

Are you saying that byte based parsers/generators have an advantage in some tests here?

> and delegates the char->byte handling to some backbone like the servlet container, so you don't have to worry about the impl details

Noting that the fastest parsers and generators are using "pre-encoded keys" to skip the encoding + escaping of keys (and IMO they get a lot of their performance advantage from this, for example by skipping the byte <-> char encoding/decoding altogether). Some JSON libraries are doing this on BOTH parsing and generation and some ONLY on generation (but the benefit of pre-encoded keys can be seen for both parsing and generation, and is a function of the ratio of content that is keys vs data). Hence, instead of the servlet API readers/writers, some JSON libraries prefer the servlet inputStream/outputStream and process bytes, because some byte <-> char conversion can be skipped altogether.
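For example, a minimal sketch of the pre-encoded key idea using Jackson's streaming API (SerializedString holds the pre-encoded/pre-escaped name; the key and value here are just made up):

```java
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.io.SerializedString;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class PreEncodedKeyExample {

    // Encoded and escaped once, up front, so each write skips the per-call
    // char -> byte encoding of the field name.
    private static final SerializedString NAME_KEY = new SerializedString("name");

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (JsonGenerator gen = new JsonFactory().createGenerator(out, JsonEncoding.UTF8)) {
            gen.writeStartObject();
            gen.writeFieldName(NAME_KEY);   // reuses the pre-encoded key bytes
            gen.writeString("example");
            gen.writeEndObject();
        }
        System.out.println(out.toString("UTF-8")); // {"name":"example"}
    }
}
```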

A servlet container can also add additional encoding/decoding like compression of course (and then we get into buffering details).

rmannibucau commented 1 year ago

> Are you saying that byte based parsers/generators have an advantage in some tests here?

More than that, the tests are not comparable (String/StringBuilder-based ones will likely be slower than byte-based ones, for example, assuming the rest is identical).

I'm not speaking of the impl/pre-encoding there but really of the "stream pipeline". Typically, if you add a BufferedReader (or Writer) you can be slower for several impls, due to different buffer sizes or the additional encoding, compared to the lib's optimized flavor (directly using bytes and only converting to chars when needed, like Jackson does for example).
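Roughly, the two pipelines I mean, again with Jackson as an example and a hypothetical Person type; the buffer sizes and the extra decoding layer are exactly the variables that differ:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class PipelineFlavors {

    // Hypothetical payload type, just for the example.
    static class Person { public String name; public int age; }

    static final ObjectMapper MAPPER = new ObjectMapper();

    // Byte pipeline: the library decodes UTF-8 itself, with its own buffers.
    static Person fromBytes(InputStream in) throws IOException {
        return MAPPER.readValue(in, Person.class);
    }

    // Char pipeline: an extra decoding + buffering layer sits in front of the parser,
    // with buffer sizes chosen by the JDK wrappers rather than by the JSON library.
    static Person fromChars(InputStream in) throws IOException {
        Reader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return MAPPER.readValue(reader, Person.class);
    }

    public static void main(String[] args) throws IOException {
        byte[] json = "{\"name\":\"a\",\"age\":1}".getBytes(StandardCharsets.UTF_8);
        System.out.println(fromBytes(new ByteArrayInputStream(json)).name);
        System.out.println(fromChars(new ByteArrayInputStream(json)).name);
    }
}
```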

So overall some alignment is needed, and both cases should be reported, to be relevant IMHO.

fabienrenaud commented 1 year ago

Valid concern.

This benchmark was originally designed to test the fastest (theoretical) code paths for de/serializing JSON and to help devs pick a JSON lib in general, not in a particular context like the servlet API.

We could introduce more tests that would do this sort of apples-to-apples comparison for "char[]"-based coders only. Just make sure to pick the type of input/output that would give the lowest overhead / best perf in the servlet-api context. Best to control this via a new Api flag so we can easily generate distinct results for it.
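As a rough, hypothetical sketch of what such a flag could look like as a plain JMH parameter (the names below are made up, not how the existing Api handling works):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;

@State(Scope.Benchmark)
public class CharVsByteDeserBench {

    // Hypothetical flag: which input flavor the deserialization benchmark exercises.
    @Param({"BYTES", "CHARS"})
    public String inputFlavor;

    private final ObjectMapper mapper = new ObjectMapper();
    private byte[] jsonBytes;
    private String jsonChars;

    @Setup
    public void setup() {
        jsonChars = "{\"name\":\"a\",\"age\":1}";
        jsonBytes = jsonChars.getBytes(StandardCharsets.UTF_8);
    }

    @Benchmark
    public Map<?, ?> deser() throws Exception {
        return "BYTES".equals(inputFlavor)
                ? mapper.readValue(jsonBytes, Map.class)
                : mapper.readValue(new StringReader(jsonChars), Map.class);
    }
}
```

Each flavor then shows up as a separate row in the results, so the byte-based and char-based chains can be compared per library.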

Contributions are welcomed.

SentryMan commented 1 year ago

A lot of these libs don't have native integration with servlet libraries, so to use them you'd have to directly mess with input/output streams anyway, skipping the servlet readers and writers.

rmannibucau commented 1 year ago

Well, servlet was really just a sample; it is exactly the same with plain files, any network streams, etc. Basically you have two choices: use byte data and handle the encoding at the JSON layer, which works well for built-in encodings (UTF-8 mainly), or delegate the handling to a charset. From a caller perspective, readers/writers are always safer when the charset is contextual (but a bit slower; that said, we're speaking of perf that becomes negligible as soon as you add any I/O ;)) than bytes plus multiple encoding layers.

As usual there is no silver bullet, so it depends a bit on the context, but I just wanted to highlight that benchmarking in a relevant context can require tuning of the suite, and that default results should be taken with caution/review (not blaming anyone; benchmarking in a relevant manner is hard).