Properly handle UTF-8 keys and values

casperisfine commented 4 years ago

rb_str_new and rb_intern create strings with ENC_CODERANGE_7BIT (Encoding::US_ASCII), which like it's name implies doesn't allow for anything outside of the ASCII range.

According to the JSON spec, string should be assumed to be UTF-8, so IMHO it's better to just go with that.

Obviously this is a bit slower, but is required for corectness.

anilmaurya commented 4 years ago

@casperisfine How much slower, do you have any benchmark for it ?

I am thinking, if it's making huge performance impact then we can give option to select ASCII or UTF-8, what do you think ?

casperisfine commented 4 years ago

I'll try to benchmark it. But long story short, if the symbol already exist rb_intern is able to not allocate a String. It directly lookup the symbol table with the char * (I'm simplifying but that's the idea).

So this means each key will always allocated a string, generating more objects to be gced. So it's not really much slower immediately, but generates a bit more GC pressure.

we can give option to select ASCII or UTF-8, what do you think ?

It's possible yes. But I would advise to make UTF-8 the default. Super fast parsers are nice, but speed should take a backseat to correcteness.

casperisfine commented 4 years ago

So on my machine, using file_ips_benchmark.rb:

current master:

FastJsonparser 874.782 (± 2.1%) i/s - 4.450k in 5.089163s

This branch:

FastJsonparser 786.259 (± 2.2%) i/s - 3.978k in 5.061801s

casperisfine commented 4 years ago

One alternative could be to check in cpp wether the string is ASCII only. It still would be a bit slower, but not as much.

anilmaurya commented 4 years ago

@casperisfine

One alternative could be to check in cpp wether the string is ASCII only.

This sounds a good option to me, can we get a benchmark for this option ? Thank you for suggestion.

casperisfine commented 4 years ago

So I tried that here: https://github.com/casperisfine/fast_jsonparser/commit/c14d0e768d9ac3b425d5ee8bd9f1093ab4822197

master:

Warming up --------------------------------------
      FastJsonparser    86.000  i/100ms
                  OJ    56.000  i/100ms
       standard JSON    31.000  i/100ms
Calculating -------------------------------------
      FastJsonparser    850.404  (± 2.6%) i/s -      4.300k in   5.059867s
                  OJ    564.745  (± 1.6%) i/s -      2.856k in   5.058347s
       standard JSON    315.189  (± 1.6%) i/s -      1.581k in   5.017322s

this branch:

Warming up --------------------------------------
      FastJsonparser    77.000  i/100ms
                  OJ    56.000  i/100ms
       standard JSON    32.000  i/100ms
Calculating -------------------------------------
      FastJsonparser    781.121  (± 1.8%) i/s -      3.927k in   5.029071s
                  OJ    562.968  (± 2.3%) i/s -      2.856k in   5.076014s
       standard JSON    317.943  (± 2.2%) i/s -      1.600k in   5.034690s

the ascii check branch:

Warming up --------------------------------------
      FastJsonparser    78.000  i/100ms
                  OJ    56.000  i/100ms
       standard JSON    31.000  i/100ms
Calculating -------------------------------------
      FastJsonparser    777.211  (± 2.4%) i/s -      3.900k in   5.021019s
                  OJ    555.251  (± 3.2%) i/s -      2.800k in   5.048275s
       standard JSON    318.336  (± 1.9%) i/s -      1.612k in   5.065718s

So at least on that particular benchmark, it's not really much better. I could try to port the SIMD ascii checks, but that would only help for larger strings, so I really don't expect any improvement for objects keys in realistic scenarios.

casperisfine commented 4 years ago

The solution to this might come from simdjson in the future:

casperisfine commented 4 years ago

cc @lemire since you commented on this repo, simdjson/simdjson#184 / simdjson/simdjson#264 would be very useful for the performance of the ruby bindings (and possibly other languages?)

lemire commented 4 years ago

@casperisfine

The solution to this might come from simdjson in the future

We are committed to helping the bindings become better. Yet we have to understand what the problems are.

Issue 184 has to do with how we could have different storage depending on the length of the string. It would have mostly an internal (to simdjson) benefit. I don't think we would expand our API so that you can get "short and normal" strings. That would be a nightmare of usability. It would have to provide very clear performance benefits to the users, and it seems unlikely. I must say that issue 184 is unlikely to ever be implemented, and I think we will drop it. When I first thought about it, I figured that most keys would be very short, but when you look at actual JSON documents, you find plenty of relatively long keys.

Issue 264 is more interesting. The request is for a feature where you could ask whether all keys are ASCII. Note that this is only for keys, not for strings in general (as they are less likely to be ASCII). This is justified because, very often, keys are only ASCII. But it is not clear to me why it is best done within simdjson. You can check, for yourself, whether a given key is ASCII. That is easy enough. If people find it convenient, we could certainly provide a helper function which would take a string_view instance and tell you whether it is ASCII.

I could try to port the SIMD ascii checks, but that would only help for larger strings, so I really don't expect any improvement for objects keys in realistic scenarios.

Given how short keys are, you probably don't need SIMD at all to get great performance and for a problem so simple, compilers can autovectorize (gcc with -O3 and clang with -O2).

As you no doubt realize, checking that a string is ASCII is cheap. It is easy to program.

lemire commented 4 years ago

@casperisfine

I'd like to know more about your expectations. Do you expect the JSON inputs to be all ASCII?

casperisfine commented 4 years ago

@lemire I need to leave my desk soon but I'll try to summarize quickly, I'll extend my answer tomorrow.

In short Ruby have symbols (basically interned strings) and you can build them fast from C extensions with little GC overhead. However that cheap method is only available and valid for pure ASCII strings. If you provide it with with a string containing characters outside the ASCII range, you'll end up with a pretty much broken object.

That's why I opened this PR, to do this in a more correct, but unfortunately slower, way. So I was trying to see if I could conciliate both with a cheap check for ASCII only keys. I did it this way: https://github.com/anilmaurya/fast_jsonparser/commit/c14d0e768d9ac3b425d5ee8bd9f1093ab4822197 but the benchmark didn't improve. Now it's very possible I made a mistake either in the implementation (I'm mostly a C/C++ newbie) or in the benchmark.

I also didn't expect SIMD to be any help for JSON keys.

lemire commented 4 years ago

@casperisfine You can probably do a bit better regarding the ASCII check. Check my blog post: Avoid character-by-character processing when performance matters.

It is not going to be a massive gain, but it might push it above your desired threshold.

casperisfine commented 4 years ago

Ok, so I ran the benchmark again, and honestly the difference between rb_intern2 and rb_intern_str(rb_utf8_str_new()) isn't that big at all. I must have done something wrong in my benchmark, or simply had bad variance.

Thanks for the advices!

anilmaurya / fast_jsonparser

Properly handle UTF-8 keys and values #8