beached / daw_json_link

Fast, convenient JSON serialization and parsing in C++
https://beached.github.io/daw_json_link/
Boost Software License 1.0
475 stars 31 forks source link

Update to_chars to use custom JSON specific one #161

Closed friendlyanon closed 3 years ago

friendlyanon commented 4 years ago

https://github.com/jk-jeon/dragonbox
The benchmarks show better conversion times for the general (smaller number of digits) case compared to Ryu.

I'm in the process of assisting moving it to CMake (https://github.com/jk-jeon/dragonbox/issues/4), so maybe this could be something to consider for v3.0

beached commented 4 years ago

It’s been in dev for a few days already:) https://github.com/beached/daw_json_link/tree/develop/include/third_party/dragonbox

friendlyanon commented 4 years ago

I don't want to reopen this, but I'd like to let you know that dragonbox can now be trivially consumed using FetchContent!

beached commented 4 years ago

It builds with a bunch of warnings on my system and i had to make changes to mark all the methods inline as i needed it header only. The warnings were c style cast related and i replaced those with static casts

friendlyanon commented 4 years ago

There is an opt-out warning guard.

Would it really be an issue to have the dragonbox::dragonbox_to_chars target as an INTERFACE dependency of json_link?

beached commented 4 years ago

Ideally it would not be a dep at all, but to_chars is not a thing in all c++17 implementations. I am resistant to any third party deps right now, especially new ones, as they cause breakage and more work when changed happen upstream. It would be a large change to not be header only too.

Without a compelling reason for the change i don’t want to. What is gained?

friendlyanon commented 4 years ago

I am merely contemplating the viability of keeping things header only over having a static library as usage requirement.
You may chalk it up to my inexperience when it comes to this, but I don't see how that would change more than what an already planned major version upgrade could bear, especially when the CMakeLists allow for trivial consumption.

beached commented 4 years ago

The header only is part of it. The other is that it introduces an uncontrolled dependency and I have been burned by that in the past. In the future, this whole, d2s dependency is going away once libstdc++ and libc++ have to_chars implemented properly for double/float

jk-jeon commented 4 years ago

FWIW, it is not super difficult to write your own to_chars using jkj::dragonbox::to_decimal. I can help you if you are willing to do this. You can achieve more flexibility, more safety/error handling stuffs, etc.. And I bet you can easily beat the performance of jkj::dragonbox::to_chars as JSON disallows infinities and NaN's. Also, I personally think the default output of things like 0.02 should be 0.02, not 1E-2 which is the output of jkj::dragonbox::to_chars.

Here is a quick sample code:

char* to_chars(double x, char* buffer) {
  auto br = jkj::dragonbox::ieee754_bits(x);
  assert(br.is_finite());

  if (br.is_negative()) {
    *buffer = '-';
    ++buffer;
  }

  // We already have dealt with the sign so ignore it.
  auto dec = jkj::dragonbox::to_decimal(x, jkj::dragonbox::policy::sign::ignore);

  // You may need to treat +/- zeros specially. There are several ways:
  // 1. check if (br.is_nonzero()).
  // 2. check if (dec.significand == 0)
  // 3. Flip the sign bit of br.u (which is of type uint64_t) inside the above if block, and check if (br.u == 0).
  //    I guess this is the fastest method.

  // Now we have x = +/- (dec.significand) * 10^(dec.exponent). Then:
  //
  // 1. First compute the length of dec.significand which is of type uint64_t.
  //    My way was to throw bunch of if's like if (dec.significand >= 1'00000000'00000000) { return 17; },
  //    but what is the best way depends on your use-case.
  // 2. Determine between fixed-point vs scientific formats using the length determined above and
  //    dec.exponent. A popular way seems to choose whichever that is shorter, and prefer fixed-point
  //    whenever the resulting lengths are the same. But that requires you to compute the length of
  //    exponent as well, which ranges from 1 to 4 including the sign '-'/'+'.
  // 3. Print dec.significand, possibly with '.', possibly with `e` or `E` and dec.exponent according
  //    to the chosen format. Increment buffer accordingly.
  //
  // You may find more information by looking at `dragonbox_to_chars.h` and `dragonbox_to_chars.cpp`.

  return buffer;
}
beached commented 4 years ago

OK, this interests me. I don't need to worry about Infinity/NaN at this point in the code, true. But it is an option when encoding to/from string wrapped numbers. I had it in the back of my mind to make it prettier for things with small exponents.

I like the idea of it being potentially faster and prettier.

beached commented 4 years ago

So thinking about this, and totally arbitrary, but seems like anything with some number, say 5, or less significant digits should be in fixed.

/joking/ just do all of them as an integer for significant digits and an exponent. :)

jk-jeon commented 4 years ago

You can't really ignore the exponent and focus only on the number of significand digits. For example, nobody would be happy if 1e-100 is printed in the fixed-point format. Both the number of significand digits and the exponent should be taken to account.

I know of two popular ways of determining between scientific vs fixed-point formats.

The first one is to choose whatever shorter in length, and prefer fixed-point on tie. This seems to be the common strategy, e.g., for implementing JavaScript engines, and also IIRC this is the way proposed by the classical Dragon4 paper.

The second one is to follow C/C++ standards' general floating-point printing rule; see, e.g., https://en.cppreference.com/w/cpp/io/c/fprintf (AFAIK, other printing features, like std::basic_ostream and std::to_chars, follow the same rule).

In a nutshell, according to my understanding, this rule prefers the fixed-point format if and only if the exponent, when represented in the scientific form, is from -4 to 5, and prefers the scientific format otherwise. So for example, 0.000123456789101112 should be in the fixed-point form, because in the scientific form it would be 1.23456789101112e-4, and the exponent -4 is in the range [-4,5]. For another example, 1234560 should be in the scientific form, because in the scientific form it would be 1.23456e6, and the exponent 6 is out of the range [-4,5].

Personally, I prefer the second option because, well, this is a C++ library, although I honestly can't agree that 1.23456e6 is better than 1234560. If you are planning to replace Dragonbox with std::to_chars in a future, this is the way to go. But in the end it is just up to you :)

beached commented 3 years ago

Implemented via https://github.com/beached/daw_json_link/pull/224/commits/41d50deee3f37777d9d3ef4ea6bc540b7f879218