boostorg / json

A C++11 library for parsing and serializing JSON to and from a DOM container in memory.
https://boost.org/libs/json
Boost Software License 1.0
431 stars 95 forks

Feature Request: Parse numbers as strings #872

Open cryptochassis opened 1 year ago

cryptochassis commented 1 year ago

When working with various counterparties dealing with monetary systems, we found that, more often than not, we'd receive JSON strings like [1.2345] instead of ["1.2345"]. If we parse that as a double, we might lose precision in some cases. To preserve precision, we have to parse that number as a string. rapidjson offers a solution by providing kParseNumbersAsStringsFlag: https://rapidjson.org/namespacerapidjson.html#a81379eb4e94a0386d71d15fda882ebc9a13981c0b803803f59d7a01aef3dfc987. Interestingly enough, the Python standard json library also offers the capability to parse numbers as strings: https://docs.python.org/3/library/json.html#json.load (see the parse_float and parse_int parameters). We are looking into migrating to the Boost JSON library. Parsing numbers as strings is key for us to preserve monetary precision. Thank you.
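To illustrate the precision concern, here is a minimal sketch that sends decimal text through a double and back. The function name and the sample value are hypothetical and are not taken from the issue.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Convert decimal text to a double, then print it back with two decimals.
// If the text survives unchanged, the double held it exactly enough.
std::string round_trip_through_double(const std::string& text) {
    const double value = std::strtod(text.c_str(), nullptr);
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%.2f", value);
    return std::string(buffer);
}
```

A large amount such as 12345678901234567.89 has more significant digits than a double can hold, so the returned text differs from the input, while a small value like 1.25 comes back unchanged. Keeping the original characters avoids the question entirely.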

vinniefalco commented 1 year ago

Possible in theory, if we add it to the parse options. They will come in as strings. However, we can consider adding a flag to json::string somewhere (if we can find a spare bit) which indicates that the string contains a valid number. This should not affect performance if the option is not set.

grisumbras commented 1 year ago

We technically already support this for parsing. Just use basic_parser with a custom handler. The caveat is that this is way more complicated than it should have been. We could make detail::handler public, and document how to override its functions to achieve custom handling of only a subset of parsing events.

The more complicated part of the equation is serialisation. We don't have a customisable serialiser. On the other hand, custom serialisation is very easy to implement with iostreams.

So, no special bit for "this is actually a number" is required. BTW, I am sceptical that such a change would not affect performance, even if only in a minor way.

@cryptochassis do you only need this special handling for parsing? Is using basic_parser with a custom handler enough for you?

grisumbras commented 1 year ago

Here's an example of what I meant: https://godbolt.org/z/KE7YK7h97

cryptochassis commented 1 year ago

@grisumbras Very sorry for the late reply. I completely missed your previous messages. Yes, we only need this special handling for parsing. Using basic_parser with a custom handler seems to be sufficient. Thanks a lot for providing a concrete example. One question: in the example, when the parser encounters a number, say, a double, will it still call std::stod behind the scenes? Because we are a high-frequency-trading code provider, performance is of utmost importance to us. Without calling std::stod, I'd guess it would save a lot of CPU time.

grisumbras commented 1 year ago

The number will still be parsed. But our parser doesn't call std::strtod (or any other standard number parsing utility for that matter). We use custom number parsing functions, so maybe it will be fast enough for you.

Also, this made me think we might want a parser option that disables number parsing outright.

cryptochassis commented 1 year ago

We parse millions of JSON messages per second, so skipping string-to-number conversion would probably have a visible impact on our system's performance. We'd appreciate a parser option that disables number parsing. Many thanks!

vinniefalco commented 1 year ago

If you want the highest performance, why don't you use simdjson? Do you need the ability to modify the JSON values?

cryptochassis commented 1 year ago

We don't need the ability to modify the JSON values. When we first started developing our library in 2019 and published its first version, simdjson wasn't available. Based on our best judgement at the time, we picked rapidjson. We are ourselves a library rather than an end-user application. The reason we are now aiming to migrate to Boost JSON instead of simdjson is that a sizable part of our current users (or those who are thinking about using our library) come from a Python background and are therefore at a beginner to intermediate level in C++. They need a simple way to get started building their applications using our library. The simplest way is to rely only on the header-only components of Boost and nothing else. And we are getting closer to that: currently we depend only on Boost, websocketpp, and rapidjson. We have almost finished moving away from websocketpp by using your Beast websocket. So the only thing left to trim is rapidjson, after which our only dependency will be the header-only components of Boost. To sum up, the reason is to achieve a good balance between performance and usability for a wide audience with vastly different levels of C++ proficiency.

vinniefalco commented 1 year ago

Wow... that rationale is actually rather perfect :)

cryptochassis commented 1 year ago

> The number will still be parsed. But our parser doesn't call std::strtod (or any other standard number parsing utility for that matter). We use custom number parsing functions, so maybe it will be fast enough for you.
>
> Also, this made me think we might want a parser option that disables number parsing outright.

Let me know whether we can have such a parser option. Thanks a lot.

grisumbras commented 1 year ago

An option to disable number parsing outright? I have a PR for that (#901). IIRC, my benchmarking shows that for number-heavy inputs the speed of parsing increases by 80% (but don't quote me on that). @vinniefalco should I pursue it?

grisumbras commented 1 year ago

To be clear, it still does a sort of number validation (we need that to know where the number ends and the parser should start parsing another value); it just doesn't convert the characters into a number.
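As a rough illustration of validating without converting, the sketch below walks the JSON number grammar only to find where a number ends. It is a standalone example with a hypothetical function name, not the library's implementation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Return the length of the JSON number starting at text[pos], or 0 if the
// characters there do not form a valid number. No numeric value is computed.
std::size_t scan_json_number(const std::string& text, std::size_t pos) {
    const std::size_t start = pos;
    const std::size_t n = text.size();
    auto is_digit = [&](std::size_t i) {
        return i < n && std::isdigit(static_cast<unsigned char>(text[i])) != 0;
    };
    if (pos < n && text[pos] == '-') ++pos;   // optional minus sign
    if (!is_digit(pos)) return 0;             // integer part is mandatory
    if (text[pos] == '0') {
        ++pos;                                // a leading zero stands alone
    } else {
        while (is_digit(pos)) ++pos;
    }
    if (pos < n && text[pos] == '.') {
        if (!is_digit(pos + 1)) return 0;     // fraction needs a digit
        ++pos;
        while (is_digit(pos)) ++pos;
    }
    if (pos < n && (text[pos] == 'e' || text[pos] == 'E')) {
        std::size_t p = pos + 1;
        if (p < n && (text[p] == '+' || text[p] == '-')) ++p;
        if (!is_digit(p)) return 0;           // exponent needs a digit
        pos = p;
        while (is_digit(pos)) ++pos;
    }
    return pos - start;
}
```

The caller can then copy `text.substr(pos, length)` as the number's text and resume parsing at `pos + length`, which is the kind of work that remains when conversion is skipped.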

vinniefalco commented 1 year ago

It's an interesting mode.

cryptochassis commented 1 year ago

> An option to disable number parsing outright? I have a PR for that (#901). IIRC, my benchmarking shows that for number-heavy inputs the speed of parsing increases by 80% (but don't quote me on that). @vinniefalco should I pursue it?

Perfect. Looking forward to the finalization. Thanks a lot.

grisumbras commented 1 year ago

#901 has been merged into develop.

grisumbras commented 1 year ago

Local benchmarking results:

                                | imprecise | precise       | none
Parse gcc   apache_builds.json  |       754 |  753   -0.13% |  753   -0.13%
Parse gcc   canada.json         |       587 |  400  -31.86% | 1064   81.26%
Parse gcc   citm_catalog.json   |      1231 | 1232    0.08% | 1344    9.18%
Parse gcc   github_events.json  |       837 |  845    0.96% |  850    1.55%
Parse gcc   gsoc-2018.json      |       975 |  977    0.21% |  974   -0.10%
Parse gcc   instruments.json    |       630 |  640    1.59% |  659    4.60%
Parse gcc   marine_ik.json      |       531 |  404  -23.92% |  654   23.16%
Parse gcc   mesh.json           |       532 |  402  -24.44% |  690   29.70%
Parse gcc   mesh.pretty.json    |       996 |  758  -23.90% | 1370   37.55%
Parse gcc   numbers.json        |       818 |  494  -39.61% | 1814  121.76%
Parse gcc   random.json         |       383 |  384    0.26% |  385    0.52%
Parse gcc   twitter.json        |       521 |  524    0.58% |  530    1.73%
Parse gcc   twitterescaped.json |       478 |  474   -0.84% |  488    2.09%
Parse gcc   update-center.json  |       660 |  664    0.61% |  663    0.45%
Parse clang apache_builds.json  |       757 |  750   -0.92% |  751   -0.79%
Parse clang canada.json         |       613 |  378  -38.34% |  905   47.63%
Parse clang citm_catalog.json   |      1225 | 1196   -2.37% | 1234    0.73%
Parse clang github_events.json  |       800 |  793   -0.88% |  807    0.88%
Parse clang gsoc-2018.json      |       721 |  721    0.00% |  717   -0.55%
Parse clang instruments.json    |       674 |  653   -3.12% |  664   -1.48%
Parse clang marine_ik.json      |       532 |  400  -24.81% |  607   14.10%
Parse clang mesh.json           |       557 |  418  -24.96% |  708   27.11%
Parse clang mesh.pretty.json    |      1086 |  771  -29.01% | 1373   26.43%
Parse clang numbers.json        |       854 |  524  -38.64% | 1742  103.98%
Parse clang random.json         |       377 |  371   -1.59% |  372   -1.33%
Parse clang twitter.json        |       556 |  558    0.36% |  557    0.18%
Parse clang twitterescaped.json |       463 |  470    1.51% |  468    1.08%
Parse clang update-center.json  |       594 |  597    0.51% |  594    0.00%

With number parsing disabled (the none column, relative to imprecise), canada.json is +81% on GCC and +48% on clang, and numbers.json is +122% on GCC and +104% on clang.
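The percentages in the table appear to be changes relative to the imprecise column. A small sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Relative change of `measured` against `baseline`, in percent.
double percent_change(double baseline, double measured) {
    return (measured - baseline) / baseline * 100.0;
}
```

For the GCC canada.json row, `percent_change(587, 1064)` gives about 81.26 and `percent_change(587, 400)` gives about -31.86, matching the none and precise columns.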