danielaparker / jsoncons

A C++, header-only library for constructing JSON and JSON-like data formats, with JSON Pointer, JSON Patch, JSON Schema, JSONPath, JMESPath, CSV, MessagePack, CBOR, BSON, UBJSON
https://danielaparker.github.io/jsoncons

Temporary allocations from std::stable_sort even when using ojson #391

Closed mxmlnkn closed 1 year ago

mxmlnkn commented 1 year ago

Thanks for providing this library!

I am trying to use it because of its memory-efficient streaming capabilities for very large JSON files.

However, the parsing performance seems slow to me: it takes ~60 s for 2.5 GiB, i.e. roughly 45 MB/s. That seems slow compared to the (untested by me) multiple GB/s reported for simdjson; even gzip decompression is several times faster (> 200 MB/s).

I tried to look a bit into performance bottlenecks, and the time is spent almost equally in basic_json_parser::parse_some, basic_json_parser::parse_string, and jsoncons::unicode_traits::is_legal_utf8. There is not much I can do about that. I also took a look with heaptrack and was surprised to see millions of temporary allocations from inside std::stable_sort, called from inside jsoncons::json.

Ok, so I found that it sorts keys alphabetically. I find this a questionable default because it sacrifices performance for something I don't need, but that is fine as long as there is an alternative. Next, I tried jsoncons::ojson, which is said to preserve the original order, thinking that this should get rid of the problematic std::stable_sort.

But jsoncons::ojson also calls std::stable_sort unexpectedly.

I tried copy-pasting the code of order_preserving_json_object and order_preserving_policy into a patched_jsoncons namespace and replaced the std::stable_sort inside order_preserving_json_object::build_index with std::sort, and voilà, the millions of temporary allocations from std::stable_sort are gone. I'm still seeing millions of temporary allocations inside build_index for the _index.reserve call, though; I'm not sure why heaptrack flags those as temporary. The runtime improvement for a 2.5 GiB Chrome trace JSON file is unfortunately not significant, but those allocations still seem wasteful.

Now, I'm not 100% sure why std::stable_sort is used; it seems unnecessary to me. The keys have to be sorted somehow because std::unique is called at some point. (Sorting inside build_index and then calling std::unique outside, assuming that build_index did sort, seems hard to read to me; I think the sort and the call to std::unique should be right next to each other.)

What compiler, architecture, and operating system?

What jsoncons library version?

danielaparker commented 1 year ago

Thanks for the feedback.

Are you building a json (ojson) value for the entire 2.5 GB file? An alternative is to use the pull streaming API, specifically json_stream_cursor or json_string_cursor.

There are limitations to performance and memory efficiency when reading large JSON texts into trees of individually mutable JSON values (the jsoncons approach). It would be far more performant and memory efficient to read them into a single parsed document and allow only immutable access to its elements (the simdjson approach). We've thought about supporting that as well. Is mutability of the in-memory representation a requirement for you?

Thanks for the comments, specifically on std::stable_sort. I think we need to provide a custom stable sort implementation to take control over memory usage, either an in-place stable sort or one that uses a shared buffer.

Some of the things you're observing are related to how we handle the possibility of duplicate keys in the JSON text, and our requirement to keep the last one received in the parsed json value. We may need to revisit that.

json and ojson are implemented as flat maps: json is sorted and uses binary search for access; ojson maintains the original order and keeps a separate sorted array of keys to support access through binary search.

We have in progress a ujson (u for unordered) that uses an efficient flat-memory unordered map for JSON objects, following robin-hood-hashing. That may make it into the next release.

Daniel

mxmlnkn commented 1 year ago

> Are you building a json (ojson) value for the entire 2.5 GB file? An alternative is to use the pull streaming API, specifically json_stream_cursor or json_string_cursor.

I'm using the streaming API. However, I also delegate certain small nested objects to be parsed into full JSON values.

> There are limitations to performance and memory efficiency when reading large JSON texts into trees of individually mutable JSON values (the jsoncons approach). It would be far more performant and memory efficient to read them into a single parsed document and allow only immutable access to its elements (the simdjson approach). We've thought about supporting that as well. Is mutability of the in-memory representation a requirement for you?

Mutability is not a requirement. However, as the data can become arbitrarily large, the streaming approach is a requirement, and simdjson does not offer that. There seems to be a SAX-style API, but I don't think it is public yet. If the immutable parsing approach could be used for the small nested objects while still streaming over my large array of objects, that could be nice. But I'm not sure how well that would translate into performance gains; the objects inside the large array are ~256 B each.

> Thanks for the comments, specifically on std::stable_sort. I think we need to provide a custom stable sort implementation to take control over memory usage, either an in-place stable sort or one that uses a shared buffer.

As the notes on cppreference state, you could still use std::stable_sort but specify an allocator that always fails to allocate. In that case it falls back to a memory-efficient O(N log²(N)) algorithm instead of the standard O(N log(N)) one. Unfortunately, the allocator does not seem to be specifiable via a template argument, but maybe something could be done by overloading operator new inside a dedicated namespace. It might indeed be easier to roll a custom stable sort implementation.

> json and ojson are implemented as flat maps: json is sorted and uses binary search for access; ojson maintains the original order and keeps a separate sorted array of keys to support access through binary search.

In my case, the index seems unnecessary because the number of keys is small (4 to 6). For such sizes a linear search should be the most efficient anyway, so a simple std::find instead of std::equal_range works fine. The sorting might still be necessary to account for duplicates, though. By the way, I don't see why std::equal_range is used at all; only the first element of the found range is returned by the call to at.

> We have in progress a ujson (u for unordered) that uses an efficient flat-memory unordered map for JSON objects, following robin-hood-hashing. That may make it into the next release.

Note that std::unordered_map performs an allocation for each key-value pair. In other parts of my program I had to replace it with std::vector because the key-value pairs were quite small.

I also noticed the call to reserve inside order_preserving_json_object::insert while using heaptrack. I'm not sure whether you are aware, but it seems to be used in a kind of antipattern. Quoting cppreference on reserve:

> Correctly using reserve() can prevent unnecessary reallocations, but inappropriate uses of reserve() (for instance, calling it before every push_back() call) may actually increase the number of reallocations (by causing the capacity to grow linearly rather than exponentially) and result in increased computational complexity and decreased performance. For example, a function that receives an arbitrary vector by reference and appends elements to it should usually not call reserve() on the vector, since it does not know of the vector's usage characteristics.
>
> When inserting a range, the range version of insert() is generally preferable as it preserves the correct capacity growth behavior, unlike reserve() followed by a series of push_back()s.

In my case it does not matter, though, because insert is only ever called once by the parsing algorithm, when the object itself is still empty. But in general, the call to reserve might better be replaced with a combination of resize and std::transform, I think. Benchmarks would be needed to confirm that this works as well as the suggested range-insert method.

mxmlnkn commented 1 year ago

For the approach of iterating in a streaming fashion over the large array and using an immutable one-shot parser for the nested objects, another benefit would be that the string_views jsoncons already returns would have a better-defined lifetime. Currently I iterate over the SAX-style values, and once a nested object is fully parsed, I process it. However, my assumption is that a string_view's lifetime ends as soon as cursor::next has been called. This means I have to make some unnecessary string copies, because I have no control over the underlying I/O buffer into which the string views presumably point. If the variant-like JSON object returned by the one-shot parser behaved like a pure view into the input buffer, except perhaps for integer values and the like, those string copies could be avoided altogether. Whether that improves runtime is a separate matter, though.

danielaparker commented 1 year ago

Thanks for the comments. I'll have to think about this.

Incidentally, ujson wouldn't use std::unordered_map.

mxmlnkn commented 1 year ago

I also had to implement stable sorting in my own code, and std::stable_sort was 3x slower than a simple std::sort. My solution was to add an index member and then sort with it as a secondary sort key:

for ( size_t i = 0; i < values.size(); ++i ) {
    values[i].index = i;
}
std::sort( 
    values.begin(), values.end(),
    [] ( const auto& a, const auto& b ) {
        return std::make_tuple( a.sortKey, a.index ) < std::make_tuple( b.sortKey, b.index );
    }
);

This is ~50% slower than std::sort on only one key but still almost 2x faster than std::stable_sort.

danielaparker commented 1 year ago

Following @mxmlnkn's suggestion, json_decoder now uses code that performs stable sorting with an index and std::sort; the code is on master. That should address the issue of temporary allocations from std::stable_sort during deserialization into json or ojson values.