Slow decoding of a large typed array

dvzubarev commented 4 years ago

Hi, I was experimenting with typed arrays (decoding an array of 13M floats). Here is my code: https://gist.github.com/dvzubarev/e153a437305066bb1aa1f4865bbbeb1d

I compiled with these flags: g++ -O2 -g -I ~/installed_thirdparty/jsoncons/jsoncons-0.143.1/include bench.cpp

It took 2 s to decode a typed array of float elements. I realize it isn't fair to compare this with my naive Python decoding code, but that code is still much faster: 0.025 s (see file bench.py).

My implementation is suitable only for this specific case and does not cover all the cases this library does. But looking at the jsoncons code, I noticed that typed arrays are treated element-wise, like ordinary CBOR arrays, in the decode function of the ser_traits struct (when is_typed_array == true). I added v.reserve(13100100) to this function, and that change alone speeds up decoding considerably. But I'm not sure it's possible to get the array length at this stage.
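To illustrate why the reserve call helps, here is a minimal sketch that is independent of jsoncons. The function name `append_elementwise` and its parameters are hypothetical; it only mimics an element-wise decoder that appends one value at a time and counts how often the vector's capacity changes. For CBOR typed arrays the payload is a tagged byte string (RFC 8746), so with a definite-length byte string the element count can in principle be derived as byte length divided by element size; whether that value is reachable inside ser_traits is a separate question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not jsoncons code): copies `src` one element at a
// time, the way an element-wise decoder would. If `known_length` is nonzero,
// capacity is reserved up front. `reallocations` reports how many times the
// vector's capacity changed while appending.
std::vector<float> append_elementwise(const std::vector<float>& src,
                                      std::size_t known_length,
                                      std::size_t& reallocations)
{
    assert(known_length == 0 || known_length >= src.size());

    std::vector<float> out;
    if (known_length != 0)
    {
        out.reserve(known_length);
    }

    reallocations = 0;
    std::size_t cap = out.capacity();
    for (float x : src)
    {
        out.push_back(x);
        if (out.capacity() != cap)  // capacity grew, so storage was reallocated
        {
            ++reallocations;
            cap = out.capacity();
        }
    }
    return out;
}
```

Called with `known_length == 0`, the vector grows repeatedly and copies its contents on each growth; called with the real length, no reallocation happens during the loop.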

My question is: is it possible to process a typed array as a whole rather than element-wise? I think that should greatly speed up decoding.

The same question applies to the endianness conversion code in cbor_parser.hpp. AFAIU, an element-wise copy of the whole array always occurs in cbor_parser.hpp when converting to native endianness. Is it possible to skip this conversion altogether when decoding LE floats on a little-endian machine?

danielaparker commented 4 years ago

Thanks for the feedback. Will investigate.

ecorm commented 4 years ago

It seems the is_typed_array specialization of ser_traits calls ser_traits_default::decode for each element. ser_traits_default::decode ends up instantiating a json_decoder for each element. Perhaps these instantiations are the bottleneck, and the json_decoder needs to be hoisted out of the loop.

danielaparker commented 4 years ago

For your case, where endianness and type are compatible between the CBOR typed array and the output vector, it should just be a resize and a memcpy. I should have that up on master in the next day or two.
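The "resize and a memcpy" idea can be sketched roughly as follows. This is not the code that landed in jsoncons; `host_is_little_endian` and `decode_le_floats` are hypothetical names, and the sketch assumes a 32-bit IEEE 754 `float` and a payload of little-endian binary32 values. When the host byte order matches the payload, the whole buffer is copied in one call; otherwise each element is reassembled byte by byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Runtime check of the host byte order (works in C++11 and later).
inline bool host_is_little_endian()
{
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Hypothetical helper (not part of jsoncons): decodes `nbytes` of
// little-endian binary32 data into a vector<float>.
std::vector<float> decode_le_floats(const std::uint8_t* data, std::size_t nbytes)
{
    static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
    assert(nbytes % sizeof(float) == 0);

    std::vector<float> out(nbytes / sizeof(float));  // the "resize"
    if (out.empty())
    {
        return out;
    }

    if (host_is_little_endian())
    {
        // Fast path: byte order already matches, copy the block in one go.
        std::memcpy(out.data(), data, nbytes);
    }
    else
    {
        // Slow path: rebuild each value from its little-endian bytes.
        for (std::size_t i = 0; i < out.size(); ++i)
        {
            const std::uint8_t* p = data + i * sizeof(float);
            const std::uint32_t bits =
                static_cast<std::uint32_t>(p[0]) |
                (static_cast<std::uint32_t>(p[1]) << 8) |
                (static_cast<std::uint32_t>(p[2]) << 16) |
                (static_cast<std::uint32_t>(p[3]) << 24);
            std::memcpy(&out[i], &bits, sizeof(bits));
        }
    }
    return out;
}
```

Using `std::memcpy` rather than a pointer cast avoids alignment and aliasing problems when the payload sits at an arbitrary offset inside the input buffer.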

More generally, the idea of ser_traits is to allow decoding into a std::map or std::vector without requiring that the entire input be first decoded into one big basic_json value, but you're certainly right that the current implementation is inefficient. That needs work too. But for the typed arrays, we won't need basic_json. In a day or two.

Daniel

danielaparker commented 4 years ago

Can you try with the code on master?

Thanks, Daniel

dvzubarev commented 4 years ago

Thank you, it is much faster now (from ~2 s to 179 ms). According to Callgrind, there are now two costly functions:

(76.45%) get_byte_string(std::error_code&)::{lambda(unsigned long, std::error_code&)
(17.92%) handle_byte_string

Not sure whether it is worth optimizing these functions.

callgrind annotations

```
1,096,630,844 (100.0%) ???:0x0000000000001090 [/lib/x86_64-linux-gnu/ld-2.27.so]
1,094,459,805 (99.80%) ???:_start [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
1,094,459,794 (99.80%) /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:(below main) [/lib/x86_64-linux-gnu/libc-2.27.so]
1,094,349,240 (99.79%) /home/denin/SharedWorkspace/work/notes/langs/cpp/serialization/bench.cpp:main [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
1,041,871,748 (95.01%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons_ext/cbor/cbor.hpp:std::enable_if >, void>::value, std::vector > >::type jsoncons::cbor::decode_cbor > >(std::vector > const&) [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
1,041,871,169 (95.01%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons_ext/cbor/cbor_cursor.hpp:std::enable_if >, void>::value, std::vector > >::type jsoncons::cbor::decode_cbor > >(std::vector > const&)
1,034,909,374 (94.37%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons_ext/cbor/cbor_cursor.hpp:jsoncons::cbor::basic_cbor_cursor >::next() [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
1,034,909,328 (94.37%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons_ext/cbor/cbor_parser.hpp:jsoncons::cbor::basic_cbor_parser >::parse(jsoncons::basic_json_content_handler&, std::error_code&) [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
1,034,909,276 (94.37%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons_ext/cbor/cbor_parser.hpp:jsoncons::cbor::basic_cbor_parser >::read_item(jsoncons::basic_json_content_handler&, std::error_code&) [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
  838,406,903 (76.45%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons_ext/cbor/cbor_parser.hpp:jsoncons::cbor::basic_cbor_parser >::get_byte_string(std::error_code&) [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
  838,406,859 (76.45%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons_ext/cbor/cbor_parser.hpp:void jsoncons::cbor::basic_cbor_parser >::iterate_string_chunks >::get_byte_string(std::error_code&)::{lambda(unsigned long, std::error_code&) [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
  419,203,206 (38.23%) /usr/include/c++/7/bits/stl_vector.h:void jsoncons::cbor::basic_cbor_parser >::iterate_string_chunks >::get_byte_string(std::error_code&)::{lambda(unsigned long, std::error_code&)
  314,402,413 (28.67%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons/source.hpp:void jsoncons::cbor::basic_cbor_parser >::iterate_string_chunks >::get_byte_string(std::error_code&)::{lambda(unsigned long, std::error_code&)
  196,501,950 (17.92%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons_ext/cbor/cbor_parser.hpp:jsoncons::cbor::basic_cbor_parser >::handle_byte_string(jsoncons::basic_json_content_handler&, jsoncons::byte_string_view const&, std::error_code&) [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
  104,801,116 ( 9.56%) /usr/include/c++/7/ext/new_allocator.h:void jsoncons::cbor::basic_cbor_parser >::iterate_string_chunks >::get_byte_string(std::error_code&)::{lambda(unsigned long, std::error_code&)
   52,400,430 ( 4.78%) /usr/include/c++/7/bits/stl_algobase.h:main
   52,400,423 ( 4.78%) /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms [/lib/x86_64-linux-gnu/libc-2.27.so]
   52,400,414 ( 4.78%) /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_erms [/lib/x86_64-linux-gnu/libc-2.27.so]
   52,400,402 ( 4.78%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons/config/binary_config.hpp:jsoncons::cbor::basic_cbor_parser >::handle_byte_string(jsoncons::basic_json_content_handler&, jsoncons::byte_string_view const&, std::error_code&)
   39,300,303 ( 3.58%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons/staj_reader.hpp:jsoncons::cbor::basic_cbor_parser >::handle_byte_string(jsoncons::basic_json_content_handler&, jsoncons::byte_string_view const&, std::error_code&)
    6,959,897 ( 0.63%) /home/denin/installed_thirdparty/jsoncons/jsoncons-master/include/jsoncons/staj_reader.hpp:jsoncons::basic_staj_event_handler::dump(jsoncons::basic_json_content_handler&, jsoncons::ser_context const&, std::error_code&) [/home/denin/Yandex.Disk/workspace/work/notes/langs/cpp/serialization/a.out]
```

danielaparker commented 4 years ago

That part of the code can be improved. I'll look into it.

danielaparker commented 4 years ago

If you're still following this, I think you'll find the code that's currently on master to be about 30 percent faster. I'll leave it at that.

danielaparker / jsoncons

Slow decoding of a large typed array #205