Closed: zann1x closed this issue 4 months ago
@zann1x , what exactly are you benchmarking here? Can you share the code? Could you also share how you compiled it (including optimizations, etc)?
I have done some benchmarks on this myself, and I didn't even come close to the kind of overhead you are seeing here.
But I think it would be worthwhile to profile this using gperftools or something similar to really get to the bottom of this.
The benchmarks can be found here: https://github.com/zann1x/reflect-cpp/commit/a7be8b0047ae289ae0f7ba825d81c8e989b2b26f
I built it with
$ cmake -S . -B build/linux-release -DCMAKE_BUILD_TYPE=Release -DREFLECTCPP_CBOR=ON -DREFLECTCPP_BUILD_BENCHES=ON
$ cmake --build build/linux-release/
Admittedly, I didn't quite recreate the same calls to tinycbor as used in reflect-cpp. I've updated the benchmarks a bit to be more comparable here. The current benchmark results (with the code linked above) are
$ ./build/linux-release/benches/cbor/reflect-cpp-cbor-benches
2024-03-05T22:12:25+01:00
Running ./build/linux-release/benches/cbor/reflect-cpp-cbor-benches
Run on (4 X 400 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x2)
L1 Instruction 32 KiB (x2)
L2 Unified 512 KiB (x2)
L3 Unified 4096 KiB (x1)
Load Average: 0.40, 0.31, 0.18
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Write 260 ns 260 ns 2693909
BM_Write_tinycbor 107 ns 107 ns 6561436
BM_Read 982 ns 982 ns 713135
BM_Read_tinycbor 284 ns 284 ns 2466709
@zann1x , interesting, thank you so much for sharing.
I will play with this, but one of the things that I immediately noticed is that you are not doing any error checking on the read operation in your raw TinyCBOR. This creates an unfair advantage for your raw TinyCBOR.
I don't think you should ever skip error handling for reads in production code; otherwise your program could crash if the input does not conform to your expectations. The TinyCBOR documentation is explicit about this:
“Error conditions must not be ignored. All decoder functions have undefined behavior if called after an error has been reported, and may crash.”
https://intel.github.io/tinycbor/current/a00047.html
Likewise, you should check your types before reading them:
“Some functions are also documented to have preconditions, like cbor_value_get_int() requiring that the input be an integral value. Violation of preconditions also results in undefined behavior and the program may crash.”
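To make the requirement concrete, the checked pattern the documentation asks for can be sketched with a stdlib-only stand-in (deliberately not the TinyCBOR API; in real code cbor_value_is_integer / cbor_value_get_int would take the place of the mock below): every typed read first checks that no previous operation failed and that the current item has the expected type.

```cpp
#include <cstdint>
#include <optional>
#include <variant>
#include <vector>

// Minimal stand-in for a CBOR value stream; NOT the TinyCBOR API.
// It only illustrates the pattern the docs require: stop on the first
// error and verify the type before every typed read.
using Item = std::variant<std::int64_t, double>;

struct Decoder {
    std::vector<Item> items;
    std::size_t pos = 0;
    bool failed = false;

    // Returns nothing if the stream is exhausted, the type is wrong,
    // or a previous operation already failed.
    std::optional<std::int64_t> read_int() {
        if (failed || pos >= items.size() ||
            !std::holds_alternative<std::int64_t>(items[pos])) {
            failed = true;  // poison further reads, mirroring the docs
            return std::nullopt;
        }
        return std::get<std::int64_t>(items[pos++]);
    }
};
```

The per-read branches are exactly the work the unchecked benchmark skips, which is part of why it looks faster.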
The other thing is that our read code is a bit more tolerant about types. So for instance, if the value in the input file were an integer it would just take that and cast it to a float.
I don’t think this fully explains the difference, but it is certainly part of the explanation.
Another comment I have is that your test struct is extremely small. It may very well be the case that certain one-time operations, such as setting up the reader, account for most of the runtime here.
Maybe we could also try to create a vector of 10,000 structs and repeatedly serialize and deserialize that. I think if we do that, the benchmark results would look different as well.
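A minimal harness along those lines could look like the sketch below; the encoder is a toy stand-in for the real serializer, and all names are illustrative:

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

struct Person {
    std::int32_t age;
    double weight;
};

// Toy stand-in for the real serializer: appends the raw field bytes.
// A real benchmark would call the library's encode function instead.
inline void encode(const Person& p, std::vector<std::uint8_t>& out) {
    const auto old_size = out.size();
    out.resize(old_size + sizeof p.age + sizeof p.weight);
    std::memcpy(out.data() + old_size, &p.age, sizeof p.age);
    std::memcpy(out.data() + old_size + sizeof p.age, &p.weight,
                sizeof p.weight);
}

// Serializes n identical structs, reports the elapsed time, and
// returns the number of bytes produced.
inline std::size_t bench_encode(std::size_t n) {
    std::vector<Person> people(n, Person{30, 72.5});
    std::vector<std::uint8_t> buf;
    buf.reserve(n * (sizeof(std::int32_t) + sizeof(double)));

    const auto start = std::chrono::steady_clock::now();
    for (const auto& p : people) encode(p, buf);
    const auto elapsed = std::chrono::steady_clock::now() - start;

    std::cout << n << " structs -> " << buf.size() << " bytes in "
              << std::chrono::duration_cast<std::chrono::microseconds>(elapsed)
                     .count()
              << " us\n";
    return buf.size();
}
```

With 10,000 structs per iteration, any one-time setup cost is amortized and no longer dominates the per-struct numbers.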
But I will try to build and recreate your benchmarks. I think profiling will help us get to the bottom of this.
@zann1x , great job on setting up the build pipeline for the benchmarks. I was able to reproduce your results with minimal effort.
I have noticed another aspect in which your read benchmark is not quite fair: you never check the field names; you just read the fields assuming that they are in the right order.
One of the major advantages of using field names is that you do not have to make that assumption, and our reader doesn't. Of course, interpreting and matching the strings creates some runtime overhead, but it is very important. Otherwise, you should not be using structs at all...you could just use std::tuple, which would be even more efficient than your solution.
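To illustrate the difference, here is a hand-written sketch of name-based field assignment (hypothetical types; reflect-cpp derives this mapping automatically). It stays correct even when the input delivers the fields in a different order, which a purely positional reader cannot guarantee:

```cpp
#include <string_view>
#include <utility>
#include <vector>

struct Person {
    int age = 0;
    double weight = 0.0;
};

// Name-based assignment: correct regardless of field order in the
// input. An unknown key is reported instead of silently misassigned.
inline bool assign(Person& p, std::string_view key, double value) {
    if (key == "age")    { p.age = static_cast<int>(value); return true; }
    if (key == "weight") { p.weight = value;                return true; }
    return false;
}

inline Person read_by_name(
    const std::vector<std::pair<std::string_view, double>>& input) {
    Person p;
    for (const auto& [key, value] : input) assign(p, key, value);
    return p;
}
```

The string comparisons are the overhead being discussed here; a positional reader skips them but breaks as soon as the writer reorders, adds, or removes a field.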
But the writer benchmark is very interesting. I suspect that the performance difference is due to the fact that we unnecessarily read the field names into a std::string when a raw pointer would do the job. This is something that can be fixed, of course.
Using std::string_view to set the field names does indeed improve the performance of the writer. We are now creeping closer and closer to the raw implementation.
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Write_tinycbor 103 ns 103 ns 5206079
BM_Write 160 ns 160 ns 4326175
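The idea behind the fix, sketched with a toy writer (not the actual reflect-cpp code): since field names are compile-time string literals, a std::string_view can point at them directly, avoiding one std::string allocation per field per write.

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Sketch only: a real CBOR writer would emit a text-string header
// before the name bytes. The point is that std::string_view carries a
// pointer and a length into the literal, so no heap allocation or
// copy of the name happens on this path.
inline void write_field_name(std::string_view name,
                             std::vector<std::uint8_t>& out) {
    out.insert(out.end(), name.begin(), name.end());
}
```

With a std::string parameter instead, every call would copy the literal into a freshly allocated string before the bytes are emitted.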
I know that the benchmarks are not 100% comparable, so I don't expect an equally fast en-/decoding process. Nevertheless, seeing the huge performance difference when reading the struct with tinycbor vs reflect-cpp was very unexpected to me.
The test I have set up here is also not representative of the structs in my specific use case. They are actually much larger, with several nested structs in them. Not using the automatic type reflection there would be a PITA. I simply created a minimal working example for this issue that still shows the problem. For comparison, one of the structs I use had the following benchmark results:
-------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------
BM_Write 4412 ns 4412 ns 194783
BM_Read 19457 ns 19252 ns 37333
Only being able to read ~50k elements per second currently prevents me from using reflect-cpp for production use cases.
Using std::string_view to set the field names does indeed improve the performance of the writer. We are now creeping closer and closer to the raw implementation.
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Write_tinycbor 103 ns 103 ns 5206079
BM_Write 160 ns 160 ns 4326175
This is very promising to see. It would be great to have such relatively minor differences on both reading and writing.
@zann1x , I see.
I have just implemented a more comprehensive fix that consistently uses std::string_view to write the field names. It's still in a feature branch, but I am confident I can merge it into main fairly soon.
I will have to take a closer look at the read process over the next couple of days.
If you can think of a performance optimization as well, feel free to open a PR.
@zann1x , I have tried adding more safety to the code and checking the field names. None of these explains the difference.
My current hypothesis is the following: I use a memoization pattern. The first time a struct of type X is read, I build a hash map mapping the field names to their positions in the struct. But then the hash map is stored for future use. Maybe the benchmark library somehow forces us to build the hash map every time.
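The memoization pattern described above can be sketched roughly like this (hypothetical helper names, not reflect-cpp's actual internals):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical example type; reflect-cpp derives the field names itself.
struct Person {
    int age = 0;
    double weight = 0.0;
    static std::vector<std::string> field_names() { return {"age", "weight"}; }
};

// The name->position map for a type T is built on the first read of T
// and reused for every subsequent read of the same type.
template <class T>
const std::unordered_map<std::string, std::size_t>& field_positions() {
    // A function-local static is initialized exactly once (thread-safe
    // since C++11), so later reads skip the map construction entirely.
    static const std::unordered_map<std::string, std::size_t> map = [] {
        std::unordered_map<std::string, std::size_t> m;
        std::size_t i = 0;
        for (const auto& name : T::field_names()) m.emplace(name, i++);
        return m;
    }();
    return map;
}
```

If the benchmark somehow defeats this caching (for example by re-entering the setup path on every iteration), the map construction cost would show up on every read instead of just the first one.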
I will investigate further and keep you posted.
@zann1x , as promised, here's an update:
It appears that the overhead is mainly due to how C++ optimizes (or fails to optimize) recursive programming. Through some careful experimentation, I have managed to get overhead down to almost negligible levels:
2024-03-11T07:20:31+01:00
Running ./build/linux-release/benches/cbor/reflect-cpp-cbor-benches
Run on (4 X 48 MHz CPU s)
Load Average: 0.58, 0.44, 0.31
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------
BM_Write_tinycbor 106 ns 106 ns 5486243
BM_Write 161 ns 161 ns 4311409
BM_Read_tinycbor_unsafe 263 ns 263 ns 2655862
BM_Read_tinycbor_safe_with_field_names 336 ns 336 ns 2072860
BM_Read 353 ns 353 ns 2003000
As you can see, the overhead of BM_Read over BM_Read_tinycbor_safe_with_field_names (which I would regard as a fair comparison) is now less than 10%.
It's still in a feature branch, not ready for prime time and currently only works for CBOR, but if you want to give it a try yourself, it's this branch:
https://github.com/getml/reflect-cpp/tree/f/read_optimization
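For illustration, one common way to flatten recursive per-field processing is to replace the recursive template with a fold expression over the tuple's elements; this is a generic sketch of the idea, not the actual change in the branch:

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Recursive per-field processing: each field adds another template
// instantiation and a nested call that the optimizer must flatten.
template <std::size_t I = 0, class Tuple, class F>
void for_each_recursive(Tuple& t, F&& f) {
    if constexpr (I < std::tuple_size_v<Tuple>) {
        f(std::get<I>(t));
        for_each_recursive<I + 1>(t, std::forward<F>(f));
    }
}

// Flat alternative: a fold expression over the unpacked elements
// expands to a straight-line sequence of calls, which compilers
// generally optimize more reliably than deep recursion.
template <class Tuple, class F>
void for_each_flat(Tuple& t, F&& f) {
    std::apply([&](auto&... field) { (f(field), ...); }, t);
}
```

Both versions do the same work; the difference is in how much flattening is left to the optimizer.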
Thanks for your quick responses on this issue!
After your optimizations on the writer and reader, the benchmark of the struct that I referenced in https://github.com/getml/reflect-cpp/issues/68#issuecomment-1981230176 got down to:
-----------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------
BM_Write 2299 ns 2302 ns 298667
BM_Read 11224 ns 11230 ns 64000
This is great already!
Without having looked at it in detail, it seems like the logic for matching a struct's field names could also be an optimization opportunity. It looks like for every field of a struct there's an iteration through all of the current map's bytes (https://github.com/getml/reflect-cpp/blob/54785ec893c72efc0d167460c6a63d616698c816/include/rfl/cbor/Reader.hpp#L63-L82). Would it make sense to only iterate these bytes once and store the key-value pairs in a std container? At least that sounds faster than parsing the bytes each and every time.
@zann1x , the function get_field is only used for the so-called tagged unions, specifically to get the tags.
https://github.com/getml/reflect-cpp/blob/main/docs/variants_and_tagged_unions.md
Since the tags are usually the first field, this isn’t much of a problem.
The function used to iterate through a map in the new code is to_view:
A view is a named tuple that contains the field names and pointers to the original fields of the underlying struct.
The view reader then directly reads and parses the object onto the struct. In other words, the resulting machine code should be very similar to hand-written code.
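A stripped-down illustration of such a view (hypothetical types, not reflect-cpp's actual ones):

```cpp
#include <string_view>
#include <tuple>

struct Person {
    int age = 0;
    double weight = 0.0;
};

// A "view" pairs each field name with a pointer into the original
// struct, so a reader can parse values straight into the struct's
// memory without any intermediate representation.
template <class T>
struct Field {
    std::string_view name;
    T* ptr;
};

inline auto to_view(Person& p) {
    return std::make_tuple(Field<int>{"age", &p.age},
                           Field<double>{"weight", &p.weight});
}
```

Writing through the pointers mutates the original struct directly, which is what makes the generated code comparable to a hand-written reader.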
@zann1x , I have now written a more comprehensive implementation which is much closer to production-readiness in terms of its code quality. Here are the latest benchmarks:
Run on (4 X 48 MHz CPU s)
Load Average: 0.30, 0.24, 0.42
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------
BM_Write_tinycbor 96.0 ns 96.0 ns 6411465
BM_Write 161 ns 161 ns 4353111
BM_Read_tinycbor_unsafe 264 ns 264 ns 2649367
BM_Read_tinycbor_safe_with_field_names 331 ns 331 ns 2116116
BM_Read 340 ns 340 ns 2051511
I think we are pretty close at this point.
That's nice to hear.
Unfortunately, it's still too slow for my use case 😞 The upper time limit for reading the struct from https://github.com/getml/reflect-cpp/issues/68#issuecomment-1989313539 would be 3500 ns. But at this point I guess it's more a limitation of tinycbor than of reflect-cpp.
I see. Thanks for flagging this anyway.
I think that going forward, I will introduce more formalized benchmarking along the lines you have laid out. I think it will be very interesting to see, not only in terms of monitoring the overhead of our own API but also in terms of comparing the different formats.
I'm curious about your use case: Does it have to be CBOR? Do you think that another CBOR library is considerably faster than TinyCBOR? If so, I might consider using that instead...
I think that going forward, I will introduce more formalized benchmarking along the lines you have laid out. I think it will be very interesting to see, not only in terms of monitoring the overhead of our own API but also in terms of comparing the different formats.
Yup, that makes a lot of sense 👍🏻
Does it have to be CBOR?
No, it doesn't have to be CBOR. I just think that CBOR seems to be a well-designed format. It offers the advantages of a binary format while also maintaining some kind of readability (regarding key-value pairs) when pushing the bytes directly into a debug log. Also, I like the fact that it is standardized via RFC 8949.
Do you think that another CBOR library is considerably faster than TinyCBOR? If so, I might consider using that instead...
In some tests I made, I felt like the streaming API of https://github.com/PJK/libcbor was slightly faster than tinycbor. But it wasn't quite as convenient to use.
In some tests I made, I felt like the streaming API of https://github.com/PJK/libcbor was slightly faster than tinycbor. But it wasn't quite as convenient to use.
Interesting. It appears to be on vcpkg as well.
I really don't care about convenience here...the point of reflect-cpp is to abstract away these kinds of things.
When you say "slightly faster"...how much faster is it? Do you still have the numbers?
When you say "slightly faster"...how much faster is it? Do you still have the numbers?
IIRC it was ~10% faster on the writing side and ~40% faster on the reading side. But again, all of these tests were made without checks for field order etc.
I am going to close this in favour of #70
In some tests with CBOR I have noticed that reflect-cpp adds a significant overhead over tinycbor. This makes it less appealing to use this library in performance critical environments.
E.g., a minimal struct has the following benchmark results:
I've tried replacing usages of Box with stack objects in CBOR's Writer (see https://github.com/zann1x/reflect-cpp/commit/9addb7cfadce0247ea27fd20999bdcc04c83778c). This improves performance a bit, but there is still a 10x difference in comparison to tinycbor.