Overhead over raw API - Githubissues

zann1x commented 4 months ago

In some tests with CBOR I have noticed that reflect-cpp adds a significant overhead over tinycbor. This makes it less appealing to use this library in performance critical environments.

E.g., a minimal struct

struct Minimal {
  double foo;
  double bar;
  double baz;
  double qux;
};

has the following benchmark results:

~/projects/reflect-cpp$ ./build/linux-release/reflectcpp-bm
2024-03-04T21:29:54+01:00
Running ./build/linux-release/reflectcpp-bm
Run on (4 X 400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 512 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 0.06, 0.05, 0.01
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_CBOR_Write               160 ns          160 ns      4360721
BM_CBOR_Read                427 ns          427 ns      1645479
BM_CBOR_Write_tinycbor     28.9 ns         28.9 ns     24334360
BM_CBOR_Read_tinycbor       225 ns          225 ns      3099686

I've tried replacing usages of Box with stack objects in CBOR's Writer (see https://github.com/zann1x/reflect-cpp/commit/9addb7cfadce0247ea27fd20999bdcc04c83778c). This improves performance a bit, but there is still a 10x difference in comparison to tinycbor.

liuzicheng1987 commented 4 months ago

@zann1x , what exactly are you benchmarking here? Can you share the code? Could you also share how you compiled it (including optimizations, etc)?

I have done some benchmarks on this myself and I didn't even come close the kind of overhead you get here.

But I think it would be worthwhile to profile this using gperftools or something similar to really get to the bottom of this.

zann1x commented 4 months ago

The benchmarks can be found here: https://github.com/zann1x/reflect-cpp/commit/a7be8b0047ae289ae0f7ba825d81c8e989b2b26f I built it with

$ cmake -S . -B build/linux-release -DCMAKE_BUILD_TYPE=Release -DREFLECTCPP_CBOR=ON -DREFLECTCPP_BUILD_BENCHES=ON
$ cmake --build build/linux-release/

Admittedly, I didn't quite recreate the same calls to tinycbor as used in reflect-cpp. I've updated the benchmarks a bit to be more comparable here. The current benchmark results (with the code linked above) are

$ ./build/linux-release/benches/cbor/reflect-cpp-cbor-benches
2024-03-05T22:12:25+01:00
Running ./build/linux-release/benches/cbor/reflect-cpp-cbor-benches
Run on (4 X 400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 512 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 0.40, 0.31, 0.18
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_Write                 260 ns          260 ns      2693909
BM_Write_tinycbor        107 ns          107 ns      6561436
BM_Read                  982 ns          982 ns       713135
BM_Read_tinycbor         284 ns          284 ns      2466709

liuzicheng1987 commented 4 months ago

@zann1x , interesting, thank you so much for sharing.

I will play with this, but one of the things that I immediately noticed is that you are not doing any error checking on the read operation in your raw TinyCBOR. This creates an unfair advantage for your raw TinyCBOR.

I don’t think you should ever skip error handling for reads in production code, otherwise your program would segfault if the input file does not conform to your expectations. It is strongly recommended by the TinyCBOR documentation:

“Error conditions must not be ignored. All decoder functions have undefined behavior if called after an error has been reported, and may crash.”

https://intel.github.io/tinycbor/current/a00047.html

Likewise, you should check your types before reading them:

“Some functions are also documented to have preconditions, like cbor_value_get_int() requiring that the input be an integral value. Violation of preconditions also results in undefined behavior and the program may crash.”

The other thing is that our read code is a bit more tolerant about types. So for instance, if the value in the input file were an integer it would just take that and cast it to a float.

I don’t think this fully explains the difference, but it is certainly part of the explanation.

Another comment I have is that your test struct is extremely small. It may very well be the case that certain one-time operations, such as setting up the reader, account for most of the runtime here.

Maybe we could also try to create a vector of 10,000 structs and repeatedly serialize and deserialize that. I think if we do that, the benchmark results would look different as well.

But I will try to build and recreate your benchmarks. I think a profiling will help us get to the bottom of this.

liuzicheng1987 commented 4 months ago

@zann1x , great job on setting up the build pipeline for the benchmarks. I was able to reproduce your results with minimal effort.

I have noticed another aspect in which your read benchmark is not quite fair: You never check the field names, you just read in the fields assuming that they are in the right order.

One of the major advantages of using field names is that you do not have to make that assumption and our reader doesn't. Of course, interpreting and ordering the strings creates some runtime overhead, but it is very important. Otherwise, you should not be using structs at all...you could just use std::tuple which would be even more efficient than your solution.

But the writer benchmark is very interesting. I suspect that the performance difference is due to the fact that we unnecessary read the field names into a std::string, when a raw pointer would do the job. This is something that can be fixed, of course.

liuzicheng1987 commented 4 months ago

Using std::string_view to set the field names does indeed improve the performance of the writer. We are now creeping closer and closer to the raw implementation.

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_Write_tinycbor        103 ns          103 ns      5206079
BM_Write                 160 ns          160 ns      4326175

zann1x commented 4 months ago

I know that the benchmarks are not 100% comparable, so I don't expect an equally fast en-/decoding process. Nevertheless, seeing the huge performance difference when reading the struct with tinycbor vs reflect-cpp was very unexpected to me.

The test I have set up here also is not a representation for structs of my specific use case. They are actually much larger with several nested structs in them. Not using the automatic type reflection there would be a PITA. I simply created a minimal working example for this issue that still shows the problem. For comparison, one of the structs I use had the following benchmark results:

-------------------------------------------------------
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
BM_Write           4412 ns         4412 ns       194783
BM_Read           19457 ns        19252 ns        37333

Only being able to read ~50k elements per second currently prevents me from using reflect-cpp for production use cases.

Using std::string_view to set the field names does indeed improve the performance of the writer. We are now creeping closer and closer to the raw implementation.
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_Write_tinycbor        103 ns          103 ns      5206079
BM_Write                 160 ns          160 ns      4326175

This is very promising to see. Having such relatively minor differences would be great to have on both reading and writing.

liuzicheng1987 commented 4 months ago

@zann1x , I see.

I have just implemented a more comprehensive fix that consistently uses std::string_view to write the field names. It's still in a feature branch, but I am confident I can merge it into main fairly soon.

I will have to take a closer look at the read process over the next couple of days.

If you can think of a performance optimization as well, feel free to open a PR.

liuzicheng1987 commented 4 months ago

@zann1x , I have tried adding more safety to the code and checking the field names. None of these explains the difference.

My current hypothesis is the following: I use a memoization pattern. The first time a struct of type X is read, I build a hash map mapping the field names to their positions in the struct. But then the hash map is stored for future use. Maybe the benchmark library somehow forces us to build the hash map every time.

I will investigate further and keep you posted.

liuzicheng1987 commented 4 months ago

@zann1x , as promised, here's an update:

It appears that the overhead is mainly due to how C++ optimizes (or fails to optimize) recursive programming. Through some careful experimentation, I have managed to get overhead down to almost negligible levels:

2024-03-11T07:20:31+01:00
Running ./build/linux-release/benches/cbor/reflect-cpp-cbor-benches
Run on (4 X 48 MHz CPU s)
Load Average: 0.58, 0.44, 0.31
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
BM_Write_tinycbor                             106 ns          106 ns      5486243
BM_Write                                      161 ns          161 ns      4311409
BM_Read_tinycbor_unsafe                       263 ns          263 ns      2655862
BM_Read_tinycbor_safe_with_field_names        336 ns          336 ns      2072860
BM_Read                                       353 ns          353 ns      2003000

As you can see, the overhead of BM_Read over BM_Read_tinycbor_safe_with_field_names (which I would regard to be a fair comparison) is now less than 10%.

It's still in a feature branch, not ready for prime time and currently only works for CBOR, but if you want to give it a try yourself, it's this branch:

https://github.com/getml/reflect-cpp/tree/f/read_optimization

zann1x commented 4 months ago

Thanks for your quick responses on this issue!

After your optimizations on the writer and reader, the benchmark of the struct that I referenced in https://github.com/getml/reflect-cpp/issues/68#issuecomment-1981230176 got down to:

-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_Write               2299 ns         2302 ns       298667
BM_Read               11224 ns        11230 ns        64000

This is great already!

Without having looked at it in detail, it seems like the logic for matching a struct's field names could also be an optimization opportunity. It looks like for every field of a struct there's an iteration through all of the current map's bytes (https://github.com/getml/reflect-cpp/blob/54785ec893c72efc0d167460c6a63d616698c816/include/rfl/cbor/Reader.hpp#L63-L82). Would it make sense to only iterate these bytes once and store the key-value pairs in a std container? At least that sounds faster than parsing the bytes each and every time.

liuzicheng1987 commented 4 months ago

@zann1x , the function get_field is only used for the so-called tagged unions, specifically to get the tags.

https://github.com/getml/reflect-cpp/blob/main/docs/variants_and_tagged_unions.md

Since the tags are usually the first field, this isn’t much of a problem.

The function used to iterate through a map in the new code is to_view:

https://github.com/getml/reflect-cpp/blob/d00a195d9965888c20771fcb5844950e7b5da1ed/include/rfl/cbor/Reader.hpp#L262

A view is a named tuple that contains the field names and pointers to the original fields of the underlying struct.

The view reader then directly reads and parses the object onto the struct. In other words, the resulting machine code should be very similar to hand-written code.

liuzicheng1987 commented 4 months ago

@zann1x , I have now written a more comprehensive implementation which is much closer to production-readiness in terms of its code quality. Here are the latest benchmarks:

Run on (4 X 48 MHz CPU s)
Load Average: 0.30, 0.24, 0.42
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
BM_Write_tinycbor                            96.0 ns         96.0 ns      6411465
BM_Write                                      161 ns          161 ns      4353111
BM_Read_tinycbor_unsafe                       264 ns          264 ns      2649367
BM_Read_tinycbor_safe_with_field_names        331 ns          331 ns      2116116
BM_Read                                       340 ns          340 ns      2051511

I think we are pretty close at this point.

zann1x commented 4 months ago

That's nice to hear.

Unfortunately, it's still too slow for my use case 😞 The upper time limit for reading the struct from https://github.com/getml/reflect-cpp/issues/68#issuecomment-1989313539 would be 3500 ns. But at this point I guess it's more a limitation of tinycbor than of reflect-cpp.

liuzicheng1987 commented 4 months ago

I see. Thanks for flagging this anyway.

I think that going forward, I will introduce more formalized benchmarking along the lines you have laid out. I think it will be very interesting to see, not only on terms of monitoring the overhead of our own API but also comparing the different formats.

I'm curious about your use case: Does it have to be CBOR? Do you think that another CBOR library is considerably faster than TinyCBOR? If so, I might consider using that instead...

zann1x commented 4 months ago

I think that going forward, I will introduce more formalized benchmarking along the lines you have laid out. I think it will be very interesting to see, not only on terms of monitoring the overhead of our own API but also comparing the different formats.

Yup, that makes a lot of sense 👍🏻

Does it have to be CBOR?

No, it doesn't have to be CBOR. I just think that CBOR seems to be a well designed format. It is a nice tradeoff between the advantages of a binary format while also maintaining some kind of readability (regarding key-value pairs) when pushing the bytes directly into a debug log. Also, I like the fact that it is standardized via RFC 8949.

Do you think that another CBOR library is considerably faster than TinyCBOR? If so, I might consider using that instead...

In some tests I made, I felt like the streaming API of https://github.com/PJK/libcbor was slightly faster than tinycbor. But it wasn't quite as convenient to use.

liuzicheng1987 commented 4 months ago

In some tests I made, I felt like the streaming API of https://github.com/PJK/libcbor was slightly faster than tinycbor. But it wasn't quite as convenient to use.

Interesting. It appears to be on vcpkg as well.

I really don't care about convenience here...the point of reflect-cpp is to abstract away these kind of things.

When you say "slightly faster"...how much faster is it? Do you still have the numbers?

zann1x commented 4 months ago

When you say "slightly faster"...how much faster is it? Do you still have the numbers?

IIRC it was ~10% faster on the writing side and ~40% faster on the reading side. But again, all of these tests were made without checks for field order etc.

liuzicheng1987 commented 4 months ago

I am going to close this in favour of #70

getml / reflect-cpp

Overhead over raw API #68