eclipse-cyclonedds / cyclonedds-cxx

Some questions about c++ performance #382

Open TTT-321 opened 1 year ago

TTT-321 commented 1 year ago

In the C performance test the sample buffers are pre-allocated, but in the C++ performance test they are allocated dynamically according to the number of samples obtained from the rhc (reader history cache), which causes a lot of new/delete operations. What are the considerations here? Can we pre-allocate space to improve execution efficiency? As can be seen from the flame diagram, the CPU consumption of taking samples in C++ is much higher than that of taking samples in C. The big difference is in how the number of rhc samples is obtained and how the space is allocated. C++, which deserializes samples before saving them, should take one less step than C does, but the actual performance of C is much better.

In addition, I found that when using a listener, the wrapper function also takes up a considerable amount of CPU time. Is there room for optimization here?

eboasson commented 1 year ago

I think all the points you raise are good ones. At least on Linux, based on how fast memory allocation is, I would think that whether or not you pre-allocate makes a noticeable but not really significant difference. I may be wrong there. The C++ API certainly allows for pre-allocation, so it might be worth using that here.
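A minimal sketch of the difference between allocating on every take and reusing a caller-owned buffer. `Sample`, `MockReader`, `take_allocating` and `take_into` are hypothetical stand-ins invented for this illustration; they are not the cyclonedds-cxx API, and the real reader internals will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-ins; none of these names come from cyclonedds-cxx.
struct Sample { std::uint32_t seq; };

class MockReader {
public:
    // Simulates the middleware delivering one sample into the cache.
    void deliver(std::uint32_t seq) { cache_.push_back(Sample{seq}); }

    // Allocating variant: sizes and builds a fresh vector on every call.
    std::vector<Sample> take_allocating() {
        std::vector<Sample> out(cache_.begin(), cache_.end());
        cache_.clear();
        return out;
    }

    // Pre-allocated variant: fills a buffer owned by the caller and
    // returns how many samples were written. No heap allocation here.
    std::size_t take_into(std::vector<Sample>& buf) {
        assert(!buf.empty());  // caller must size the buffer up front
        std::size_t n = 0;
        while (n < buf.size() && !cache_.empty()) {
            buf[n++] = cache_.front();
            cache_.pop_front();
        }
        return n;
    }

private:
    std::deque<Sample> cache_;
};
```

With the second variant the caller creates the buffer once, outside the receive loop, and passes the same buffer to every call, so the per-call cost is limited to copying the samples.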

> As can be seen from the flame diagram, the CPU consumption of taking samples in C++ is much higher than that of taking samples in C.

Did you forget to attach the flame diagram? It doesn't matter so much, I can follow the argument also without the picture. I am not even sure whether deserialising samples eagerly like in C++ gives much benefit, because then you end up relying on a copy constructor. Now there is an elegant solution for that (return pointers to samples owned by the middleware, i.e., borrow them, i.e. use loans), and that should be possible using the LoanedSamples type.
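The borrowing idea can be sketched without any DDS library. Everything below (`MockCache`, `Loan`, `read_copy`, `read_loan`) is a hypothetical model written for this illustration and is not the LoanedSamples implementation: the cache keeps ownership of the samples, a copy-based read duplicates every payload, and a loan only hands out a pointer and a count that are returned when the loan object goes out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
struct Sample { std::string payload; };

class MockCache;

// RAII loan: refers to cache-owned samples and gives them back on destruction.
class Loan {
public:
    Loan(MockCache* owner, const Sample* first, std::size_t count)
        : owner_(owner), first_(first), count_(count) {}
    Loan(const Loan&) = delete;
    Loan& operator=(const Loan&) = delete;
    Loan(Loan&& other) noexcept
        : owner_(other.owner_), first_(other.first_), count_(other.count_) {
        other.owner_ = nullptr;  // the moved-from loan no longer returns anything
    }
    ~Loan();

    std::size_t size() const { return count_; }
    const Sample& operator[](std::size_t i) const {
        assert(i < count_);
        return first_[i];
    }

private:
    MockCache* owner_;
    const Sample* first_;
    std::size_t count_;
};

class MockCache {
public:
    void store(std::string payload) {
        assert(outstanding_ == 0);  // growing the vector would invalidate loans
        samples_.push_back(Sample{std::move(payload)});
    }
    // Copy-based read: every payload is copy-constructed for the caller.
    std::vector<Sample> read_copy() const { return samples_; }
    // Loan-based read: no sample is copied, only a pointer and a count.
    Loan read_loan() {
        ++outstanding_;
        return Loan(this, samples_.data(), samples_.size());
    }
    int outstanding() const { return outstanding_; }

private:
    friend class Loan;
    std::vector<Sample> samples_;
    int outstanding_ = 0;
};

inline Loan::~Loan() {
    if (owner_ != nullptr) --owner_->outstanding_;
}
```

The trade-off in this model is that the cache cannot modify its storage while a loan is outstanding, which is why `store` asserts that no loans are active.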

> In addition, I found that when using a listener, the wrapper function also takes up a considerable amount of CPU time. Is there room for optimization here?

The overhead in the listener wrapper is also more than I'd like, I'm sure it can be made faster ...

> but the actual performance of C is much better.

This seems to be the case in more situations than one would expect. Now of course the C++ binding has a handicap, in that it wraps around the C API, but that should have a negligible effect. What you're probably also seeing is that the C++ binding simply gets less love from the key developers. Certainly in my case, that's in no small part because I just don't like the language ...

Thanks to you (and others) measuring and pointing out deficiencies, it does improve. If you happen to spot something and have an idea for improving it, please feel free to suggest it!

TTT-321 commented 1 year ago

@eboasson I find that when using a listener, each call to take() queries the actual number of samples in the rhc in order to allocate space for them. In take() the following is executed: c_sample_pointers_size = dds_reader_lock_samples(ddsc_entity); In fact, c_sample_pointers_size is 1 every time, which is why I want to pre-allocate space and avoid this operation. When I set c_sample_pointers_size = 1 and skip reading the rhc sample count, the throughput rises from the original 280Mb/s to 400Mb/s, although this is still a big gap from the C throughput (600Mb/s).

When using polling and a waitset, dds_reader_lock_samples(ddsc_entity) returns a value greater than 1, but the code has a fixed maxsize = 100. When I make this 100 bigger, throughput improves. I don't know what this 100 means.
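One possible reading, which the thread does not confirm, is that the 100 acts as an upper bound on the number of samples returned by a single take call. Under that assumption, a rough model of why raising it could help is that a backlog of pending samples needs fewer calls to drain, so any fixed per-call overhead is paid less often. `calls_to_drain` is a hypothetical helper written for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical model: each take() returns at most `max_per_call` samples.
// Returns how many take() calls are needed to drain `pending` samples.
std::size_t calls_to_drain(std::size_t pending, std::size_t max_per_call) {
    assert(max_per_call > 0);  // a zero cap would never make progress
    std::size_t calls = 0;
    while (pending > 0) {
        pending -= std::min(pending, max_per_call);
        ++calls;
    }
    return calls;
}
```

In this model a backlog of 1000 samples takes 10 calls with a cap of 100 and 2 calls with a cap of 500. Whether that matches the measured throughput gain depends on how much of the cost is really per call rather than per sample.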