Open · rbx opened this issue 1 year ago
@ktf Can you provide more context here? What are you running, how many messages, and what are their sizes? What impact does this have?
Should the message count be a run-time or compile-time argument? e.g.

```cpp
std::vector<MessagePtr> CreateMessages(size_t n, std::vector<size_t> const & sizes, std::vector<Alignment> const & alignments);
// preconditions:
assert(n == sizes.size());
assert(n == alignments.size());
```

vs

```cpp
template <size_t N>
std::array<MessagePtr, N> CreateMessages(std::array<size_t, N> const & sizes, std::array<Alignment, N> const & alignments); // or spans
```
I guess, in case of a failure to create a single msg, the whole call is expected to fail?
I guess, you are asking for a bulk api on a single transport only?
This is ITS readout, but it's a general issue IMHO. Whenever we use shallow copies to increase parallelism on the same data when running in shared memory, we need to copy messages one by one, as done in:
https://github.com/AliceO2Group/AliceO2/blob/dev/Framework/Core/src/DataProcessingDevice.cxx#L637
It would be good if there was a way to reduce the impact of the mutex.
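For illustration, a minimal sketch of the per-message copy pattern referred to above, using the generic FairMQ message API (this assumes a recent FairMQ where `fair::mq::Message::Copy` takes a `const Message&`; the function name `CopyAll` and the surrounding setup are illustrative, not taken from the linked code):

```cpp
#include <fairmq/TransportFactory.h>

#include <vector>

// Shallow-copy every message of a multipart onto new message objects.
// Each Copy() of a shared-memory message goes through the transport's
// internal synchronization, which is the per-message cost discussed here.
std::vector<fair::mq::MessagePtr> CopyAll(fair::mq::TransportFactory& transport,
                                          std::vector<fair::mq::MessagePtr> const& inputs)
{
    std::vector<fair::mq::MessagePtr> copies;
    copies.reserve(inputs.size());
    for (auto const& msg : inputs) {
        auto copy = transport.CreateMessage(); // empty message on the same transport
        copy->Copy(*msg);                      // shallow copy: shares the payload, bumps a ref count
        copies.push_back(std::move(copy));
    }
    return copies;
}
```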
> Should the message count be a run-time or compile-time argument?

Runtime.

> I guess, in case of a failure to create a single msg, the whole call is expected to fail?

I would say so.

> I guess, you are asking for a bulk api on a single transport only?

You mean that all the messages of the bulk are created on the same transport? Yes. See the linked line above for what I actually want to optimise.
Since this mutex is internal to the Boost shm library, we can only avoid it by making one big allocation from the Boost shmem segment and then chunking it up into the requested fmq messages. However, when the fmq messages are destroyed, we would have to track and delay the deallocation of that Boost shmem segment allocation until all messages that the large allocation was chunked into are destroyed.

This will not be trivial, and we will have to see in measurements how it compares performance-wise. It could be that we only shift the overhead to message destruction time, which may of course still help your use case. But just to set the expectation right.
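A minimal sketch of that chunk-and-delayed-deallocation idea (purely illustrative, not FairMQ code; `shm_free` stands in for whatever releases the Boost shmem allocation, and in the real shared-memory case the reference count itself would have to live in shared memory):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// One big allocation is split into chunks; the underlying buffer is released
// only when the last chunk is destroyed. std::shared_ptr's aliasing constructor
// provides the delayed-deallocation bookkeeping in this single-process sketch.
std::vector<std::shared_ptr<std::byte>> ChunkAllocation(std::byte* big, // one large shmem allocation
                                                        std::vector<std::size_t> const& sizes,
                                                        void (*shm_free)(std::byte*))
{
    // 'owner' holds the whole allocation; its deleter runs once all chunks are gone.
    std::shared_ptr<std::byte> owner(big, shm_free);

    std::vector<std::shared_ptr<std::byte>> chunks;
    chunks.reserve(sizes.size());
    std::size_t offset = 0;
    for (auto size : sizes) {
        // Aliasing constructor: shares ownership with 'owner', but points into it.
        chunks.emplace_back(owner, big + offset);
        offset += size;
    }
    return chunks;
}
```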
It would be a different story, of course, if Boost shmem also provided a bulk allocation API, but I don't know (yet) whether it does.
There is a bulk API, but it may have fragmentation effects (positive or negative), as it will attempt to put all buffers contiguously: https://www.boost.org/doc/libs/1_83_0/doc/html/interprocess/managed_memory_segments.html#interprocess.managed_memory_segments.managed_memory_segment_advanced_features.managed_memory_segment_multiple_allocations
Need to also see how it deals with (custom) alignment.
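For reference, a minimal sketch of what the linked Boost.Interprocess bulk API looks like, based on the documented `allocate_many`/`multiallocation_chain` interface (how it interacts with custom alignment is exactly the open question above):

```cpp
#include <boost/interprocess/managed_shared_memory.hpp>

#include <new>      // std::nothrow
#include <vector>

namespace bip = boost::interprocess;

int main()
{
    bip::managed_shared_memory segment(bip::open_or_create, "bulk_demo", 1 << 20);

    // Allocate 16 buffers of 100 bytes each in a single call into the segment.
    // The segment tries to place them contiguously (not guaranteed), which is
    // where the fragmentation effects (positive or negative) come from.
    bip::managed_shared_memory::multiallocation_chain chain;
    segment.allocate_many(std::nothrow, 100, 16, chain);
    if (chain.empty()) { return 1; } // bulk allocation failed

    // Drain the chain into plain pointers before using the buffers.
    std::vector<void*> buffers;
    while (!chain.empty()) {
        buffers.push_back(chain.pop_front());
    }

    for (void* p : buffers) {
        segment.deallocate(p);
    }

    bip::shared_memory_object::remove("bulk_demo");
    return 0;
}
```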
Ok, but how about we simplify the problem? In the end what I need to do is to shallow copy messages. How about we optimise that case?
> Ok, but how about we simplify the problem? In the end what I need to do is to shallow copy messages. How about we optimise that case?

Ya, this is what I am still wondering: why do you see an allocation at all? I guess you are hitting this case?
> Ok, but how about we simplify the problem? In the end what I need to do is to shallow copy messages. How about we optimise that case?

> Ya, this is what I am still wondering: why do you see an allocation at all? I guess you are hitting this case?

Hmm, even if we speculatively pre-allocate those 2-byte buffers to pay the overhead only once every couple of `shmem::Message::Copy()`s, I guess we will still need a bulk `Copy()` API, because we would need to synchronize access to the pre-allocated refcount buffer pool as well, right? Maybe we could get away with one or two atomic integers for this?
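A very rough sketch of the "one or two atomic integers" idea (hypothetical, not existing FairMQ code; it only illustrates handing out pre-allocated refcount slots without taking the segment mutex, and leaves out slot reuse and the placement of the atomics in shared memory):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// A fixed-capacity pool of refcount slots carved out of a single allocation.
// Handing out a slot is one atomic fetch_add instead of a segment-wide lock.
struct RefCountPool {
    std::atomic<std::uint32_t>* slots;  // pre-allocated array (in shared memory in the real case)
    std::size_t capacity;
    std::atomic<std::size_t> next{0};   // bump index over the pre-allocated slots

    // Returns a fresh refcount slot initialized to one owner, or nullptr if exhausted.
    std::atomic<std::uint32_t>* Acquire()
    {
        std::size_t idx = next.fetch_add(1, std::memory_order_relaxed);
        if (idx >= capacity) { return nullptr; }
        slots[idx].store(1, std::memory_order_relaxed);
        return &slots[idx];
    }
};
```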
Yes, I suppose we are hitting that case here, assuming UR means UnmanagedRegion. Are those allocations in shared memory? That could explain the fragmentation as well, no?
Yes, they are in shared memory. Certainly they contribute to overall fragmentation.
Either we make a bulk call for Copy that stores them together, or, since in this case we know the exact size of the data we need (a header to store the ref count...but still an unknown number of them), we could make use of a pool allocator.
I would definitely use a pool allocator for those. We probably have thousands of them being created for each timeframe. It would actually be good if we could also have a bulk allocator for messages below roughly 256 bytes, since every other message in our data model is an approximately 80-byte header.
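One off-the-shelf way to get such a pool inside a managed segment is Boost.Interprocess's node allocators, which serve fixed-size nodes out of larger blocks; a minimal sketch of the idea (not what FairMQ ended up implementing, just an illustration):

```cpp
#include <boost/interprocess/allocators/node_allocator.hpp>
#include <boost/interprocess/managed_shared_memory.hpp>

#include <cstdint>

namespace bip = boost::interprocess;

int main()
{
    bip::managed_shared_memory segment(bip::open_or_create, "pool_demo", 1 << 20);

    // Pooled allocator for small fixed-size objects (here: 32-bit ref counts).
    // Nodes are carved out of larger blocks, so thousands of tiny allocations
    // do not each pay the full segment bookkeeping and fragmentation overhead.
    using RefCountAllocator =
        bip::node_allocator<std::uint32_t, bip::managed_shared_memory::segment_manager>;
    RefCountAllocator alloc(segment.get_segment_manager());

    auto rc = alloc.allocate(1); // one ref-count slot from the pool (offset_ptr)
    *rc = 1;
    alloc.deallocate(rc, 1);

    bip::shared_memory_object::remove("pool_demo");
    return 0;
}
```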
Small update on this. In the dev branch there is now a first implementation using pools.

For reference counting messages from an UnmanagedRegion, it will internally store the ref counts in a separate segment (one per UnmanagedRegion) and use a pool to track them. Technically this still synchronizes on every message, but it should be faster since it (1) is a separate segment from the main data and (2) uses a pool. It should also certainly improve fragmentation in the main data segment, as nothing is allocated there anymore for UR messages. The segment size is 10,000,000 bytes by default, which should be good for up to 5,000,000 messages (probably a bit less). If you expect that many or more, you can bump the size in the region configuration: https://github.com/FairRootGroup/FairMQ/blob/dev/fairmq/UnmanagedRegion.h#L137.
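If the size does need bumping, the configuration hook referenced above would be used roughly as follows. This is only a sketch: the field name `rcSegmentSize` is an assumption taken from the linked header location and should be checked against the actual `fair::mq::RegionConfig` definition in your FairMQ version.

```cpp
#include <fairmq/UnmanagedRegion.h>

// Sketch only: enlarge the per-region refcount segment before creating the region.
// The field name below is assumed from the linked UnmanagedRegion.h; verify it there.
fair::mq::RegionConfig cfg;
cfg.rcSegmentSize = 50'000'000; // default is 10'000'000 bytes (roughly 5M messages)
// ...then pass cfg to the UnmanagedRegion creation call of your FairMQ version.
```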
Could you give it a try and provide some feedback on how this behaves with the bottlenecks that you observe?
A further step could be to make use of cached pools to reduce synchronization. These could be effective with a bulk message copy API.
Thanks! I will try ASAP.