Open gpalmer-latai opened 10 months ago
@gpalmer-latai it seems you want to use this for GPU computation. I've no experience with GPUs and would need to ask e.g. @elfenpiff but I'm wondering if your goals could also be achieved by extending the publisher and subscriber, e.g. by adding hooks like on_construction
, on_destruction
, on_pre_publish
, on_post_publish
, and so on. I have to think this through if it is easily feasible.
I think alternative 1 will be quite a lot of work. Funnily this is exactly the approach we are taking with the Rust implementation. There is exactly one memory pool per service. Having something like that in iceoryx would also possible but is definitely nothing that can be achieved in the short term.
Regarding option 2. With the introspection you might be able to find a memory pool for a specific service but for that you would also need to know under which user the process with the publisher/subscriber was started and you won't be able to acquire the base pointer. What exactly do you hope to achieve with the information from the introspection?
In the short term something like your proof of concept should be possible. Maybe with a more constraint API but that is more of a detail. My preferred solution would be the refactoring of the posh layer similar to the Rust implementation where we experimented with these ideas and have good experience.
Having on_construction
instead of exposing getMemPoolInfo
would work nicely for me yes. That only really avoids exposing the last layer of lifting I did (and of course one could always pass a lambda that lifts the info out themselves).
To give you a bit more context on my use case - I'm working on a platform with shared graphics memory. That means that the CPU and GPU can access the same memory. But this doesn't just work out of the box - you have to first call an NVIdia API - specifically cudaHostRegister. This API takes a simple void*, size, and some flags and what it effectively does is "pins" the memory - page-locking it. I'm fuzzy on the details beyond that but I believe this has implications on the TLB and associated memory hardware and the reason why I want to restrict the range of memory I call this on as much as possible is because, quoted from the above documentations:
Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.
Furthermore, I cannot simply call this API every time I receive a message sample because it is a high latency operation, so ideally I want to make all the appropriate calls once early on when initializing the system.
Anyway. My plan is to get a short term solution working for my needs and I would very happy if we could work out a reasonable way to do this upstream here. But I also agree with you that the proper long term solution is what you have already described in the Rust implementation. I'd be interested in having a lengthier discussion sometime about the work involved with bringing that to Iceoryx (or adopting the Rust implementation). I might be able to avoid needing it now, but I suspect there will be more use cases/requirements I stumble upon where I will be really sad not to have this level of control.
There should be a way forward to bring this upstream. I think I have to take counsel with my pillow to think a little bit more about the best short term approach, especially what implication is has for the untyped API. Would you contribute the patch once we agreed on the API?
Once the announcement for the Rust implementation is out we can have a deep dive. Either as part of a developer meetup or a separate meeting. I guess @elfenpiff is also eager to participate.
Yeah I should have some latitude to work on the patch.
Yeah I should have some latitude to work on the patch.
Hahaha, great pun :)
I had a few thought on this topic and currently I favor to continue on the idea we had when PortConfigInfo
was introduced. All of this could be done in the publisher. Just for clarification, do you need to call cudaHostRegister also on the subscriber side or only on the publisher side?
I'd need the ability to map memory pools do be able to process messages as both input and output to GPU accelerated algorithms. So yes, subscriber too.
With the portconfiginfo - you could specify a different memory type for a port, but then how does this hook in to facilitating different hardware. I presume iceoryx isn't going to pull in Nvidia dependencies, so you'd need like the ability for the user to register behavior globally for different memory types.
I've to ask the author of that code what the exact idea behind PortConfigInfo
but I thought about either having a compile time switch to add the CUDA dependency or maybe a similar mechanism like with the platform where one can override the path to some hooks.
I like the idea of a compile time switch, possibly coupled with some bazel select statements. Then just have a compile time path that either calls the CUDA API or returns an error when a specific memory type is specified in the port config.
One thing to note though is that it appears the port config / memory_info will apply to the entire memory segment and it is currently not possible to configure multiple segments differing only by what type of memory they support.
We'd want to probably extend the Roudi config this way so that a publisher with a port config specifying "pinned host memory" would be told to use the corresponding segment.
That's basically the idea. The end user would still not have easy access to the mempool and therefore also cannot easily tamper with it but it gives other projects the option to extend the framework.
Yes, mid to long term the config should be extended. How to do it exactly might be something to discuss in a meetup. Maybe together with the discussion on how to bring the improvements of the Rust based implementation to the C++ implementation.
Yes, mid to long term the config should be extended
Actually I mean in the short/medium term. I'm really trying to avoid pinning the entire shared memory segment used for all communications just because one publisher wants to use GPU acceleration when publishing messages in one specific mempool.
If the effect of specifying memoryType
as something signifying pinned_host_memory
is that we decide to call cudaHostRegister
when constructing the mepoo_segment
somewhere here, then I'd be forced to do something like put all "GPU publishers/subscribers" in separate linux groups just to get them to use the isolated segment. And I'd like to avoid conflating the linux group / access controls use case with the memory allocation/access one.
Edit: What I linked is the codepath for Roudi setting up the shared memory in the first place. I meant to find the code path where the clients receive the info from Roudi.
Second Edit: The publisher (for example) get's this information here. It then has access through the port()->getMembers()->m_chunkSenderData.m_memoryInfo
. However I'm not sure if there exists any logic currently to do some sort of "post init" to react to that information.
What might be interesting to consider as well is adding an entry to both MePooConfig and memPoolInfo.
This would allow configuration at the memory pool level, which fits with my needs. Then we can see about adding some sort of "post_init" logic that is called after getMiddleWarePublisher
to do setup on the client side. This boil down to the MemoryManager
going through all of the memory pools and for each one with a pinned_host
MemoryType
calling cudaHostRegister
on the memory region.
And this is something that could be added to the roudi config TOML as well at some point, though is also immediately available through manual roudi configuration.
There is still a bit of awkwardness to all of these approaches in that although a segment or a memory pool might be configured as supporting "pinned host memory", really it is ultimately up to the publishers or subscribers to decide if they want to use that feature. You have to consider that there are cases where the publisher is a normal CPU publisher or the subscriber is a normal CPU subscriber. In these processes you wouldn't really want to call cudaHostRegister
at all.
So I guess, what you would do to handle this is to combine the MemPoolInfo
(or segment info) with the PortConfigInfo
to have some logic like:
// MemoryManager.getChunk()
for (auto& memPool : m_memPoolVector)
{
uint32_t chunkSizeOfMemPool = memPool.getChunkSize();
if (chunkSizeOfMemPool >= requiredChunkSize)
{
if (portConfigInfo.memoryInfo.memoryType == PINNED_HOST_MEMORY)
{
if (memPool.getInfo()->memoryType == PINNED_HOST_MEMORY)
{
chunk = memPool.getChunk();
memPoolPointer = &memPool;
aquiredChunkSize = chunkSizeOfMemPool;
break;
}
...
Note this code path is when you are retrieving a chunk which would happen much later than where you want to call cudaHostRegister
on the appropriate segment/pool.
Maybe there is a misunderstanding. What I meant was to only register the mempool the publisher and subscriber are using and not the whole segment. As long as it the memory is shared between CPU and GPU this should be sufficient and RouDi does not need to do any special handling. But I'm no GPU expert so this might be a bit naive.
So with that assumption I thought extending the config would not be required in the short term but only in the mid to long term.
We already have a concept of a shared memory provider, although only used and tested on CPU memory. This could be used to extend the memory types. The idea was to have this on segment level not on mempool level. It is quite long since we discussed about this topic though and some of our assumptions might not hold anymore.
Yes, the publisher and subscriber would opt in by specifying the desired memory type. I think that should be more or less straightforward. What might be more complicated is mixed memory mode where some subscriber work on CPU memory and other work on GPU memory. To solve this in an efficient way we might need to do a whiteboard session.
Oh, and @elfenpiff already had some ideas how to solve this in an efficient way about 4 years ago but there was never time to implement it. Maybe time has come now ... or at least in the not so distant future :smile:
Okay. I think I follow now. I guess all one needs is to specify something in the portConfigInfo
and have that consumed to make the cudaHostRegister
call 🙂 You mentioned earlier one complication though - what about untyped publishers/subscribers?
Do we simply not support those initially with this feature, since we do not know in advance which memory pools they will request?
Related to the above - I've noticed that we appear to always do a linear search for the matching memory pool every time we get a chunk. Would it make sense to optimize this out for the typed endpoints in conjunction with the memory logic that is applied only to the corresponding memory pools? (Also in general we could consider doing a binary search)
Note that there is an approach where you defer until getChunk
to do something like cuda_pinned_host_registry.ensure_pinned(memPool)
to make the call to cudaHostRegister
the first time a chunk from a MemPool is requested. This would work for untyped endpoint too.
It's a bit problematic though because that first publish/subscribe for a particular size message will incur a lot of latency. Though you could mitigate this by "priming the pump" - calling loan_message
of all the sizes you care about on system startup, dropping them, and then proceeding with business as usual.
For the untyped publisher and subscriber we have two options (three after I read your latest response)
likely_used_memory_sizes(iox::vector<uint64_t, ???> list_of_memory_sizes)
loan
/getChunk
; might be a fallback for 2 with a log message warning about the new registrationOh, the typed API would then just call likely_used_memory_sizes
on the port level
That sounds great to me. So let me see if I can summarize:
memoryType
constants somewhere that will be used to populate this field in the MemoryInfo
struct member of PortConfigInfo
. PortConfigInfo
. likely_used_memory_sizes
vector on construction.cuda_runtime_api
is available.pointer_repository
to track registration calls (we may have multiple endpoints requesting the same memory region be pinned. CUDA will set an error code the second time you try to do this so it is better to wrap such calls and cache the information - also for latency reasons). MemoryManager
, extend the API to consume the likely_used_memory_sizes
. This might be used in general for some optimization in getChunk
, but additionally if the compile time flag is active is used to call cudaHostRegister
on all appropriate memory pools.getChunk
- if the requested size matches any of the likely_used_memory_sizes
, we simply return the matching memory pool. Otherwise we do the normal linear search and additionally make the cudaHostRegister
call if the PortConfigInfo
requests it, and if the compile time flag is active. Here some remarks
iox::expected
and leave it to the user to decide what happens when there are not resources available. The builder could also be tied to the runtime or at least have the option to ask the non-default runtime for a port. There is the idea to have multiple RouDi instances running in parallel which do not interfere with each other. In order to write gateways between these instances one would need to have multiple runtimes. This is something in the far future and maybe we can get rid of RouDi in the C++ implementation but having an API which considers this option would be nice.likely_used_memory_size
. A detail which can be decided later on depending on what feels better to usecudaHostRegister
in the PublisherPortUser
not in the MemoryManager
. The MemoryManager
needs the interface from your proof of concept to return the address and size of the mempool for a specific chunk size. The getChunk
method could get an additional parameter with a hint for the index, e.g. index_of_last_used_mempool
. If the size does not match the mempool at the index the current search could be performed or alternatively a binary search. The caching of the known indices could also be done in the PublisherPortUser
since the MemoryManager
resides in the shared memory and we would need to hold a structure for (IOX_MAX_PUBLISHER + IOX_MAX_SUBSCRIBERS) * 2
(aaaargh, I used IOX_MAX_PUBLISHER to define MAX_SERVERS ... facepalm :laughing: )PublisherPortUser
since the repository for cudoHostRegister
would live in process local memory and not in the shared memory ... but maybe I overlooked somethingI think you are right about the PublisherPortUser
owning a lot of the logic 👍
We should probably reserve a range for a curated list of memory types, similar to what IANA does
That sounds reasonable. I'm probably not knowledgable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now so that users implementing their own logic downstream do so with a value out of that range.
That sounds reasonable. I'm probably not knowledgable enough to quickly come up with something myself. Perhaps we can just reserve a range of values for now so that users implementing their own logic downstream do so with a value out of that range.
Exactly. I've thought about something similar to Qt::UserRole
, which is the offset when defining own roles in Qt's model-view framework. We just need to find a good name and set it to .e.g 0x100
. This should leave us a range which is big enough for whatever we want to use. Downstream can then request to register a specific value directly in iceoryx and after a two weeks discussion on Oʻahu (downstream will take care of the hotel and traveling costs) the team will vote on the inclusion of the new value.
... aaargh, too much Qt coding :laughing: ... we can just define an enum to distinguish between user defined values and the ones defined by iceoryx
Ideally with this we could introduce a builder like API which gives us the opportunity to return an iox::expected and leave it to the user to decide what happens when there are not resources available. The builder could also be tied to the runtime or at least have the option to ask the non-default runtime for a port. There is the idea to have multiple RouDi instances running in parallel which do not interfere with each other. In order to write gateways between these instances one would need to have multiple runtimes. This is something in the far future and maybe we can get rid of RouDi in the C++ implementation but having an API which considers this option would be nice.
So is the idea that the builder is passed as an argument like PortConfigInfoBuilder
and the publisher calls create()
on it? Or is this pattern more of just a convenient way to get an object with reasonable defaults but the ability to override some if necessary?
When you talk about tying the builder to the runtime, I'm guessing you mean that instead of
m_port(iox::runtime::PoshRuntime::getInstance().getMiddlewarePublisher(service, publisherOptions))
You'd have something like
m_port(portInfo.getRuntime().getMiddlewarePublisher(service, publisherOptions, portInfo))
Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "Constructors don't return anything so you have to raise an exception" problem.
Ah no, I think what you mean is that the builder pattern lets you return an error if the parameters aren't supported, right? It's a way of solving the classic "Constructors don't return anything so you have to raise an exception" problem.
Exactly. We don't use exception and when something went horribly wrong in the ctor we currently terminate. It would be better to use the factory method pattern or the builder pattern and return an expected<Publisher>
.
Interesting. How deep are you thinking of going in this sort of refactor? It looks like the getMiddleware*
functions all emit warnings and return a nullptr, which immediately gets passed to a constructor taking a not_null
pointer...
So we could for example, also use the builder pattern for the port users. We can also refactor the getMiddleware*
functions to be of expected type. And probably there are more rabbit holes like this to jump down.
The getChunk method could get an additional parameter with a hint for the index, e.g. index_of_last_used_mempool
In order to do this we will have to expose the index somehow - maybe instead of a getMemPoolInfo
with a ChunkSettings
overload there should be something like a uint32_t getMemPoolHint(const ChunkSettings& chunkSettings) const;
This function can then both be used as a way to query MemPoolInfo
via the existing getMemPoolInfo
as well as the starting index of a binary search implementation of getChunk
.
The getChunk method could get an additional parameter with a hint for the index, e.g. index_of_last_used_mempool
In order to do this we will have to expose the index somehow - maybe instead of a
getMemPoolInfo
with aChunkSettings
overload there should be something like auint32_t getMemPoolHint(const ChunkSettings& chunkSettings) const;
This function can then both be used as a way to query
MemPoolInfo
via the existinggetMemPoolInfo
as well as the starting index of a binary search implementation ofgetChunk
.
Sounds good. Since this is no user facing API we are free to do anything :)
My idea was to call cudaHostRegister in the PublisherPortUser not in the MemoryManager
On second thought and further study of the current architecture, I think maybe it might actually make sense for the ChunkSender
to do this, since the ChunkSenderData
is where the memory info gets stored, and the ChunkSender
is responsible for
allocating and freeing memory chunks.
So basically, to avoid errors in construction we use the builder type for the chunk sender also, or we alternatively add a tryAllocateHint
method which would be used by the chunkSender
to both get the index of the matching mempool and cache it, as well as do the cuda stuff if appropriate.
For simplicity it probably suffices to make the hint that the chunkSender
stores be a single tuple of ChunkSettings
to uint32_t
which is just a lastUsedHint
. If the requested chunk settings change, the sender gets a new hint, attempts to register with CUDA again, and uses the new hint to get a chunk.
The CUDA calls themselves will be cached by the global registry so even if we end up back at the same hint it will become a noop.
Question about the pointer repository though - is this class thread-safe? I don't see any sort of synchronization around registering a new segment. I assume this is because there is no expectation that this would ever be called from different threads?
It's not practically a problem for my particular use case, but it would be problematic to introduce a code path that is able to be called at any time a publisher or subscriber takes or receives a message, that is not thread safe.
Interesting. How deep are you thinking of going in this sort of refactor? It looks like the
getMiddleware*
functions all emit warnings and return a nullptr, which immediately gets passed to a constructor taking anot_null
pointer...So we could for example, also use the builder pattern for the port users. We can also refactor the
getMiddleware*
functions to be of expected type. And probably there are more rabbit holes like this to jump down.
The whole runtime could be refactored if there is time. The runtime belongs to a few remaining old constructs which have their roots in a R&D department. At that time the focus was more on trying out new things than on creating sophisticated APIs. In the meantime we also got new tools like the expected
and optional
which makes it easier to express specific ideas.
Ideally we could try to model the API in a way that it will become simpler to have the same API for the C++ bindings on the Rust basted next-gen implementation.
Question about the pointer repository though - is this class thread-safe? I don't see any sort of synchronization around registering a new segment. I assume this is because there is no expectation that this would ever be called from different threads?
It's not practically a problem for my particular use case, but it would be problematic to introduce a code path that is able to be called at any time a publisher or subscriber takes or receives a message, that is not thread safe.
There is a related issue #1701. We need to check this.
My idea was to call cudaHostRegister in the PublisherPortUser not in the MemoryManager
On second thought and further study of the current architecture, I think maybe it might actually make sense for the
ChunkSender
to do this, since theChunkSenderData
is where the memory info gets stored, and theChunkSender
is responsible forallocating and freeing memory chunks.
So basically, to avoid errors in construction we use the builder type for the chunk sender also, or we alternatively add a
tryAllocateHint
method which would be used by thechunkSender
to both get the index of the matching mempool and cache it, as well as do the cuda stuff if appropriate.For simplicity it probably suffices to make the hint that the
chunkSender
stores be a single tuple ofChunkSettings
touint32_t
which is just alastUsedHint
. If the requested chunk settings change, the sender gets a new hint, attempts to register with CUDA again, and uses the new hint to get a chunk.The CUDA calls themselves will be cached by the global registry so even if we end up back at the same hint it will become a noop.
I need to think more about this. Unfortunately I'm traveling at the end of the week till next week and won't have much time. Please ping me again for this if I forget to respond.
That's okay. I'm also traveling as well until next Wednesday (It is a holiday weekend in the U.S.). Funnily enough - I am flying to Oahu tomorrow.
There is a related issue https://github.com/eclipse-iceoryx/iceoryx/issues/1701. We need to check this.
So I've thought about it and it seems possible but very difficult to create a situation where one calls e.g. RegisterPtr
from multiple threads. You'd have to be creating shared memory pools in different threads and registering them concurrently, for some reason, without knowing in advance what id you want to assign them.
For this CUDA registry we can possibly get around this problem by registering with a combination of the memory pool id and the segment id.
As for the "duplicate registration" race condition. If this occurs CUDA will set an error code along the lines of "This overlaps with some memory that is already registered". We could choose to ignore that one... though it hides other logic errors like actually accidentally overlapping memory pool info.
That's okay. I'm also traveling as well until next Wednesday (It is a holiday weekend in the U.S.). Funnily enough - I am flying to Oahu tomorrow.
It's been a while since I've been there. Unfortunately it's a bit far away from my place. Have fun :)
@MatthiasKillat in case you are interested in the CUDA stuff :)
So I've started prototyping on this new solution and have encountered an obstacle I don't think we've thought through yet.
For subscribers - it appears as if the subscriber does not know anything about memory pools or chunks until a message has been delivered, or more precisely, pointer information to a shared chunk (through which it can figure out where the memPool
is to call freeChunk
eventually).
Do you have any thoughts on an API that would allow a subscriber to figure out which memory pool messages will come from before it starts receiving messages? I'd like to avoid in the happy path making calls to cudaHostRegister
only after the first message is about to be processed, as this might blow latency budgets on the first iteration of any GPU processing work.
One thing that comes to mind that might be somewhat odd but would at least be relatively "safe" would be adding a const RelativePointer<const mepoo::MemoryManager> m_memoryMgr;
to the chunk_receiver
. This would give the subscriber read-only access to the memPool
information which could let it pass some size information to get access to the MemPoolInfo
using only const
API's of the memory manager.
Ah... but this approach runs into one more problem. Because of access controls we don't actually know which segment a message is going to come from, and therefore cannot acquire the memoryManager information the way the publisher does
What about this alternative approach - We leverage the fact that the PoshRuntime
has access to the segment manager via the IpcRuntimeInterface
. The segment manager in turn has a public API to return all the SegmentMappings. Interestingly here we also have the MemoryInfo struct that is key to supporting this feature.
My thought is - what if we extend this SegmentMapping
struct to include a const pointer to the memory manager for the segment? This would expose the getMemPoolInfo
and getMemPoolHint
methods, thus allowing with my other proposed changes above for the runtime itself to have access to all of the memory managers.
Then, when creating a subscriber, the runtime can simply do an additional step to use either a template type size info or some size hints in the untyped API to find all memory pools the subscriber may read from and preemptively pin them.
In addition to this, we would still do a check when dequeuing a message in the subscriber that the owning memory pool has been pinned, but this check would likely be a noop unless the message we received is a different size than we expected.
I've been busy with other stuff the last days. Will have a look at your ideas tomorrow.
Another idea along this line - if I expose the ability for the PoshRuntime
itself to iterate over SegmentMappings, and the segment mappings in exchange exposed info about individual memory pools, then perhaps we could allow configuring MemoryInfo at the MemPool level such that the following behavior happens:
cudaHostRegister
on every memory pool with memoryType=CUDA_HOST_PINNED
PortConfigInfo
. Following 1, this memory will have already been pinned.I'm currently making me familiar with the code again. Regarding your last idea. If the number of pinned MemPools need to be minimal, this would not work since we do not know which MemPool a publisher or subscriber will use. We would basically need to pin the whole shared memory segment. Maybe I misunderstood your idea
Ah no, you are right and I got too ahead of myself here.
We of course do not want to go around pinning every single memory pool that might "support it" in every single process because each given process probably only really cares about sending/receiving messages on one or a few memory pools.
So I'll backtrack again to the idea that, when creating a subscriber, the runtime goes through each data segment (if there are multiple) and pins the memory pools that match
And in order to support untyped subscribers, we would still need a code path as well to check, upon receipt of a message, if this message came from a new memory pool which has not yet been registered, in which case we simply eat the cost to pin the memory at this time.
Brief feature description
I would like the ability to somehow query the info of a particular memory pool that will be used for a comms endpoint's messages.
Detailed information
A full proof of concept is linked below:
https://github.com/eclipse-iceoryx/iceoryx/compare/master...gpalmer-latai:iceoryx:gpalmer/expose_mempool
The basic gist is that I've added a
m_basePtr
to theMemPoolInfo
struct to inform consumers about where the actual memory pool is located. Then I've exposed some APIs to be able to lift that information all the way out to a typed publisher.Alternatives Considered
Instead of lifting out the memory pool information through the layers of port info, it might be nice/favorable if we could