Reconsider the IMedia design to support low-copy/zero-copy rx/tx operation

pavel-kirienko commented 7 months ago

https://github.com/OpenCyphal-Garage/libcyphal/pull/343#discussion_r1573289494

pavel-kirienko commented 6 months ago

Thinking aloud.

At the moment, we're focusing only on transmission. For reception, see the LibCyphal design document, section "Subscriber".

In line with what we discussed on the forum a while back, we could make IMedia provide a memory resource for serving dynamic memory for message objects:

    virtual std::pmr::memory_resource& getMemoryResource() = 0;

Then, as also covered on the forum, we expose some easy-to-use handle at the top layer of the library that the client can use to allocate messages to be transmitted; that handle would (through indirection perhaps) eventually invoke the above pure virtual method implemented by the media layer. The simplest possible implementation would simply return something like the new_delete_resource, as covered by Scott on the forum; a more advanced one could leverage specific memory regions that are DMA-reachable or are otherwise advantageous for storing the data to be transmitted.
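A minimal sketch of the idea, assuming C++17 `std::pmr`. Only `getMemoryResource()` comes from the proposal above; `HeapMedia`, `ArenaMedia`, and `allocateMessageStorage` are hypothetical names used for illustration, and the fixed arena merely stands in for a DMA-reachable region.

```cpp
#include <cstddef>
#include <memory_resource>

// Trimmed-down media interface; everything except the getter is omitted.
class IMedia
{
public:
    virtual ~IMedia() = default;
    virtual std::pmr::memory_resource& getMemoryResource() = 0;
};

// Simplest possible media: all requests go to the global new/delete heap.
class HeapMedia final : public IMedia
{
public:
    std::pmr::memory_resource& getMemoryResource() override
    {
        return *std::pmr::new_delete_resource();
    }
};

// A media backed by a fixed arena (a stand-in for a DMA-reachable region).
// The upstream is null_memory_resource, so exhaustion throws std::bad_alloc
// instead of silently falling back to the heap.
template <std::size_t Size>
class ArenaMedia final : public IMedia
{
public:
    std::pmr::memory_resource& getMemoryResource() override { return pool_; }

private:
    alignas(std::max_align_t) std::byte arena_[Size];
    std::pmr::monotonic_buffer_resource pool_{arena_, Size, std::pmr::null_memory_resource()};
};

// Hypothetical top-layer helper: the client asks for message storage and the
// request ends up at whatever resource the media provides.
inline void* allocateMessageStorage(IMedia&     media,
                                    std::size_t size,
                                    std::size_t align = alignof(std::max_align_t))
{
    return media.getMemoryResource().allocate(size, align);
}
```

The client code is the same for both media types; only the media decides where the bytes actually live.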

Nunavut would then serialize the data such that the serialization buffer is allocated from the same memory resource in one or more fragments; if the message contains sections that happen to contain data in a wire-compatible format (little-endian, correct padding, IEEE 754, etc.), such sections will not be serialized but instead a view of them will be emitted as part of the output of the serialization routine; when data requires conversion it will be copied into a new temporary buffer from the same PMR.
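To make the fragment idea concrete, here is a rough sketch. It is not Nunavut's API and the byte layout is made up rather than DSDL-conformant; `Fragment`, `ImageMessage`, `serialize`, and `release` are hypothetical. The small header is converted into a buffer taken from the PMR, while the large blob is emitted as a view with no copy.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <vector>

struct Fragment
{
    const std::uint8_t* data;
    std::size_t         size;
    bool                owned;  // true: allocated by the serializer from the PMR; false: view into the message
};

// Hypothetical message: a small header field plus a large blob (think imagery data).
struct ImageMessage
{
    std::uint32_t                  sequence;
    std::pmr::vector<std::uint8_t> pixels;
};

std::pmr::vector<Fragment> serialize(const ImageMessage& msg, std::pmr::memory_resource& mr)
{
    std::pmr::vector<Fragment> out{&mr};
    out.reserve(2);
    // Converted section: sequence number and blob length, written explicitly as little-endian.
    constexpr std::size_t HeaderSize = 8;
    auto*      hdr = static_cast<std::uint8_t*>(mr.allocate(HeaderSize, 1));
    const auto len = static_cast<std::uint32_t>(msg.pixels.size());
    for (std::size_t i = 0; i < 4; ++i)
    {
        hdr[i]     = static_cast<std::uint8_t>(msg.sequence >> (8U * i));
        hdr[4 + i] = static_cast<std::uint8_t>(len >> (8U * i));
    }
    out.push_back({hdr, HeaderSize, true});
    // Wire-compatible section: raw bytes need no conversion, so only a view is emitted.
    out.push_back({msg.pixels.data(), msg.pixels.size(), false});
    return out;
}

// Frees only the fragments the serializer allocated; views are left alone.
void release(std::pmr::vector<Fragment>& frags, std::pmr::memory_resource& mr)
{
    for (const Fragment& f : frags)
    {
        if (f.owned)
        {
            mr.deallocate(const_cast<std::uint8_t*>(f.data), f.size, 1);
        }
    }
    frags.clear();
}
```

Note that the second fragment is valid only while `msg` is alive and unmodified, which is exactly why the original message object has to be kept around.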

At the output we get some variation of vector<buffer> that needs to be passed on into the lizard; the original message object must be kept alive. See https://github.com/OpenCyphal/libcanard/issues/223 (this one is libcanard but it applies also to the other lizards).

The lizard will then push into its TX queue a sequence of TX items, where each item is an object referencing an ordered list of memory views pointing into the original list of fragments supplied by Nunavut. The lizard can also allocate additional memory for the payload from the same memory resource in order to inject protocol headers into the output TX queue; for libudpard this would mean the UDP frame headers and also the transfer CRC; for libcanard this includes the transfer CRC, padding, and the tail bytes.

At this stage, we get an ordered list of TX queue items, where each item contains:

- Header and/or tail fragment(s) allocated from the media PMR and owned by the TX item itself (meaning they must be freed after the item is transmitted);
- A list of views, of arbitrary length, pointing into the original vector<buffer> from Nunavut; these are to be concatenated upon transmission (preferably by the scatter-gather DMA controller, or by the software itself in the absence thereof).
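A sketch of what such a TX item could look like, together with the software fallback for platforms without scatter-gather DMA. None of these names exist in the lizards; `View`, `TxItem`, `gather`, and `freeOwned` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory_resource>
#include <vector>

// A non-owning view of contiguous bytes.
struct View
{
    const std::uint8_t* data;
    std::size_t         size;
};

struct TxItem
{
    View                   header;   // allocated from the media PMR, owned by this item
    std::pmr::vector<View> payload;  // views into the serializer output, not owned
    View                   tail;     // allocated from the media PMR, owned by this item
};

// Software concatenation: header, then payload views in order, then tail.
// Returns the number of bytes written, or 0 if the frame is empty or does not fit.
std::size_t gather(const TxItem& item, std::uint8_t* dst, std::size_t capacity)
{
    std::size_t total = item.header.size + item.tail.size;
    for (const View& v : item.payload)
    {
        total += v.size;
    }
    if (total > capacity)
    {
        return 0;
    }
    std::size_t offset = 0;
    const auto  append = [&](const View& v) {
        if (v.size > 0)
        {
            std::memcpy(dst + offset, v.data, v.size);
            offset += v.size;
        }
    };
    append(item.header);
    for (const View& v : item.payload)
    {
        append(v);
    }
    append(item.tail);
    return offset;
}

// Called after transmission: frees only what the item owns.
void freeOwned(TxItem& item, std::pmr::memory_resource& mr)
{
    if (item.header.data != nullptr)
    {
        mr.deallocate(const_cast<std::uint8_t*>(item.header.data), item.header.size, 1);
    }
    if (item.tail.data != nullptr)
    {
        mr.deallocate(const_cast<std::uint8_t*>(item.tail.data), item.tail.size, 1);
    }
    item.header = {nullptr, 0};
    item.tail   = {nullptr, 0};
}
```

With a scatter-gather DMA controller, `gather` would be replaced by a descriptor chain built from the same three parts.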

The transport implementation would then be responsible for deallocating the memory fragments as they are consumed by the media layer, finally deallocating the original message object when no references to it are left. This sounds convoluted but I think it should be possible to approach this sensibly by introducing some notion of reference counting on memory fragments; this is out of the scope right now.

One major issue with this direct approach is that it doesn't easily allow splitting outgoing transfers across multiple network interfaces, as the allocator is tied to a specific media instance. We could consider two options:

1. Some platforms may be able to utilize the same memory region for all NICs. For example, it could be a memory area reachable by the DMA controllers attached to distinct network interfaces, or DTCM (without DMA), etc. In this case, attaching the PMR to a specific media ceases to make sense, as it should instead be extracted into a new independent entity, a buffer memory manager that can work with all media instances. Where that is not possible, it would degrade into an ordinary PMR without any superpowers. This PMR is to be accessed by all parties: by the application (for message object allocation), by Nunavut (for buffer fragment allocation), and by the lizard (for frame metadata buffer allocation). This option requires careful reference counting if a message is sent over multiple network interfaces concurrently, especially if they are managed by different lizards; this can potentially get rather complex.
2. We could allow the lizards to perform full deep copying, but only once. This choice allows us to keep dedicated memory managers per media instance and, at the same time, avoid reference counting across distinct lizards. A desirable side effect is that the application will no longer need to transfer ownership of its message object upon transmission, allowing such objects to be allocated statically rather than from the heap (or whatever memory manager the media PMR is underpinned by); Nunavut can likewise operate using static memory buffers without requiring heap allocation during serialization -- this is not incompatible with the concept of scattered buffers and low-copy serialization of large data blobs that do not require conversion (such as byte order swapping etc).

At the moment, the second option appears more compelling, so it will be pursued moving forward.
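The "deep copy, but only once" option can be sketched as follows. This is an illustration under assumptions, not lizard code: `View`, `OwnedBuffer`, `deepCopyOnce`, and `release` are hypothetical names. The serializer output stays with the application, and each media gets one contiguous copy allocated from its own resource.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory_resource>
#include <vector>

struct View
{
    const std::uint8_t* data;
    std::size_t         size;
};

// A buffer that remembers which media resource it came from.
struct OwnedBuffer
{
    std::uint8_t*              data;
    std::size_t                size;
    std::pmr::memory_resource* mr;
};

// The single deep copy: scattered fragments go into one buffer from the media's PMR.
OwnedBuffer deepCopyOnce(const std::vector<View>& fragments, std::pmr::memory_resource& media_mr)
{
    std::size_t total = 0;
    for (const View& f : fragments)
    {
        total += f.size;
    }
    auto*       dst    = static_cast<std::uint8_t*>(media_mr.allocate(total, 1));
    std::size_t offset = 0;
    for (const View& f : fragments)
    {
        if (f.size > 0)
        {
            std::memcpy(dst + offset, f.data, f.size);
            offset += f.size;
        }
    }
    return {dst, total, &media_mr};
}

void release(OwnedBuffer& buf)
{
    if (buf.data != nullptr)
    {
        buf.mr->deallocate(buf.data, buf.size, 1);
        buf.data = nullptr;
        buf.size = 0;
    }
}
```

Once `deepCopyOnce` returns, the application may reuse or destroy its message immediately, and no reference counting is needed across media. The cost is one copy per redundant interface.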

thirtytwobits commented 6 months ago

Before I can comment fully, do we intend on having a hierarchy of redundancy where the top layer is managed by the transport layer across multiple media devices (e.g. when using CAN and UDP as redundant channels) and the next layer is managed within each media layer device (e.g. when using multiple CAN peripherals) but looks like a single device to the transport layer?

pavel-kirienko commented 6 months ago

No, per the design intention, one IMedia maps to one NIC. I mean, one could conceivably implement IMedia that manages more than one NIC, but this would count as off-label use, your mileage would vary, and the warranty would be void.

A lizard can manage an arbitrary number of redundant media instances. You may recall that we discussed in the past that there is some room for improvement regarding how it's currently done, but even when the improvements are in place, the fundamental model will stay the same: one IMedia = one NIC.

As any given lizard implements only a particular Cyphal transport protocol, heterogeneous redundancy requires a higher-level aggregation where the specifics of a given transport protocol are abstracted away; we will be approaching this like in PyCyphal -- with the help of the RedundantTransport protocol. See the diagram in the design doc, section "Homogeneous and heterogeneous redundancy".

thirtytwobits commented 6 months ago

> We could allow the lizards to perform full deep copying, but only once.

My concern is, without serializing a message directly into an output buffer, the user is doomed to perform full deep copying N times where N is the number of redundant interfaces. If we design a way for serialization to occur directly into an output buffer, this becomes N - 1 deep copies after serialization.

thirtytwobits commented 6 months ago

> ...need to transfer ownership of its message object upon transmission

If we are serializing into an output buffer then the application doesn't need to transfer ownership of the object-representation. The memory we are obtaining from IMedia for transmission would be distinct from the memory we obtained for reception where we de-serialized the data into object form*.

* This thread is about transmission but when we get to reception we should talk about lazy deserialization as a feature

pavel-kirienko commented 6 months ago

> My concern is, without serializing a message directly into an output buffer, the user is doomed to perform full deep copying N times where N is the number of redundant interfaces. If we design a way for serialization to occur directly into an output buffer, this becomes N - 1 deep copies after serialization.

This is true, but observe how the reference counting across several lizards required by this approach can potentially turn into a formidable can of worms :worm:

> If we are serializing into an output buffer then the application doesn't need to transfer ownership of the object-representation.

If we apply the low-copy approach across the stack, then the output of Nunavut-generated serialization routines may contain references to the original message object (remember the example with imagery data?), from which follows:

- The message object must be allocated in a DMA-reachable (or otherwise usable for efficient transmission) memory region.
- The message object must remain intact for an arbitrary amount of time until the transmission is completed.

One way to achieve both is to allocate the message object from that DMA-compatible region and then hand it over to the transport layer upon transmission to let it dispose of it when it is no longer needed.
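One way to picture this handover is a `std::unique_ptr` whose deleter returns the object to the resource it came from. This is only a sketch; `PmrDeleter`, `makePmrUnique`, `ImageMessage`, and `Transport` are hypothetical and not part of LibCyphal.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <memory_resource>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Destroys the object and returns its storage to the originating resource.
template <typename T>
struct PmrDeleter
{
    std::pmr::memory_resource* mr;
    void operator()(T* p) const noexcept
    {
        p->~T();
        mr->deallocate(p, sizeof(T), alignof(T));
    }
};

template <typename T>
using PmrUnique = std::unique_ptr<T, PmrDeleter<T>>;

template <typename T, typename... Args>
PmrUnique<T> makePmrUnique(std::pmr::memory_resource& mr, Args&&... args)
{
    // Assumption of this sketch: construction cannot throw, so no cleanup path is needed.
    static_assert(std::is_nothrow_constructible_v<T, Args...>, "T must be nothrow-constructible");
    void* raw = mr.allocate(sizeof(T), alignof(T));
    return PmrUnique<T>(::new (raw) T(std::forward<Args>(args)...), PmrDeleter<T>{&mr});
}

struct ImageMessage
{
    std::uint32_t sequence;
    std::uint8_t  pixels[16];
};

// Hypothetical transport: takes ownership on send and disposes of the message
// once the transmission is reported complete.
class Transport
{
public:
    void        send(PmrUnique<ImageMessage> msg) { in_flight_.push_back(std::move(msg)); }
    void        onTransmissionComplete() { in_flight_.clear(); }
    std::size_t inFlight() const { return in_flight_.size(); }

private:
    std::vector<PmrUnique<ImageMessage>> in_flight_;
};
```

After `send`, the application's pointer is empty, so it cannot modify the message while the transport or the DMA engine is still reading it.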

I am not saying these are insurmountable issues. They are quite manageable. We just need to choose between

thirtytwobits commented 6 months ago

Okay. I suppose your preferred solution is adequate. It may be that libcyphal will always be a bit less optimized than we'd like for MCUs. I've been wondering if we should start a "cyphal-ard" project which is a minimal application-layer in C where such close-to-the-metal integrations would be more plausible?

pavel-kirienko commented 6 months ago

> I've been wondering if we should start a "cyphal-ard" project which is a minimal application-layer in C where such close-to-the-metal integrations would be more plausible?

We could entertain that thought just to see if there are sensible solutions to the problem of zero-copy transmission over heterogeneously redundant interfaces, and then try and transfer that back to LibCyphal.

Suppose you started a new C library from scratch and want to send zero-copy messages over CAN and UDP. What would be different compared to where we are now?

pavel-kirienko commented 5 months ago

Here's a full design based on the second option from the second comment in this thread.

We extend IMedia with the memory resource getters:

    virtual std::pmr::memory_resource& getTxMemoryResource() = 0;
    virtual std::pmr::memory_resource& getRxMemoryResource() = 0;

The first one will be used for the allocation of TX frame payload buffers and ancillary data structures by the lizard. For example, LibUDPard allocates UdpardTxItem and its payload in one memory fragment (the TX item is followed by its payload); LibCANard does the same with CanardTxQueueItem. The lizard or the client can both allocate and deallocate memory using this memory resource. More on this below.

The second one will be used for the allocation of RX frame payload buffers by the IMedia, and their deallocation upon consumption by the lizard or the client. For example, LibUDPard takes ownership of the buffer memory via udpardRxSubscriptionReceive and udpardRxRPCDispatcherReceive; LibCANard currently copies the input data but this may change in the future to follow LibUDPard.

The TX memory resource (MR) will be used to allocate memory for the lizard when it needs to enqueue a new TX item. If that item never makes it to the IMedia (for example, if it times out or the transmission is canceled for other reasons like running out of queue space or memory), the memory is freed using the same MR. If the item actually makes it to IMedia, IMedia::push takes ownership of the buffer, so the client doesn't need to free it. What happens to the buffer afterward is none of the client's concern; the media will take care of everything. This is an alternative to what Scott described as the "deallocate to transmit" behavior.
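The TX ownership rule can be sketched like this. The `push` signature, `TxBuffer`, `LoopbackMedia`, `CountingResource`, and `transmitOrDrop` are all assumptions made for illustration; the real `IMedia::push` may look quite different.

```cpp
#include <cstddef>
#include <cstring>
#include <memory_resource>

// Test helper: forwards to the heap and counts live allocations.
class CountingResource final : public std::pmr::memory_resource
{
public:
    std::size_t outstanding() const { return outstanding_; }

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override
    {
        void* p = std::pmr::new_delete_resource()->allocate(bytes, align);
        ++outstanding_;
        return p;
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override
    {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
        --outstanding_;
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override
    {
        return this == &other;
    }
    std::size_t outstanding_ = 0;
};

struct TxBuffer
{
    void*       data;
    std::size_t size;
};

class IMedia
{
public:
    virtual ~IMedia() = default;
    virtual std::pmr::memory_resource& getTxMemoryResource() = 0;
    virtual std::pmr::memory_resource& getRxMemoryResource() = 0;
    // Hypothetical signature: when this returns true, the media owns the buffer.
    virtual bool push(const TxBuffer& buffer) = 0;
};

// Pretends to transmit instantly, then frees the buffer it now owns.
class LoopbackMedia final : public IMedia
{
public:
    std::pmr::memory_resource& getTxMemoryResource() override { return tx_; }
    std::pmr::memory_resource& getRxMemoryResource() override { return rx_; }
    bool push(const TxBuffer& buffer) override
    {
        ++sent_;
        tx_.deallocate(buffer.data, buffer.size);
        return true;
    }
    std::size_t sentCount() const { return sent_; }
    std::size_t txOutstanding() const { return tx_.outstanding(); }

private:
    CountingResource tx_;
    CountingResource rx_;
    std::size_t      sent_ = 0;
};

// Lizard-side logic: allocate from the TX MR, then either hand over or free.
bool transmitOrDrop(IMedia& media, const void* payload, std::size_t size, bool expired)
{
    std::pmr::memory_resource& mr = media.getTxMemoryResource();
    TxBuffer                   buf{mr.allocate(size), size};
    if (size > 0)
    {
        std::memcpy(buf.data, payload, size);
    }
    if (!expired && media.push(buf))
    {
        return true;  // the media owns the buffer now; do not touch it again
    }
    mr.deallocate(buf.data, buf.size);  // never reached the media: free with the same MR
    return false;
}
```

In both outcomes the buffer is released exactly once, and always through the TX resource that allocated it.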

The RX memory resource may map to a DMA-addressable or otherwise RX-optimized memory region. If it does, it may offer benefits to the media implementation, allowing it to forward data received from the hardware to the higher layers very efficiently. The lizard and the client are oblivious to that. It should be noted, however, that the media has no control over how long the lizard/client will keep using the memory: it will typically make it all the way up to the Nunavut deserializer, and possibly even to the application, should Nunavut be able and choose to keep references to the memory instead of copying it during deserialization.

Note that the lizard and the client will use the RX memory resource only for deallocation, never for allocation; the allocation is done by the media alone. In LibUDPard this is expressed through the type system via a special kind of memory resource called UdpardMemoryDeleter.

This design should be implemented both in libcyphal::transport::can and libcyphal::transport::udp. Currently, LibCANard does not allow the client to route allocation requests to different allocators depending on the purpose, as LibUDPard does via separate UdpardMemoryResource instances, but this capability will be added eventually (see https://github.com/OpenCyphal/libcanard/issues/225); for now, we could simply serve all allocations from the TX memory resource as a stop-gap measure with a huge TODO comment.

pavel-kirienko commented 5 months ago

For now, LibCyphal must require that all IMedia instances used within a transport refer to the same underlying memory resource. This is needed because our lizards free and allocate memory for all redundant interfaces using a shared set of memory resources. For example, frames arriving from distinct redundant transports will all be freed by the same memory resource, and frames destined for transmission via multiple redundant transports will likewise be allocated by the same memory resource.

It is relatively easy to extend the lizards such that they maintain a dedicated memory resource per redundant interface. When that is done, the above requirement that all IMedias use the same PMR will be lifted.
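Until that requirement is lifted, a transport could check it when media instances are attached. The sketch below relies on the standard equality comparison of `std::pmr::memory_resource`; `SimpleMedia` and `shareMemoryResources` are hypothetical names, and the interface is reduced to the two getters.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

class IMedia
{
public:
    virtual ~IMedia() = default;
    virtual std::pmr::memory_resource& getTxMemoryResource() = 0;
    virtual std::pmr::memory_resource& getRxMemoryResource() = 0;
};

// Minimal media that simply exposes the resources it was given.
class SimpleMedia final : public IMedia
{
public:
    SimpleMedia(std::pmr::memory_resource& tx, std::pmr::memory_resource& rx) : tx_(tx), rx_(rx) {}
    std::pmr::memory_resource& getTxMemoryResource() override { return tx_; }
    std::pmr::memory_resource& getRxMemoryResource() override { return rx_; }

private:
    std::pmr::memory_resource& tx_;
    std::pmr::memory_resource& rx_;
};

// True if every media reports TX and RX resources equal to those of the first one.
// Two resources compare equal when memory from one can be freed by the other.
bool shareMemoryResources(const std::vector<IMedia*>& media)
{
    for (std::size_t i = 1; i < media.size(); ++i)
    {
        const bool same_tx = media[i]->getTxMemoryResource() == media[0]->getTxMemoryResource();
        const bool same_rx = media[i]->getRxMemoryResource() == media[0]->getRxMemoryResource();
        if (!same_tx || !same_rx)
        {
            return false;
        }
    }
    return true;
}
```

A transport factory could reject a media set for which this returns false, instead of failing later with a mismatched deallocation.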

LibUDPard

The TX pipeline is already managing a dedicated MR per redundant transport; see UdpardTx::memory. No changes are needed here.

The RX pipeline will require adjustments:

- UdpardRxMemoryResources::payload will be removed.
- udpardRxSubscriptionReceive and udpardRxRPCDispatcherReceive will accept a deleter for the payload.
- The library will need to store the deleter internally together with each received fragment such that it is able to delegate the deletion to the correct media.
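The per-fragment deleter can be illustrated in C++ terms as follows. This is a sketch of the concept only, not the LibUDPard API: `MemoryDeleter` loosely mirrors the idea of a free-only handle, and `RxFragment`, `receiveFrom`, `releaseAll`, and `CountingResource` are hypothetical.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Test helper: forwards to the heap and counts live allocations.
class CountingResource final : public std::pmr::memory_resource
{
public:
    std::size_t outstanding() const { return outstanding_; }

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override
    {
        void* p = std::pmr::new_delete_resource()->allocate(bytes, align);
        ++outstanding_;
        return p;
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override
    {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
        --outstanding_;
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override
    {
        return this == &other;
    }
    std::size_t outstanding_ = 0;
};

// Free-only handle: the holder can return memory but cannot allocate any.
class MemoryDeleter
{
public:
    explicit MemoryDeleter(std::pmr::memory_resource& mr) : mr_(&mr) {}
    void operator()(void* p, std::size_t size) const { mr_->deallocate(p, size); }

private:
    std::pmr::memory_resource* mr_;
};

// Each received fragment carries the deleter of the media that produced it.
struct RxFragment
{
    void*         data;
    std::size_t   size;
    MemoryDeleter deleter;
};

// Media side: the only place where RX memory is allocated.
RxFragment receiveFrom(std::pmr::memory_resource& rx_mr, std::size_t size)
{
    return RxFragment{rx_mr.allocate(size), size, MemoryDeleter{rx_mr}};
}

// Consumer side: every fragment goes back to the media it came from.
void releaseAll(std::vector<RxFragment>& fragments)
{
    for (RxFragment& f : fragments)
    {
        f.deleter(f.data, f.size);
    }
    fragments.clear();
}
```

A transfer reassembled from frames that arrived over different redundant interfaces can then be freed without the consumer knowing which media owns which fragment.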

LibCANard

CanardTxQueue will need to have a dedicated memory resource tied to its redundant interface.

canardRxAccept will need to accept a deleter similar to LibUDPard. Internally, that deleter will be stored per RX frame.

OpenCyphal-Garage / libcyphal

Reconsider the IMedia design to support low-copy/zero-copy rx/tx operation #352
