About dynamic memory (de)allocation

Hi! Is it possible to operate in a mode in which no dynamic memory operations take place after start-up? That is, all resources (memory) are allocated up to the maximum (in accordance with the QoS/configuration) during initialization of the application, and afterwards, during discovery and while receiving/sending messages over the network, no calls to malloc(), realloc() and free() are made. This functionality is important to avoid CPU-intensive dynamic memory operations and to improve the real-time performance of an application that uses DDS.

Hi @i-and. Unfortunately no. At this point it's not possible to run without dynamic memory allocation or do all allocations when the process is started. It is possible to tweak allocation behavior. Whether or not allocation will be a problem depends a bit on your data flow as ddsi allocates buffers to contain multiple samples at once and only allocates extra buffers if required. Therefore, if your data flow is limited, a single buffer allocation might just be enough. It's usually best to do some testing and see if behavior can be adjusted to match your needs. Of course, I'll be happy to help you out where I can.

For my information, just out of curiosity, it seems like you're trying to run Cyclone DDS on a memory-constrained device. Can I ask what platform you're using? I'm currently porting Cyclone DDS to FreeRTOS+lwIP, maybe there are some things you're running into that I can solve/use in the process.

Hi @i-and, while @k0ekk0ek's comment is entirely correct, it restricts itself to what is rather than to what is possible with some effort. Making it possible to operate with no (or restricted) dynamic memory allocation is something I personally find worthwhile and interesting, and I think it is not as far away as one might think at first.

Transmit side:

The storage for providing reliability/transient-local data on the writing side is handled by the WHC, which is completely pluggable internally already (there is even a special memory-less one being used for the built-in topics now). Plugging a different one in is more a limitation of the API than of technical feasibility.
The representation of samples is also pluggable (again, the same considerations apply about there not being interfaces to easily use a different one). Allocation is handled inside the plug-in, and so a special implementation that has a fixed-size pool (possibly even with a direct mapping from topic and key value to address) is not a complex project.
The "xmsg" and "xpack" structures needed for transmitting data are currently managed simply with malloc/free, but these are transient things and the number required at any one time can already be limited (limiting the number of queued messages for retransmit should do a lot for this). So wholesale replacement of malloc/free with something smarter (a topic I will get to below) would help manage these in a static environment.
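A minimal sketch of the fixed-size pool idea for transient transmit-side objects, written in plain C. The names `struct msg`, `msg_pool` and `MSG_POOL_SIZE` are hypothetical and do not correspond to the actual xmsg/xpack definitions in Cyclone DDS; the sketch is also not thread-safe, which a real implementation would have to address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a transient transmit-side object;
   this is NOT the actual Cyclone DDS xmsg layout. */
struct msg {
  struct msg *next_free;   /* intrusive free-list link, only meaningful while free */
  uint32_t len;
  unsigned char data[256];
};

#define MSG_POOL_SIZE 8    /* assumed upper bound on messages in flight */

struct msg_pool {
  struct msg slots[MSG_POOL_SIZE];
  struct msg *free_list;
  size_t in_use;
};

/* Thread all slots onto the free list; called once at start-up. */
static void msg_pool_init(struct msg_pool *p)
{
  p->free_list = NULL;
  p->in_use = 0;
  for (size_t i = 0; i < MSG_POOL_SIZE; i++) {
    p->slots[i].next_free = p->free_list;
    p->free_list = &p->slots[i];
  }
}

/* O(1) allocation; returns NULL once the bound is reached, so the caller
   has to apply back-pressure instead of growing the heap. */
static struct msg *msg_pool_alloc(struct msg_pool *p)
{
  struct msg *m = p->free_list;
  if (m == NULL)
    return NULL;
  p->free_list = m->next_free;
  m->len = 0;
  p->in_use++;
  return m;
}

/* O(1) release back to the pool. */
static void msg_pool_free(struct msg_pool *p, struct msg *m)
{
  assert(m >= &p->slots[0] && m <= &p->slots[MSG_POOL_SIZE - 1]);
  m->next_free = p->free_list;
  p->free_list = m;
  p->in_use--;
}
```

Because the pool is sized once, exhausting it becomes a flow-control decision (wait, or drop a queued retransmit) rather than a heap operation.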

Receive path:

There is a "challenge" 😁 on the receive path that I have described before (see #28, in particular the "radmin" bit) with the buffering of incoming packets. Defragmentation and reorder buffers can already be limited in the number of samples, though that should be replaced by a limit in the number of bytes to be honest ... The nice thing about defrag/reorder is that this happens before acknowledging the data and that therefore you can dynamically allocate from a fixed-size pool, just like any other caching mechanism. It would probably require wholesale replacement of q_radmin.c but that is not a daunting task.
Memory use by the reader history caches: I haven't gotten around to replacing the global ddsi_plugin abomination yet ... but I think it is fairly obvious that the plan is for each RHC to be completely pluggable just like the WHC already is. My gut feeling is that the current RHC could pretty easily be modified to pre-allocate everything, but even if that feeling were to prove incorrect, having it pluggable would give you a nice way out. (One of the plans with pluggable WHC & RHC is to be able to pick an optimised implementation for the simplest cases ... so for example a simple queue like ROS2 uses could then be implemented as nothing more complicated than that, which will give a nice performance boost for those simple cases.)
Actual storage of sample data: see above, reader and writer use the same code for this.
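The point about limiting defragmentation/reorder buffering in bytes, and being free to refuse data that has not been acknowledged yet, can be sketched as a simple byte budget. Everything here (`rbuf_budget` and its functions) is a hypothetical illustration, not code from q_radmin.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical byte budget for receive data that is not yet acknowledged. */
struct rbuf_budget {
  size_t limit_bytes;
  size_t used_bytes;
  unsigned dropped;   /* fragments refused because the budget was exhausted */
};

static void rbuf_budget_init(struct rbuf_budget *b, size_t limit_bytes)
{
  b->limit_bytes = limit_bytes;
  b->used_bytes = 0;
  b->dropped = 0;
}

/* Try to account for an incoming fragment. Returning false means "drop it":
   the data has not been acknowledged, so a reliable writer will send it again. */
static bool rbuf_budget_reserve(struct rbuf_budget *b, size_t nbytes)
{
  if (nbytes > b->limit_bytes - b->used_bytes) {
    b->dropped++;
    return false;
  }
  b->used_bytes += nbytes;
  return true;
}

/* Give the bytes back once the sample has been delivered or discarded. */
static void rbuf_budget_release(struct rbuf_budget *b, size_t nbytes)
{
  assert(nbytes <= b->used_bytes);
  b->used_bytes -= nbytes;
}
```

The budget only does the accounting; the storage itself would come from a preallocated region of `limit_bytes` bytes.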

Background activities (i.e., the tev thread):

The number of these is directly related to the number of entities + the number of queued retransmit messages + the number of queued discovery messages. Clearly, the number of these is manageable. Now these are dynamically allocated, and I think you'd go mad trying to entirely eliminate the dynamic allocation, but their number can be bounded and their sizes are pretty much constant, so using a different allocation strategy (that one again!) could do wonders.

Discovery:

I think this is a bit tricky: you have to discover the rest of the world and in the general case that requires an unpredictable amount of memory. I suppose you would only ever run without dynamic allocation in a system of known size, so you could bound it. Preallocating proxy entities might be tricky, but probably not impossible, and in any case reserving sufficient memory and a special allocator could do a lot.
Then there are the various strings, sequences of strings and other things in the QoS. Don't yet have an easy answer for that ...

Other assorted stuff:

There is the key value to instance id map to look after. Allocating a sufficiently large hash table at startup would probably address the problem.
The entities in the dds_* files (because the above mostly is about the lower levels in the stack) are one-on-one related to creation of entities in the application, rather than with the data. So that would be manageable — at worst you could just create everything sequentially in the application code, which would ensure the same sequence of allocations each time it is started, and then simply use a preallocated pool of sufficient size. I wouldn't call that "elegant", but it should be effective nonetheless.
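The "create everything sequentially and use a preallocated pool" approach can be illustrated with a bump (arena) allocator. This is a sketch under assumptions: `struct arena` and `ARENA_SIZE` are made up for the example, and a real pool would be sized from the application's entity count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE 4096  /* assumed size; derive it from the entity count in practice */

struct arena {
  _Alignas(max_align_t) unsigned char mem[ARENA_SIZE];
  size_t used;
};

static void arena_init(struct arena *a)
{
  a->used = 0;
}

/* Bump allocation: round the request up to the maximum alignment and advance
   the offset. Nothing is freed individually, which suits entities that live
   for the whole run of the application. */
static void *arena_alloc(struct arena *a, size_t size)
{
  const size_t align = _Alignof(max_align_t);
  if (size > SIZE_MAX - (align - 1))
    return NULL;                              /* rounding would overflow */
  size_t rounded = (size + align - 1) & ~(align - 1);
  if (rounded > ARENA_SIZE - a->used)
    return NULL;                              /* pool too small */
  void *p = &a->mem[a->used];
  a->used += rounded;
  return p;
}
```

With the same creation order at every start, each entity ends up at the same offset in the pool, so an undersized pool shows up immediately during start-up instead of at some later point in operation.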

So I think getting to static allocation — or very nearly there as a first step — hinges on but a few key things:

sample representation
writer history cache implementation
reader history cache implementation
"radmin" replacement
replacement of all malloc/free calls by calls into type-specific allocators, where the implementation of these allocators becomes pluggable

I'm really glossing over what exactly I mean by those allocators, but rather than sitting here typing one of the longest comments on GitHub ever, I'd like to point you to The Slab Allocator: An Object-Caching Kernel Memory Allocator (Bonwick, 1994) and Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources (Bonwick and Adams, 2001), two well-known papers originating in the Sun Microsystems kernel engineering group. The reasoning they describe applies here, too, and it will probably make clear what I mean even though it doesn't apply one-on-one. Do keep in mind that the memory allocator design business has evolved since then: Linux used a similar one but then improved on it, the behaviour of modern C library malloc is far better than the old ones, &c.
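One possible reading of "type-specific allocators with a pluggable implementation" is an interface of function pointers with one instance per object type. The sketch below is an assumption about what that could look like, not an existing Cyclone DDS interface; it is far simpler than the slab design in the papers. All names (`type_allocator`, `static_allocator`, the slot sizes) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator interface; one instance per object type. */
struct type_allocator {
  void *(*alloc)(struct type_allocator *self);
  void (*release)(struct type_allocator *self, void *obj);
  size_t obj_size;
};

/* Default plug-in: simply forwards to malloc/free. */
static void *heap_alloc(struct type_allocator *self)
{
  return malloc(self->obj_size);
}

static void heap_release(struct type_allocator *self, void *obj)
{
  (void)self;
  free(obj);
}

static struct type_allocator heap_allocator(size_t obj_size)
{
  struct type_allocator a = { heap_alloc, heap_release, obj_size };
  return a;
}

/* Static plug-in: a fixed number of slots with an in-use flag per slot. */
#define STATIC_SLOTS 4
#define STATIC_SLOT_SIZE 64

struct static_allocator {
  struct type_allocator base;  /* first member, so the pointers are convertible */
  _Alignas(max_align_t) unsigned char slots[STATIC_SLOTS][STATIC_SLOT_SIZE];
  unsigned char in_use[STATIC_SLOTS];
};

static void *static_alloc(struct type_allocator *self)
{
  struct static_allocator *sa = (struct static_allocator *)self;
  for (size_t i = 0; i < STATIC_SLOTS; i++) {
    if (!sa->in_use[i]) {
      sa->in_use[i] = 1;
      return sa->slots[i];
    }
  }
  return NULL;  /* bound reached */
}

static void static_release(struct type_allocator *self, void *obj)
{
  struct static_allocator *sa = (struct static_allocator *)self;
  size_t i = (size_t)((unsigned char *)obj - &sa->slots[0][0]) / STATIC_SLOT_SIZE;
  assert(i < STATIC_SLOTS && sa->in_use[i]);
  sa->in_use[i] = 0;
}

static void static_allocator_init(struct static_allocator *sa, size_t obj_size)
{
  assert(obj_size <= STATIC_SLOT_SIZE);
  sa->base.alloc = static_alloc;
  sa->base.release = static_release;
  sa->base.obj_size = obj_size;
  for (size_t i = 0; i < STATIC_SLOTS; i++)
    sa->in_use[i] = 0;
}
```

Code that needs an object calls `a->alloc(a)` and `a->release(a, obj)` and does not care which plug-in is behind the pointer, so a desktop build could keep the heap version while an embedded build selects the static one.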

Not all of this is high on my priority list. What I am after is a very high performance implementation with great freedom in how the various aspects are implemented, and that requires most of the above. I need that freedom to slowly move away from the constraints imposed by the rather limited thinking behind the DDS specification; and that performance to allow providing a good DDS emulation on top of it. To me, the goal is not to build the best DDS implementation (no matter what I found at the very beginning of the README when I first saw it 😁), but to try out some closely related ideas I developed over the years that should form a better basis for building fault-tolerant distributed reactive systems — buzz-wordism! though they are old buzz-words and do date me a bit ...

Anyway, just wanted to give you some ideas on where things can head. You're very much invited to help!

@k0ekk0ek and @eboasson, thanks for your detailed answers and offers of help.

I have not yet decided on the target platform for running Cyclone DDS. A Cortex-M4 at 180 MHz or a Cortex-M7 at 200/400 MHz, with up to 1 MB RAM and a 100 Mb Ethernet controller, is currently under consideration. In this case, the load of the Ethernet channel with application data is 60%. Is it your goal to run the Cyclone DDS stack on the above limited resources?

Not all of this is high on my priority list.

Can you clarify your roadmap for the implementation of these optimizations for working with dynamic memory?

The launch of Cyclone DDS on processor resources of the "microcontroller" type will allow it to be used in distributed microcontroller systems for real-time control tasks (after the implementation of RT-Ethernet extensions at the MAC-level at the next stage). It will be a fantastic result )

Hi @i-and,

A Cortex-M4 at 180 MHz or a Cortex-M7 at 200/400 MHz, with up to 1 MB RAM and a 100 Mb Ethernet controller, is currently under consideration.

Some time ago @k0ekk0ek and I did a quick experiment on a Cortex M4 at 180MHz, FreeRTOS and lwIP and that worked. That experiment is being turned into proper support for that platform by @k0ekk0ek and so one hurdle should be cleared pretty soon.

In this case, the load of the Ethernet channel with application data is 60%.

It depends very much on the size of samples, as the protocol overhead is quite significant for small ones and the processing overhead must not be neglected either. To give some idea, I've been playing a bit with a pair of RPi3s (Raspbian GNU/Linux 8.0) and I need to use samples ≥ ~100 bytes to get a payload rate ≥ 60Mbps.

The CPU load in that case is close to 100% for a single core; in other words, an M4 would not keep up with that. At 1400 bytes, CPU loads go down to ~ 25%, still a bit on the high side, but the kind of changes to eliminate dynamic allocation will likely improve that.

As things stand today, you'd be pushing it a bit, but given some time, it doesn't seem an unrealistic target for performance.
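To make the dependence on sample size concrete, here is a small calculation of how many samples per second are needed to sustain a given payload rate. It assumes the same 60 Mbps payload target for both sizes, which the measurements above do not state for the 1400-byte case, and it ignores protocol headers entirely. The function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Samples per second needed to sustain `payload_bps` (bits per second of
   application payload) with samples of `sample_bytes` bytes, rounded up. */
static uint64_t samples_per_second(uint64_t payload_bps, uint64_t sample_bytes)
{
  assert(sample_bytes > 0);
  uint64_t bits_per_sample = sample_bytes * 8;
  return (payload_bps + bits_per_sample - 1) / bits_per_sample;
}
```

For 60 Mbps this gives 75000 samples per second at 100 bytes and 5358 at 1400 bytes, so any fixed per-sample cost (including allocation) is paid roughly 14 times as often with the small samples.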

up to 1 MB RAM

That, too, seems a bit tight, but memory use is also very much dependent on the scale of the system. I would suggest looking at your intended application and working out how many nodes, processes, readers and writers, and even instances you expect. I don't know offhand how much memory that would translate into, but that is something one can determine through analysis or experimentally.

Is it your goal to run the Cyclone DDS stack on the above limited resources?

Yes, I think there is a lot of value in being able to run under these conditions.

Can you clarify your roadmap for the implementation of these optimizations for working with dynamic memory?

Give me a few days — I've promised to strive for a first official release by March 1 and I really want to get a release plan/roadmap to go along with that, and this is something to include.

Hi @eboasson,

It depends very much on the size of samples, as the protocol overhead is quite significant for small ones and the processing overhead must not be neglected either. At 1400 bytes, CPU loads go down to ~ 25%, still a bit on the high side...

Do you have data on which part of the system is experiencing the most performance loss when working under these conditions: the writer or reader, the network stack, the serializer, or somewhere else?

Would a transfer method that batches samples into a single Ethernet frame improve the situation? The actual transfer could be triggered by the buffer filling up, by elapsed time, or by an explicit flush(). Do you plan to implement this kind of batching/coalescing?

Do you have data on which part of the system is experiencing the most performance loss when working under these conditions: the writer or reader, the network stack, the serializer, or somewhere else?

I do have some: but it is on macOS — Apple's profiling tools are quite a bit nicer than Linux "perf", not that it has made the macOS kernel very fast ... — on an i7 and over the loopback interface. So a rather different environment ... the numbers I mentioned above came from a quick experiment using an RPi3, so a bit better matched to your target.

Quite apart from that mismatch, I realised that most of the measurements so far have been using the throughput example, meaning reliable, KEEP_ALL writers and readers and a dynamically allocated sequence for payload. For the environment you target, I would expect KEEP_LAST readers/writers, fixed-size samples, and quite possibly using best-effort.

So a little bit of measuring is in order. I don't think what you need is out of reach, but I do think it will take some work.

Would a transfer method that batches samples into a single Ethernet frame improve the situation? The actual transfer could be triggered by the buffer filling up, by elapsed time, or by an explicit flush(). Do you plan to implement this kind of batching/coalescing?

It does batch samples into an Ethernet frame. The current batching is really simplistic: you decide whether to batch or not to batch, and if you do batch, the frames go out when they are full or when explicitly flushed.
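The batching behaviour described here (send when the frame is full or on an explicit flush) can be sketched as follows. This is an illustration only: `struct batcher`, `FRAME_PAYLOAD_MAX` and the `send_fn` hook are hypothetical and do not reflect how Cyclone DDS implements it internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FRAME_PAYLOAD_MAX 1400  /* assumed usable bytes per frame */

/* Hypothetical send hook; a real one would hand the bytes to the transport. */
typedef void (*send_fn)(const unsigned char *buf, size_t len, void *arg);

struct batcher {
  unsigned char buf[FRAME_PAYLOAD_MAX];
  size_t used;
  bool batching;   /* false: every sample goes out on its own */
  send_fn send;
  void *send_arg;
};

static void batcher_init(struct batcher *b, bool batching, send_fn send, void *arg)
{
  assert(send != NULL);
  b->used = 0;
  b->batching = batching;
  b->send = send;
  b->send_arg = arg;
}

/* Send whatever has accumulated: the "explicitly flushed" case. */
static void batcher_flush(struct batcher *b)
{
  if (b->used > 0) {
    b->send(b->buf, b->used, b->send_arg);
    b->used = 0;
  }
}

/* Append one sample; returns false if it can never fit in a single frame. */
static bool batcher_write(struct batcher *b, const void *sample, size_t len)
{
  if (len > FRAME_PAYLOAD_MAX)
    return false;              /* would need fragmentation, out of scope here */
  if (len > FRAME_PAYLOAD_MAX - b->used)
    batcher_flush(b);          /* no room left: the frame goes out first */
  memcpy(b->buf + b->used, sample, len);
  b->used += len;
  if (!b->batching || b->used == FRAME_PAYLOAD_MAX)
    batcher_flush(b);
  return true;
}

/* Example hook that only counts frames and bytes, to observe the behaviour. */
struct frame_stats {
  unsigned frames;
  size_t bytes;
};

static void count_frame(const unsigned char *buf, size_t len, void *arg)
{
  struct frame_stats *s = arg;
  (void)buf;
  s->frames++;
  s->bytes += len;
}
```

A time-based trigger would add a third call site for `batcher_flush`, driven by a timer rather than by the writer.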

Time-based flushing (basically properly implementing the "latency budget" QoS) is definitely planned. There are various other improvements that can be made and that I am totally in favour of (zero-copy networking is a pretty obvious one).

There are various other improvements that can be made and that I am totally in favour of (zero-copy networking is a pretty obvious one).

Experience has shown that, to keep the CPU load acceptable (around 10% on an Intel desktop) when transferring a large stream of regular data at 80 MB per second over Gb Ethernet (for example, streaming uncompressed video), the samples (image lines) must be transferred from the application layer to the Ethernet controller without additional data copying. To that end, the application layer writes each sample (line) into a buffer that is returned by the stack and that is in fact a properly formed MAC/IP/UDP/RTP frame with all the necessary headers (the regular OS network stack is not used). After all the lines of the image have been generated, the Ethernet controller's DMA is started and the lines are sent out on the Ethernet according to the configured hardware scheduler (for example, the AVB credit-based shaper). Hence the question: how realistic is it, in principle, to transmit and receive this kind of traffic using DDS while keeping the CPU load (Intel desktop) at about 10%? Or is a transfer scheme like the one described above not possible, making it necessary to use a separate, specialized low-level library that forms a stream of sample lines via the RTPS protocol (with additional media synchronization mechanisms)? Or is DDS not intended for this mode of operation, in which case it is advisable to use specialized protocols such as RTP, GigE Vision, IEEE 1722, ... instead of RTPS? @eboasson, what do you recommend?

@i-and, the DDS-based projects that I personally know of where such amounts of data were transferred commonly opted for using a dedicated transfer mechanism — including PCIe — for those raw data streams, while using DDS to handle everything else, including the configuration information needed for that dedicated mechanism.

That said, that was done with lesser performing implementations, and I don't think Cyclone is already at the edge of its performance. Moreover, the model you describe is one that closely matches the behaviour of a simple unreliable writer/reader pair in DDS and there most of the overhead (however much it is 😁) comes from sharing all the paths with the reliable communications, and just skipping some bits.

I do think something along the lines you sketch is possible in the context of DDS, at least for the simple case of a single best-effort sample per packet. Most of the pieces of the puzzle should fit as far as I can tell now; indeed, one could experiment with it even in the current state of Cyclone. The buffer management could be handled by a special transport implementation — to have a special one is only reasonable, considering that you're expressly avoiding the normal networking stack — combined with a special internal sample representation that uses the buffers that come from the transport instead of allocating their own. The part where some real changes would be required would be the generation of message headers: now they are allocated dynamically and point to the sample, but for this to work, it would have to write them inside the buffer already holding the sample data.

Constructing a fast path exclusively for non-fragmented, non-packed, send-and-forget data (one that avoids all dynamic allocation and can write the protocol headers inside some buffer space provided by the sample buffer) seems eminently feasible. I am quite positive that the writing side allows this without changing a lot in the way you interact with DDS. Instead of calling "write" and passing it a pointer to the data, you'd probably have to obtain a sample buffer, fill it and pass it to a different write function, but that's not rocket science.
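The "obtain a sample buffer, fill it, then let the stack write the headers into the same buffer" flow could look roughly like the sketch below. It uses the common headroom technique (headers are prepended into space reserved in front of the payload). None of these names exist in Cyclone DDS; `struct sample_buf`, `HEADROOM` and `PAYLOAD_MAX` are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HEADROOM 64       /* assumed space reserved for protocol headers */
#define PAYLOAD_MAX 1400  /* assumed maximum payload per sample */

/* Hypothetical transport-owned buffer: headers are written backwards into
   the headroom so that headers and payload end up contiguous. */
struct sample_buf {
  unsigned char bytes[HEADROOM + PAYLOAD_MAX];
  size_t hdr_start;    /* offset of the first header byte */
  size_t payload_len;
};

/* Step 1: the application obtains the payload area and fills it in place. */
static unsigned char *sample_buf_payload(struct sample_buf *sb)
{
  sb->hdr_start = HEADROOM;
  sb->payload_len = 0;
  return sb->bytes + HEADROOM;
}

/* Step 2: the application states how many payload bytes it wrote. */
static int sample_buf_commit(struct sample_buf *sb, size_t payload_len)
{
  if (payload_len > PAYLOAD_MAX)
    return -1;
  sb->payload_len = payload_len;
  return 0;
}

/* Step 3: the stack prepends a header directly in front of what is already
   there; no allocation and no copy of the payload. */
static int sample_buf_push_header(struct sample_buf *sb, const void *hdr, size_t len)
{
  if (len > sb->hdr_start)
    return -1;                       /* headroom exhausted */
  sb->hdr_start -= len;
  memcpy(sb->bytes + sb->hdr_start, hdr, len);
  return 0;
}

/* Step 4: the contiguous region that would be handed to the transport. */
static const unsigned char *sample_buf_frame(const struct sample_buf *sb, size_t *len)
{
  assert(len != NULL);
  *len = (HEADROOM - sb->hdr_start) + sb->payload_len;
  return sb->bytes + sb->hdr_start;
}
```

Headers are pushed innermost first, so the last one pushed ends up at the front of the frame, directly followed by the others and then the payload.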

On the receive side, I'm sure there are some issues to consider as well; I haven't thought much about those yet. One important observation is that the "loan" mechanism means the basic elements are available for keeping the samples in receive buffers and returning pointers to these.

So I am pretty sure it can be done with a reasonable amount of effort and without adding significant complications to the existing inner workings of the implementation. The gains are obvious: a unified mechanism ... I'd say it is worth further investigation.

Hi @eboasson, your technical vision of the issues under consideration inspires! Summarizing the discussion, it would be nice if the roadmap for the development of Cyclone DDS would include the following:

in the process of operation to eliminate the work with dynamic memory;
to ensure operation on microcontrollers (as an option: FreeRTOS+lwIP+Cortex M4 at 180MHz);
to implement the mode of transmission with burst samples in one Ethernet frame by the criterion of "time";
to minimize the amount of data copying, starting from the application level;
to implement fast data path to ensure the transmission and reception of the stream at the level of 80 - 100 MB per second over Gb Ethernet with support for reliable transmission mode.

A reliable transmission mode for a fast path would be very useful, for example, when transferring a packed video stream, since in this case the loss of one sample leads to the loss of the entire image frame or of a whole sequence of frames. At the same time, to ensure deterministic transmission of samples, retransmission should be allowed only within a specified time interval. To configure this, it would probably be possible to use "lifespanQosPolicy".
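The rule "retransmit only within a specified time interval" boils down to an age check on each retransmit request. The sketch below is hypothetical: `lifespan_ns` merely stands in for whatever value a lifespan-style QoS setting would provide, and no claim is made that Cyclone DDS applies the lifespan QoS this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decide whether a sample may still be retransmitted. Times are nanoseconds
   on a monotonic clock; `lifespan_ns` is a hypothetical per-writer setting. */
static bool may_retransmit(int64_t written_at_ns, int64_t now_ns, int64_t lifespan_ns)
{
  assert(now_ns >= written_at_ns);
  return (now_ns - written_at_ns) < lifespan_ns;
}
```

A writer using such a check would answer a retransmit request for an expired sample by telling the reader that the sample is no longer available, rather than resending it.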

Hi @i-and. I do not know when all of the points on your list are going to be worked on, but FreeRTOS+lwIP is currently number one on my list. I'm combining this with some restructuring of the abstraction layer so that future ports and running on a FreeRTOS+lwIP simulator become a lot easier. For Cyclone DDS we are aiming to make FreeRTOS+lwIP (hopefully FreeRTOS's native stacks FreeRTOS+UDP and FreeRTOS+TCP too at a later stage) a supported platform. The simulator will be used to verify compatibility with every pull request.

eclipse-cyclonedds / cyclonedds

About dynamic memory (de)allocation #99