eclipse-cyclonedds / cyclonedds

Eclipse Cyclone DDS project
https://projects.eclipse.org/projects/iot.cyclonedds

About dynamic memory (de)allocation #99

Open i-and opened 5 years ago

i-and commented 5 years ago

Hi! Is it possible to operate in a mode with no dynamic memory use at run time? That is, resources (memory) are allocated up to the maximum (in accordance with the QoS/configuration) during application initialization, and from then on, during discovery and while receiving/sending messages over the network, calls to malloc(), realloc() and free() are excluded. This functionality is important for avoiding CPU-intensive dynamic memory operations and for improving the real-time performance of an application that uses DDS.

k0ekk0ek commented 5 years ago

Hi @i-and. Unfortunately no. At this point it's not possible to run without dynamic memory allocation or do all allocations when the process is started. It is possible to tweak allocation behavior. Whether or not allocation will be a problem depends a bit on your data flow as ddsi allocates buffers to contain multiple samples at once and only allocates extra buffers if required. Therefore, if your data flow is limited, a single buffer allocation might just be enough. It's usually best to do some testing and see if behavior can be adjusted to match your needs. Of course, I'll be happy to help you out where I can.

For my information, and just out of curiosity: it seems like you're trying to run Cyclone DDS on a memory-constrained device. Can I ask what platform you're using? I'm currently porting Cyclone DDS to FreeRTOS+lwIP, so maybe there are some things you're running into that I can solve/use in the process.

eboasson commented 5 years ago

Hi @i-and, while @k0ekk0ek's comment is entirely correct, it restricts itself to what is rather than to what is possible with some effort. Making it possible to operate with no (or restricted) dynamic memory allocation is something I personally find worthwhile and interesting, and I think it is not as far away as one might think at first.

Transmit side:

Receive path:

Background activities (i.e., the tev thread):

Discovery:

Other assorted stuff:

So I think getting to static allocation — or very nearly there as a first step — hinges on but a few key things:

I'm really glossing over what exactly I mean by those allocators, but rather than sitting here typing one of the longest comments on GitHub ever, I'd like to point you to The Slab Allocator: An Object-Caching Kernel Memory Allocator (Bonwick, 1994) and Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources (Bonwick and Adams, 2001), two well-known papers originating in the Sun Microsystems kernel engineering group. The reasoning they describe applies here, too, and it will probably make clear what I mean even though it doesn't apply one-to-one. Do keep in mind that the memory allocator design business has evolved since then: Linux used a similar one but then improved on it, the behaviour of modern C library malloc is far better than the old ones, &c.
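To make the "allocate everything up front" idea concrete, here is a minimal sketch of a fixed-size object pool. It is far simpler than the slab and magazine designs in those papers, and it is not Cyclone DDS code: the names `pool_t`, `pool_init`, `pool_alloc` and `pool_free`, and the sizes, are assumptions made purely for illustration. It is also single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size object pool: all storage is reserved up front,
   so the hot path never calls malloc() or free(). */
#define POOL_OBJ_SIZE  64   /* bytes per object (assumed) */
#define POOL_OBJ_COUNT 4    /* worst-case number of live objects (assumed) */

typedef union pool_slot {
    union pool_slot *next;                 /* link while the slot is free */
    unsigned char payload[POOL_OBJ_SIZE];  /* storage while it is in use */
} pool_slot_t;

typedef struct pool {
    pool_slot_t slots[POOL_OBJ_COUNT];
    pool_slot_t *free_list;
} pool_t;

/* Thread every slot onto the free list; done once at start-up. */
static void pool_init(pool_t *p)
{
    assert(p != NULL);
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_OBJ_COUNT; i++) {
        p->slots[i].next = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* O(1) allocation; returns NULL when the pool is exhausted. */
static void *pool_alloc(pool_t *p)
{
    pool_slot_t *s = p->free_list;
    if (s != NULL)
        p->free_list = s->next;
    return s;
}

/* O(1) release; obj must have been returned by pool_alloc() on this pool. */
static void pool_free(pool_t *p, void *obj)
{
    pool_slot_t *s = obj;
    s->next = p->free_list;
    p->free_list = s;
}
```

The point of such a structure is that exhaustion becomes an explicit, bounded condition (a NULL return) that can be sized against the QoS settings, rather than an unbounded call into the general-purpose allocator.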

Not all of this is high on my priority list. What I am after is a very high performance implementation with great freedom in how the various aspects are implemented, and that requires most of the above. I need that freedom to slowly move away from the constraints imposed by the rather limited thinking behind the DDS specification; and that performance to allow providing a good DDS emulation on top of it. To me, the goal is not to build the best DDS implementation (no matter what I found at the very beginning of the README when I first saw it 😁), but to try out some closely related ideas I developed over the years that should form a better basis for building fault-tolerant distributed reactive systems — buzz-wordism! though they are old buzz-words and do date me a bit ...

Anyway, just wanted to give you some ideas on where things can head. You're very much invited to help!

i-and commented 5 years ago

@k0ekk0ek and @eboasson, thanks for your detailed answers and offers of help.

I have not yet decided on the target platform for running Cyclone DDS. A Cortex-M4 at 180 MHz or a Cortex-M7 at 200/400 MHz, with up to 1 MB of RAM and a 100 Mb Ethernet controller, is currently under consideration. In this case, the load of the Ethernet channel with application data is 60%. Is it your goal to run the Cyclone DDS stack on the above limited resources?

> Not all of this is high on my priority list.

Can you clarify your roadmap for the implementation of these optimizations for working with dynamic memory?

The launch of Cyclone DDS on processor resources of the "microcontroller" type will allow it to be used in distributed microcontroller systems for real-time control tasks (after the implementation of RT-Ethernet extensions at the MAC-level at the next stage). It will be a fantastic result )

eboasson commented 5 years ago

Hi @i-and ,

> A Cortex-M4 at 180 MHz or a Cortex-M7 at 200/400 MHz, with up to 1 MB of RAM and a 100 Mb Ethernet controller, is currently under consideration.

Some time ago @k0ekk0ek and I did a quick experiment on a Cortex M4 at 180MHz, FreeRTOS and lwIP and that worked. That experiment is being turned into proper support for that platform by @k0ekk0ek and so one hurdle should be cleared pretty soon.

> In this case, the load of the Ethernet channel with application data is 60%.

It depends very much on the size of samples, as the protocol overhead is quite significant for small ones and the processing overhead must not be neglected either. To give some idea, I've been playing a bit with a pair of RPi3s (Raspbian GNU/Linux 8.0) and I need to use samples ≥ ~100 bytes to get a payload rate ≥ 60Mbps.

The CPU load in that case is close to 100% of a single core; in other words, an M4 would not keep up with that. At 1400 bytes, CPU load goes down to ~25%, still a bit on the high side, but the kind of changes needed to eliminate dynamic allocation will likely improve that.
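A back-of-the-envelope way to see why sample size matters so much: for a fixed payload rate, the number of samples the stack has to process per second is inversely proportional to the sample size. The helper below is only an illustration (the name `samples_per_second` is made up here), and it deliberately ignores protocol headers, so it understates the real cost of small samples.

```c
#include <assert.h>

/* Samples per second needed to sustain a payload rate (in bits per second)
   with samples of the given payload size (in bytes). Headers are ignored. */
static double samples_per_second(double payload_bps, double sample_bytes)
{
    assert(sample_bytes > 0.0);
    return payload_bps / (8.0 * sample_bytes);
}
```

At 60 Mbps of payload this gives 75,000 samples per second for 100-byte samples and roughly 5,357 per second for 1400-byte samples, so the per-sample work (including any allocation per sample) is done about 14 times less often in the second case.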

As things stand today, you'd be pushing it a bit, but given some time, it doesn't seem an unrealistic target for performance.

> up to 1 MB RAM

That, too, seems a bit tight, but memory use is also very much dependent on the scale of the system. I would suggest looking at your intended application and working out how many nodes, processes, readers and writers, and even instances you expect. I don't know offhand how much memory that would translate into, but that is something one can determine through analysis or experimentally.

> Is it your goal to run the Cyclone DDS stack on the above limited resources?

Yes, I think there is a lot of value in being able to run under these conditions.

> Can you clarify your roadmap for the implementation of these optimizations for working with dynamic memory?

Give me a few days — I've promised to strive for a first official release by March 1 and I really want to get a release plan/roadmap to go along with that, and this is something to include.

i-and commented 5 years ago

Hi @eboasson,

> It depends very much on the size of samples, as the protocol overhead is quite significant for small ones and the processing overhead must not be neglected either. At 1400 bytes, CPU loads go down to ~ 25%, still a bit on the high side...

Do you have data on which part of the system is experiencing the most performance loss when working under these conditions: the writer or reader, the network stack, the serializer, or somewhere else?

Would a transfer method that batches samples in a single Ethernet frame improve the situation? The actual transfer can be triggered by the buffer filling up, by elapsed time, or by an explicit flush(). Do you plan to implement this kind of batching/coalescing?

eboasson commented 5 years ago

> Do you have data on which part of the system is experiencing the most performance loss when working under these conditions: the writer or reader, the network stack, the serializer, or somewhere else?

I do have some: but it is on macOS — Apple's profiling tools are quite a bit nicer than Linux "perf", not that it has made the macOS kernel very fast ... — on an i7 and over the loopback interface. So a rather different environment ... the numbers I mentioned above came from a quick experiment using an RPi3, so a bit better matched to your target.

Quite apart from that mismatch, I realised that most of the measurements so far have been using the throughput example, meaning reliable, KEEP_ALL writers and readers and a dynamically allocated sequence for payload. For the environment you target, I would expect KEEP_LAST readers/writers, fixed-size samples, and quite possibly using best-effort.

So a little bit of measuring is in order. I don't think what you need is out of reach, but I do think it will take some work.

> Would a transfer method that batches samples in a single Ethernet frame improve the situation? The actual transfer can be triggered by the buffer filling up, by elapsed time, or by an explicit flush(). Do you plan to implement this kind of batching/coalescing?

It does batch samples in an Ethernet frame. The current batching is really simplistic: you decide whether to batch or not to batch, and if you do batch, then the frames go out when they are full or when explicitly flushed.
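The "full or explicitly flushed" policy can be sketched in a few lines. This is a toy illustration and not the Cyclone DDS implementation: every name here (`batcher_t`, `batcher_write`, `batcher_flush`, `stats_sink`) is hypothetical, samples are copied raw without any protocol headers, and "full" is approximated as "the next sample would not fit". `FRAME_CAPACITY` assumes a 1500-byte MTU minus 20 bytes of IPv4 header and 8 bytes of UDP header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FRAME_CAPACITY 1472  /* assumed UDP payload budget per Ethernet frame */

/* Callback that puts one finished frame on the wire (hypothetical). */
typedef void (*send_fn)(const unsigned char *data, size_t len, void *arg);

typedef struct batcher {
    unsigned char frame[FRAME_CAPACITY];
    size_t used;
    send_fn send;
    void *send_arg;
} batcher_t;

static void batcher_init(batcher_t *b, send_fn send, void *arg)
{
    assert(b != NULL && send != NULL);
    b->used = 0;
    b->send = send;
    b->send_arg = arg;
}

/* Send whatever is pending; a no-op on an empty frame. */
static void batcher_flush(batcher_t *b)
{
    if (b->used > 0) {
        b->send(b->frame, b->used, b->send_arg);
        b->used = 0;
    }
}

/* Append one sample; the pending frame goes out first if the sample
   would not fit. Returns 0 on success, -1 if it can never fit. */
static int batcher_write(batcher_t *b, const void *sample, size_t len)
{
    if (len > FRAME_CAPACITY)
        return -1;
    if (b->used + len > FRAME_CAPACITY)
        batcher_flush(b);
    memcpy(b->frame + b->used, sample, len);
    b->used += len;
    return 0;
}

/* Stand-in for the network: just counts frames and bytes sent. */
typedef struct frame_stats { size_t frames, bytes; } frame_stats_t;

static void stats_sink(const unsigned char *data, size_t len, void *arg)
{
    frame_stats_t *st = arg;
    (void)data;
    st->frames++;
    st->bytes += len;
}
```

With 600-byte samples, for example, two samples share a frame and the third write pushes the first frame out; a trailing `batcher_flush()` sends the remainder. A time-based trigger would add a third condition to the same structure.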

Time-based (basically properly implementing the "latency budget" QoS) is definitely planned. There are various other improvements that can be made and that I am totally in favour of (zero-copy networking is a pretty obvious one).

i-and commented 5 years ago

> There are various other improvements that can be made and that I am totally in favour of (zero-copy networking is a pretty obvious one).

Experience has shown that, to keep CPU load acceptable (around 10% on an Intel desktop) when transferring a large stream of regular data at 80 MB per second over Gigabit Ethernet (for example, streaming uncompressed video), the samples (image lines) must be passed from the application layer to the Ethernet controller without additional data copying. The application layer writes each sample (line) into a buffer that is returned by the stack and that is in fact a properly formed MAC/IP/UDP/RTP frame with all the necessary headers (the regular OS network stack is not used). After all the lines of the image have been generated, the Ethernet controller's DMA is started and the lines are put on the wire according to the configured hardware scheduler (for example, the AVB credit-based shaper).

Hence the question: how realistic is it, in principle, to transmit and receive this kind of traffic using DDS while keeping CPU load (Intel desktop) at about 10%? Or is a transfer scheme like the one described above not possible, so that a separate specialized low-level library is needed to form the stream of sample lines over the RTPS protocol (with additional media synchronization mechanisms)? Or is DDS not intended for this mode of operation, in which case it is advisable to use specialized protocols such as RTP, GigE Vision, IEEE 1722, ... instead of RTPS? @eboasson, what do you recommend?

eboasson commented 5 years ago

@i-and, the DDS-based projects that I personally know of where such amounts of data were transferred commonly opted for using a dedicated transfer mechanism — including PCIe — for those raw data streams, while using DDS to handle everything else, including the configuration information needed for that dedicated mechanism.

That said, that was done with lesser performing implementations, and I don't think Cyclone is already at the edge of its performance. Moreover, the model you describe is one that closely matches the behaviour of a simple unreliable writer/reader pair in DDS and there most of the overhead (however much it is 😁) comes from sharing all the paths with the reliable communications, and just skipping some bits.

I do think something along the lines you sketch is possible in the context of DDS, at least for the simple case of a single best-effort sample per packet. Most of the pieces of the puzzle should fit as far as I can tell now; indeed, one could experiment with it even in the current state of Cyclone. The buffer management could be handled by a special transport implementation — to have a special one is only reasonable, considering that you're expressly avoiding the normal networking stack —  combined with a special internal sample representation that uses the buffers that come from the transport instead of allocating their own. The part where some real changes would be required would be the generation of message headers: now they are allocated dynamically and point to the sample, but for this to work, it would have to write them inside the buffer already holding the sample data.

Constructing a fast path exclusively for non-fragmented, non-packed, send-and-forget data (one that avoids all dynamic allocation and can write the protocol headers inside some buffer space provided by the sample buffer) seems eminently feasible. I am quite positive that the writing side allows this without changing a lot in the way you interact with DDS. Instead of calling "write" and passing it a pointer to the data, you'd probably have to obtain a sample buffer, fill it and pass it to a different write function, but that's not rocket science.
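One way to picture "write the protocol headers inside buffer space provided by the sample buffer" is a buffer with reserved headroom in front of the payload. The sketch below is a guess at the shape of such an interface, not an existing Cyclone DDS API: `sample_buf_t`, `sample_buf_payload`, `sample_buf_finish` and both size constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HDR_ROOM    64    /* assumed worst-case protocol header size */
#define MAX_PAYLOAD 1400  /* assumed largest non-fragmented sample */

/* A sample buffer with headroom reserved in front of the payload. */
typedef struct sample_buf {
    unsigned char bytes[HDR_ROOM + MAX_PAYLOAD];
    size_t payload_len;   /* set by the application after filling */
} sample_buf_t;

/* Where the application writes its sample data. */
static unsigned char *sample_buf_payload(sample_buf_t *sb)
{
    return sb->bytes + HDR_ROOM;
}

/* Place the header directly in front of the payload so that header and
   data form one contiguous packet, without moving the payload.
   Returns the start of the packet and stores its length in *pkt_len. */
static const unsigned char *sample_buf_finish(sample_buf_t *sb,
                                              const void *hdr, size_t hdr_len,
                                              size_t *pkt_len)
{
    assert(hdr_len <= HDR_ROOM && sb->payload_len <= MAX_PAYLOAD);
    unsigned char *start = sb->bytes + HDR_ROOM - hdr_len;
    memcpy(start, hdr, hdr_len);
    *pkt_len = hdr_len + sb->payload_len;
    return start;
}
```

The application would obtain a `sample_buf_t` from a preallocated set, fill the region returned by `sample_buf_payload()`, set `payload_len`, and hand the buffer to the alternative write call, which only has to prepend the header before passing the contiguous packet to the transport.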

On the receive side, I'm sure there are some issues to consider as well; I haven't thought much about those yet. One important observation is that the "loan" mechanism means the basic elements are available for keeping the samples in receive buffers and returning pointers to these.

So I am pretty sure it can be done with a reasonable amount of effort and without adding significant complications to the existing inner workings of the implementation. The gains are obvious: a unified mechanism ... I'd say it is worth further investigation.

i-and commented 5 years ago

Hi @eboasson, your technical vision of the issues under consideration inspires! Summarizing the discussion, it would be nice if the roadmap for the development of Cyclone DDS would include the following:

A reliable transmission mode for the fast path would be very useful, for example when transferring a packed video stream, since in this case the loss of one sample leads to the loss of the entire image frame or of a whole sequence of frames. At the same time, to ensure deterministic transmission of samples, retransmission should be allowed only within a specified time interval. "lifespanQosPolicy" could probably be used to configure this.
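The time-bounded retransmission rule could be as small as the check below. It is a sketch of the proposal only: the function name and the idea of taking the window from a lifespan-like setting are assumptions, and nothing here reflects how Cyclone DDS actually handles retransmit requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Honour a retransmit request only while the sample is younger than the
   configured window; all times are in nanoseconds on the same clock. */
static bool may_retransmit(int64_t now_ns, int64_t written_ns, int64_t window_ns)
{
    assert(window_ns >= 0 && now_ns >= written_ns);
    return now_ns - written_ns <= window_ns;
}
```

A writer applying this rule would drop a retransmit request for any sample older than the window, which bounds how long a lost sample can delay the stream.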

k0ekk0ek commented 5 years ago

Hi @i-and. I do not know when all of the points on your list are going to be worked on, but FreeRTOS+lwIP is currently number one on my list. I'm combining this with some restructuring of the abstraction layer so that future ports and running on a FreeRTOS+lwIP simulator become a lot easier. For Cyclone DDS we are aiming to make FreeRTOS+lwIP (hopefully FreeRTOS's native stacks FreeRTOS+UDP and FreeRTOS+TCP too at a later stage) a supported platform. The simulator will be used to verify compatibility with every pull request.