eclipse-ecal / ecal

📦 eCAL - enhanced Communication Abstraction Layer. A high-performance publish-subscribe, client-server, cross-platform middleware.
https://ecal.io
Apache License 2.0

eCAL 5.7.1 & iceoryx losing messages if having multiple subscribers #92

Closed. Philip-Kovacs closed this issue 3 years ago.

Philip-Kovacs commented 3 years ago

Hello!

I've built eCAL 5.7.1 with iceoryx using the following cmake command:

cmake .. -DCMAKE_BUILD_TYPE=Release -DECAL_THIRDPARTY_BUILD_PROTOBUF=ON -DECAL_THIRDPARTY_BUILD_CURL=OFF -DECAL_THIRDPARTY_BUILD_HDF5=ON -DHAS_CAPNPROTO=ON -DBUILD_APPS=OFF -DBUILD_SAMPLES=ON -DBUILD_TIME=ON -DECAL_LAYER_ICEORYX=ON

Running ecal_sample_latency_snd and ecal_sample_latency_rec_cb along with RouDi gives fine results; however, the samples that run multiple publisher and subscriber instances produce the following output:

publisher:

sudo ./ecal_sample_multiple_snd
CUDPSender: Setting TTL failed: Invalid argument
CUDPSender: Setting TTL failed: Invalid argument
2020-09-30 17:11:01.161 [ Info  ]: Application registered management segment 0x7f5d3e8000 with size 113341568 to id 1
2020-09-30 17:11:01.161 [ Info  ]: Application registered payload segment 0x7f39c39000 with size 595259200 to id 2
...
pub  109:       3199 Msg/s
pub  110:       3199 Msg/s
pub  111:       3199 Msg/s
pub  112:       3199 Msg/s
pub  113:       3199 Msg/s
pub  114:       3199 Msg/s
pub  115:       3199 Msg/s
...
Sum:          639964  Msg/s
Sum:             639 kMsg/s
Sum:               0 MMsg/s

receiver:

./ecal_sample_multiple_rec_cb
CUDPSender: Setting TTL failed: Invalid argument
CUDPSender: Setting TTL failed: Invalid argument
create subscribers ..
2020-09-30 17:10:53.850 [ Info  ]: Application registered management segment 0x7f693e8000 with size 113341568 to id 1
2020-09-30 17:10:53.851 [ Info  ]: Application registered payload segment 0x7f45c39000 with size 595259200 to id 2
...
sub  109:          0 Msg/s
sub  110:          0 Msg/s
sub  111:          0 Msg/s
sub  112:       3199 Msg/s
sub  113:          0 Msg/s
sub  114:          0 Msg/s
sub  115:          0 Msg/s
...
Sum:            3200  Msg/s
Sum:               3 kMsg/s
Sum:               0 MMsg/s

I suppose all subscribers should receive their corresponding messages, not just one of them. I also tried another sample code:

// create 2 publishers
eCAL::CPublisher pub1("foo1", "std::string");
eCAL::CPublisher pub2("foo2", "std::string");

// sending "hello world" on 2 different topics
while(eCAL::Ok())
{
  pub1.Send("hello");
  eCAL::Process::SleepMS(1000);
  pub2.Send("world");
}

// subscriber side: define a subscriber callback function
void OnReceive(const char* topic_name_, const std::string& message_)
{
  printf("We received %s on topic %s\n.", message_.c_str(), topic_name_);
}

// create 2 subscribers
eCAL::string::CSubscriber sub1("foo1");
eCAL::string::CSubscriber sub2("foo2");

// register subscriber callback function
auto callback = std::bind(OnReceive, std::placeholders::_1, std::placeholders::_2);
sub1.AddReceiveCallback(callback);
sub2.AddReceiveCallback(callback);

// idle main thread
while(eCAL::Ok())
{
  // sleep 100 ms
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

In this case, only sub2 is triggered, printing 'foo2' in the output. Swapping the sub1 declaration with sub2, the output changes to 'foo1'. With even more subscribers, only the last declared one triggers. However, splitting each subscriber into a separate process and running them in parallel, everything works as intended. But I need to have them all in one process.
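
For reference, the subscriber part of that snippet as a minimal self-contained program might look like this (a sketch assuming the eCAL 5 string message API; the eCAL::Initialize/Finalize calls were omitted from the snippet above):

#include <ecal/ecal.h>
#include <ecal/msg/string/subscriber.h>

#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

// subscriber callback, as in the snippet above
void OnReceive(const char* topic_name_, const std::string& message_)
{
  printf("We received %s on topic %s.\n", message_.c_str(), topic_name_);
}

int main(int argc, char** argv)
{
  // initialize the eCAL API
  eCAL::Initialize(argc, argv, "multiple subscriber test");

  // create 2 subscribers in one process
  eCAL::string::CSubscriber<std::string> sub1("foo1");
  eCAL::string::CSubscriber<std::string> sub2("foo2");

  // bind the 2-argument callback (any extra arguments passed by eCAL,
  // such as the receive time, are ignored by std::bind)
  auto callback = std::bind(OnReceive, std::placeholders::_1, std::placeholders::_2);
  sub1.AddReceiveCallback(callback);
  sub2.AddReceiveCallback(callback);

  // idle the main thread
  while (eCAL::Ok())
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

  eCAL::Finalize();
  return 0;
}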

The ecal.ini file(s) were left at their defaults, containing:

network_enabled = true
inproc_rec_enabled = true
shm_rec_enabled = true
udp_mc_rec_enabled = true
npcap_enabled = false
use_inproc = 0
use_shm = 2
use_udp_mc = 2

In any case, changing the ecal.ini content does not make the situation described above go away.

What am I doing wrong? I would appreciate it if anyone could help me with this issue.

Philip-Kovacs commented 3 years ago

Sorry, one more important thing to know: I used iceoryx version 0.16.1 in this build.

budrus commented 3 years ago

For iceoryx it normally should make no difference whether it is one process or many. The question is whether there is a difference between intra- and inter-process communication in eCAL. A thing that could help in analyzing this is the iceoryx introspection, which shows you the processes, publishers, subscribers and their connections. Check here for more info.

rex-schilasky commented 3 years ago

Thank you for the bug report. We have not yet tested that setup in all use cases; the iceoryx binding is still experimental. I will reproduce it and try to figure out the problem. You can use the default ecal.ini file if you link against iceoryx, there is no need to change anything here. Your use case is using "intra-process" communication only, even though there are two subscribers in the second process. I would recommend working with the "standard" eCAL shared memory layer for now until the issue is fixed. For small payloads (< 256 kB) there is no performance difference.
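
One way to do that, based on the configure call from the first post, should be to rebuild without the -DECAL_LAYER_ICEORYX=ON flag, which falls back to eCAL's built-in shared memory transport:

cmake .. -DCMAKE_BUILD_TYPE=Release -DECAL_THIRDPARTY_BUILD_PROTOBUF=ON -DECAL_THIRDPARTY_BUILD_CURL=OFF -DECAL_THIRDPARTY_BUILD_HDF5=ON -DHAS_CAPNPROTO=ON -DBUILD_APPS=OFF -DBUILD_SAMPLES=ON -DBUILD_TIME=ON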

Philip-Kovacs commented 3 years ago

Thank you for the replies. I rebuilt the ecal_sample_multiple_rec_cb and ecal_sample_multiple_snd apps to have 3 subscribers and 3 publishers (with ids 0, 1, 2), and icecrystal shows a subscription for only publisher no. 2.

[screenshot: icecrystal introspection output]

Using the standard eCAL shared memory layer gives proper results; all subscribers receive their messages.

rex-schilasky commented 3 years ago

So multiple subscribers do not work in the same process (in the case of the iceoryx binding); that seems to be the general issue, right?

Philip-Kovacs commented 3 years ago

Indeed, only the last declared one seems to be operating. I'm running on Ubuntu 16.04, the architecture is aarch64, and the gcc version is 5.4.0.

budrus commented 3 years ago

In the introspection it looks like the other subscribers are not created. If you start RouDi with debug log level (iox-roudi -l debug), you see a printf for every subscriber that is created: "Created new ReceiverPortImpl for application...". This should appear three times in this example. If it is printed only once, then I would assume that the eCAL layer does not create the other subscribers.

rex-schilasky commented 3 years ago

I am currently looking into the eCAL iceoryx reader interface, and it is most likely buggy. There seems to be one instance overwriting another. A silly issue so far ... We only made some basic tests with the iceoryx binding and did not use it productively, because it still misses some features. I will fix this now anyway.
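
To illustrate that bug class (a hypothetical sketch, not the actual eCAL reader code): if the layer keeps the receive target in one shared slot, every newly created reader instance overwrites the previous one, so only the last subscriber ever gets data:

#include <functional>
#include <string>
#include <utility>

// hypothetical reader with the described defect: the callback slot is
// (wrongly) shared by all instances instead of being a per-instance member
struct Reader
{
  static std::function<void(const std::string&)> callback;  // one slot for all

  explicit Reader(std::function<void(const std::string&)> cb)
  {
    callback = std::move(cb);  // the last constructed reader wins
  }
};

std::function<void(const std::string&)> Reader::callback;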

rex-schilasky commented 3 years ago

The issue is fixed on the current master. @Philip-Kovacs, can you please confirm that it works?

Philip-Kovacs commented 3 years ago

Thank you for the quick fix. I upgraded my build to version 5.7.2, with iceoryx 0.17. The multiple sender-receiver sample now runs as expected (every instance receives); however, it prints the following output several times, and many messages are lost:

ICEORYX ERROR!

Mempool [m_chunkSize = 16448, numberOfChunks = 1000, used_chunks = 998 ] has no more space left
MemoryManager: unable to acquire a chunk with a payload size of 1088
The following mempools are available:
  MemPool [ ChunkSize = 192, PayloadSize = 128, ChunkCount = 10000 ]
  MemPool [ ChunkSize = 1088, PayloadSize = 1024, ChunkCount = 5000 ]
  MemPool [ ChunkSize = 16448, PayloadSize = 16384, ChunkCount = 1000 ]
  MemPool [ ChunkSize = 131136, PayloadSize = 131072, ChunkCount = 200 ]
  MemPool [ ChunkSize = 524352, PayloadSize = 524288, ChunkCount = 50 ]
  MemPool [ ChunkSize = 1048640, PayloadSize = 1048576, ChunkCount = 30 ]
  MemPool [ ChunkSize = 4194368, PayloadSize = 4194304, ChunkCount = 10 ]
Senderport [ service = "eCAL", instance = "", event = PUB_0 ] is unable to acquire a chunk of with payload size 1088
POSH__SENDERPORT_ALLOCATE_FAILED

This issue occurs with iceoryx 0.16.1 as well.

rex-schilasky commented 3 years ago

Hi, I fixed another issue with multiple subscribers in the same process in the case of the same topic name, reported by @budrus. This will be merged into the master soon.

However, that will not change the behaviour mentioned in your last comment. Maybe @budrus can check the iceoryx log messages.

How many publishers and subscribers did you run in your multiple send and multiple receive setup? These samples are normally used to check the performance of lots of connections at maximum send speed; they should stress the transport layer to the maximum. Maybe something has to be preconfigured in the iceoryx toml configuration file to handle that many pubs/subs?

Philip-Kovacs commented 3 years ago

Yes, maybe you're right that the layer is being overloaded. I first ran the sample with the original binaries with 200 publishers, then reduced this number to 3. The error was still raised for all three publishers. Anyway, I will do some more tests.

rex-schilasky commented 3 years ago

[screenshot: publisher-side output of the test described below]

I checked my setup on Ubuntu with the iceoryx layer and 10 publications and 10 subscriptions. If I run them at maximum speed, I get the same error as you after a few seconds of runtime on the publisher side. With a sleep of 1 ms added after every 10 send actions, the error does not occur. It seems to me that chunks are not released when sending at maximum speed, or that in general there is still an issue on the publisher side and sending slower just raises the error later.
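
A sketch of such a throttled send loop (illustrative code, not the actual sample source; 'publishers' and 'payload' are assumed to be set up elsewhere):

#include <ecal/ecal.h>

#include <chrono>
#include <memory>
#include <string>
#include <thread>
#include <vector>

void SendThrottled(std::vector<std::unique_ptr<eCAL::CPublisher>>& publishers,
                   const std::string& payload)
{
  long long sends = 0;
  while (eCAL::Ok())
  {
    for (auto& pub : publishers)
    {
      pub->Send(payload);
      // sleep 1 ms after every 10 send actions so the subscribers can
      // keep up and their chunks get released in time
      if (++sends % 10 == 0)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }
}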

Philip-Kovacs commented 3 years ago

Hello! I've put a delay of 0.1 to 1 ms after every send action (3 publishers). With the delay, the communication wasn't disrupted, the same as you wrote. When I shut down the receiver side, there is a burst of this error on the publisher side for about a second, then it goes back to a normal idle state.

budrus commented 3 years ago

This error comes when the memory pool runs out of chunks. We currently provide a segregated free-list approach, where you have a configured number of mempools, each with a chunk size and a number of chunks. As we are doing true zero-copy and no additional memory allocation during runtime, this configuration gives you the number of available memory chunks. If all of them are used, you end up with this error. By default the configuration in /etc/iceoryx is used. You can provide another config via the command line. See here for details.

Another important point is the queues on the subscriber side. You can configure the queue size for a subscriber via a c'tor parameter. If it is not provided, a default is taken. This default can be configured via the CMake option IOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY. See here for details. I guess eCAL neither provides a queue size nor changes the default, so the default of 256 is used.
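
For example, the default could presumably be lowered when building iceoryx by passing the option named above on the cmake command line (the value 64 is purely illustrative):

cmake .. -DIOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY=64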

I guess the problem you have now is that you are sending with multiple publishers as fast as possible. According to your configuration above, you have 1000 chunks for a 16 KB payload. If your subscribers cannot consume as fast as your publishers provide new data, we start queueing up. When the queue capacity is reached, we start dropping the oldest samples. We currently have no back pressure from subscribers to publishers, i.e. we do not block the publisher if there is no more chunk or a queue overflows; you end up with an error or with lost chunks. Maybe we will provide the possibility to block the publisher until chunks are available or queues have free space in the future, but currently this was not our use case. But I see that for such setups it might be the better option, even if it has negative effects on publisher timings.

So I would assume that your samples start queueing up, and if you have multiple publishers connected to multiple receivers with a queue size of 256, you reach the point where your 1000 chunks are not enough. So what to do? The queue size can be reduced; then you may start losing samples earlier, but you also need fewer chunks in total. Or you can increase the number of chunks in your 16 KB mempool. Or do a combination of these two measures. Our philosophy is currently to not block the publisher.
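
For the second measure, a custom RouDi config could look roughly like this (a sketch following the schema of iceoryx's example TOML config, passed to RouDi via the command line option mentioned above; note that a custom config replaces the defaults, so all needed mempools have to be listed, only the enlarged 16 KB pool is shown here):

[general]
version = 1

[[segment]]

# 16 KB payload mempool, enlarged from 1000 to 4000 chunks
[[segment.mempool]]
size = 16384
count = 4000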

The sleeps you guys are adding are one way to ensure that the publishers do not provide samples faster than the subscribers can consume them. I guess eCAL uses callbacks to consume the samples of iceoryx subscribers. The question for eCAL would be whether, and how many, samples shall be queued if new samples arrive faster than they can be consumed.

rex-schilasky commented 3 years ago

@budrus, thank you for that detailed explanation. From my point of view this issue is fixed; to run specific scenarios, iceoryx needs to be configured the right way (as you described).