eclipse-iceoryx / iceoryx

Eclipse iceoryx™ - true zero-copy inter-process-communication
https://iceoryx.io
Apache License 2.0
Segmentation fault when building with different value for IOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY #2228

Closed MartinCornelis2 closed 7 months ago

MartinCornelis2 commented 8 months ago

Ubuntu 22.04 ROS2 iron

git clone --branch v2.0.5 https://github.com/eclipse-iceoryx/iceoryx.git

I ran into an issue when building iox-roudi from source to change some of the settings described here https://github.com/eclipse-iceoryx/iceoryx/blob/master/doc/website/advanced/configuration-guide.md

I will outline the entire flow of my attempts with observed error messages to show how I tried to resolve the issues I faced and where I eventually ran into a segmentation fault.

Attempt 1:

cmake -Bbuild -Hiceoryx_meta
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
export CYCLONEDDS_URI=file://$PWD/cyclonedds_shared_memory_ros2.xml

where 'cyclonedds_shared_memory_ros2.xml' contains

<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/iceoryx/etc/cyclonedds.xsd">
    <Domain id="any">
        <Discovery>             
            <ParticipantIndex>none</ParticipantIndex>         
        </Discovery> 
        <SharedMemory>
            <Enable>true</Enable>
            <LogLevel>info</LogLevel>
        </SharedMemory>
    </Domain>
</CycloneDDS>

cd iceoryx/build
cmake --build .
./iox-roudi -c /opt/ros/iron/etc/roudi_config_example.toml

Issues observed:

TOO_MANY_CHUNKS_HELD_IN_PARALLEL - could not take sample
2024-03-21 10:34:20.763 [Warning]: Out of publisher ports! Requested by runtime 'iceoryx_rt_77260_1711013659128840638'

Attempt 2:

Fix the publisher limit by repeating previous steps with the following adjustment:

cmake -Bbuild -Hiceoryx_meta -DIOX_MAX_PUBLISHERS=4096 -DIOX_MAX_SUBSCRIBERS=4096

Publisher limit fixed, but history capacity is still being exceeded!

2024-03-21 10:43:51.650 [Warning]: Chunk history request exceeds history capacity! Request is 100. Capacity is 1.

Attempt 3:

Fix the history limit by repeating previous steps with the following adjustment:

cmake -Bbuild -Hiceoryx_meta -DIOX_MAX_PUBLISHERS=4096 -DIOX_MAX_SUBSCRIBERS=4096 -DIOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=100 -DIOX_MAX_CHUNKS_HELD_PER_SUBSCRIBER_SIMULTANEOUSLY=100

Now when I start roudi even a simple ros2 topic list results in a segmentation fault and similarly everything breaks down when trying to run our stack.

martin@ubuntu-jammy:~/ros/iron/system/src/iceoryx$ ros2 topic list
Segmentation fault

Please let me know if this is indeed a bug or if I am misunderstanding https://github.com/eclipse-iceoryx/iceoryx/blob/master/doc/website/advanced/configuration-guide.md. If I am supposed to make other changes (e.g. increase the mempool, change CYCLONEDDS_URI, or something else that I am not aware of), then this is not immediately clear from the tutorials.

elfenpiff commented 8 months ago

@MartinCornelis2 could you please recompile iceoryx with your current setup, but in debug mode, and show us the backtrace of ros2 topic list?

I think your configuration is valid, but it would lead to a very large shared memory management segment, so maybe the segmentation fault occurs because you are running out of memory somehow. Could you try to run your setup with smaller port and chunk numbers?

Could you tell us a bit more about your use case, why you require 4096 publishers and subscribers, and why they have to hold 100 samples in parallel? Maybe together we can find a more suitable setup.

MartinCornelis2 commented 8 months ago

I think I can lower the number of ports, but it explicitly says Request is 100. Capacity is 1. Does that mean I have to change the history depth in the QoS settings of my stack to lower the number of requested chunks?

Can you also explain what is limiting the amount of memory that can be used? Right now I'm testing this on a laptop but we would like to run this on a robot as well where resources are more limited.

I thought I could increase the settings by a lot, since the guide mentions in a hint: "With the default values set, the size of iceoryx_mgmt is ~64.5 MByte", which seems very small.

PS: Should I create my own mempool configuration and if so, what things should I consider when creating one? Will everything defined in the mempool immediately be claimed or is it simply an upper bound?

elfenpiff commented 8 months ago

@MartinCornelis2

> I think I can lower the number of ports, but it explicitly says Request is 100. Capacity is 1. Does that mean I have to change the history depth in the QoS settings of my stack to lower the number of requested chunks?

Yes. The history you request must also fit in the underlying queue. Otherwise, if you request a history of 100 and the queue has a capacity of 10, you effectively get only the last 10 samples. We could adjust it on the iceoryx side so that you only get a warning and the 90 remaining samples are dismissed.
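The effect described above can be illustrated with a minimal sketch. It does not use the iceoryx API; a bounded `collections.deque` stands in for the subscriber queue, and the function name `deliver_history` is hypothetical.

```python
from collections import deque


def deliver_history(history, queue_capacity):
    """Push historical samples into a bounded queue.

    Once the queue is full, each new sample evicts the oldest one,
    so only the newest `queue_capacity` samples survive.
    """
    queue = deque(maxlen=queue_capacity)
    for sample in history:
        queue.append(sample)
    return list(queue)


# A requested history of 100 samples delivered into a queue of capacity 10.
received = deliver_history(list(range(100)), queue_capacity=10)
print(len(received), received[0], received[-1])  # 10 90 99
```

Only samples 90 to 99 remain, which matches the "last 10 samples" behaviour described in the reply.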

> Can you also explain what is limiting the amount of memory that can be used? Right now I'm testing this on a laptop but we would like to run this on a robot as well where resources are more limited.

The limit is the physically available memory in your machine. Depending on what kind of hardware you are aiming for, you may have much harsher restrictions. If you are going for a Raspberry Pi 5, you have 8 GB available, but iceoryx can also be deployed to hardware that has only 32 MB available. For this we have the small memory example: https://github.com/eclipse-iceoryx/iceoryx/tree/master/iceoryx_examples/small_memory

> I thought I could increase the settings by a lot, since the guide mentions in a hint: "With the default values set, the size of iceoryx_mgmt is ~64.5 MByte", which seems very small.

But it can grow very quickly. iceoryx is for safety-critical systems, therefore it claims all of its memory on startup. If you have a setup with 4096 publishers and 4096 subscribers, every publisher has an unused but allocated list of 4096 subscribers, since in theory one publisher could have 4096 subscribers. And there are now 4096 publishers, so you end up with a pre-allocated list of 4096 entries, 4096 times over.
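To put a number on that quadratic growth, here is a small back-of-the-envelope sketch. The entry count follows directly from the explanation above; the per-entry size of 8 bytes is purely an assumption for illustration and is not taken from the iceoryx sources.

```python
def preallocated_entries(max_publishers, max_subscribers):
    """Each publisher reserves room for every possible subscriber."""
    return max_publishers * max_subscribers


# ASSUMPTION: 8 bytes per list entry, chosen only to make the scale visible.
ENTRY_SIZE_BYTES = 8

entries = preallocated_entries(4096, 4096)
print(entries)                              # 16777216
print(entries * ENTRY_SIZE_BYTES / 2**20)   # 128.0 (MiB, under the assumption above)
```

Whatever the real entry size is, doubling both limits quadruples this part of the management segment.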

> PS: Should I create my own mempool configuration and if so, what things should I consider when creating one?

You can use the iceoryx introspection to see what kind of chunks you are using in your setup and configure the memory pool accordingly.

> Will everything defined in the mempool immediately be claimed or is it simply an upper bound?

It will be immediately claimed since iceoryx guarantees to you that the configured memory is available for your communication setup.
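Since the configured memory is claimed up front, it can help to estimate the payload footprint of a mempool configuration before deploying it. The sketch below assumes each mempool is described by a chunk size and a chunk count; the values are hypothetical, and per-chunk headers and management overhead are deliberately ignored.

```python
def payload_bytes(mempools):
    """Sum chunk_size * chunk_count over all configured mempools."""
    return sum(size * count for size, count in mempools)


# Hypothetical (chunk size in bytes, chunk count) pairs, not from a real config.
mempools = [(128, 10000), (1024, 5000), (16384, 1000)]

total = payload_bytes(mempools)
print(total)                          # 22784000
print(f"{total / 2**20:.1f} MiB")     # 21.7 MiB
```

The real segment will be somewhat larger than this estimate, so treat the result as a lower bound.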

FYI: We created the company ekxide.io (info@ekxide.io) to provide commercial support for iceoryx, in case you need more in-depth iceoryx support, features, etc.

elBoberido commented 7 months ago

@MartinCornelis2 just to be sure that you do not run into a stack problem. Did you try to set ulimit -s unlimited and run your setup?

MartinCornelis2 commented 7 months ago

As a sanity check I built with: cmake -Bbuild -Hiceoryx_meta -DIOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=2

With or without setting ulimit -s unlimited, I still get the segmentation fault.

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 62614
max locked memory           (kbytes, -l) 65536
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 62614
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
ros2 topic list
2024-03-26 09:36:28.035 [Warning]: RouDi not found - waiting ...
2024-03-26 09:36:38.057 [Warning]: ... RouDi found.
Segmentation fault
echo $CYCLONEDDS_URI
file:///home/martin/ros/iron/system/src/iceoryx/cyclonedds_shared_memory_ros2.xml
echo $RMW_IMPLEMENTATION
rmw_cyclonedds_cpp

NOTE: I am using podman; however, the setup has worked before when building without the -DIOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY flag. I did run ulimit -s unlimited both in the terminal where I started the podman container and within the container itself. I don't know much about memory management, so I thought I'd add this information in case it is relevant.

elBoberido commented 7 months ago

@MartinCornelis2 can you provide a backtrace, e.g. with something like

gdb --batch \
   --ex "shell printf '\n\033[33m#### Local Variables ####\033[m\n'"  --ex "info locals" \
   --ex "shell printf '\n\033[33m#### Threads ####\033[m\n'"          --ex "info threads" \
   --ex "shell printf '\n\033[33m#### Shared Libraries ####\033[m\n'" --ex "info sharedlibrary" \
   --ex "shell printf '\n\033[33m#### Stack Frame ####\033[m\n'"      --ex "info frame" \
   --ex "shell printf '\n\033[33m#### Register ####\033[m\n'"         --ex "info register" \
   --ex "shell printf '\n\033[33m#### Backtrace ####\033[m'"          --ex "thread apply all bt" \
   --core coreDumpFile binaryFile

MartinCornelis2 commented 7 months ago

Just as a sanity check, should it be possible to do: git clone --branch v2.0.5 https://github.com/eclipse-iceoryx/iceoryx.git to just get roudi from source while having everything else installed through apt?

This setup did work for changing the max amount of publishers, but maybe there is more that I have to build from source in the case of the history depth?

My colleague was able to reproduce the segmentation fault by building roudi from source with: cmake -Bbuild -Hiceoryx_meta -DIOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY=2 while having everything else (i.e. rmw_cyclonedds) apt installed.

elBoberido commented 7 months ago

@MartinCornelis2 no, you need to rebuild everything when you change the compile-time options. Some parts are shared between RouDi and the applications, and IOX_MAX_CHUNKS_ALLOCATED_PER_PUBLISHER_SIMULTANEOUSLY is one of them, since it changes the data structure in the shared memory. On the one hand, the application needs to know how many chunks it can store; on the other hand, RouDi needs to know how many chunks it needs to release in case the application terminates abnormally. Unfortunately, these are all compile-time constants, and with the C++-based iceoryx1 one needs to rebuild everything. We have changed this with iceoryx2 and plan to have a C binding with the 0.4 release. Maybe that is something that is interesting for you. It also does not require a central daemon like RouDi.
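Why a mismatched constant can corrupt shared memory can be shown with a minimal sketch. It uses Python's `ctypes` to mimic a C++ struct whose array size is a compile-time constant. The struct `PortData`, its fields, and the value 8 for the application side are hypothetical and are not iceoryx's real layout or defaults.

```python
import ctypes


def make_port_data(max_chunks):
    """Create a struct type whose array length is fixed at type creation,
    mimicking a compile-time constant baked into a binary."""
    class PortData(ctypes.Structure):
        _fields_ = [
            ("chunks_in_use", ctypes.c_uint64 * max_chunks),
            ("sequence_number", ctypes.c_uint64),
        ]
    return PortData


RouDiView = make_port_data(2)  # RouDi rebuilt with the constant set to 2
AppView = make_port_data(8)    # ASSUMPTION: application binaries built with 8

print(ctypes.sizeof(RouDiView), ctypes.sizeof(AppView))                  # 24 72
print(RouDiView.sequence_number.offset, AppView.sequence_number.offset)  # 16 64

# Both sides map the same bytes but disagree on where each field lives.
shared = bytearray(ctypes.sizeof(AppView))
writer = RouDiView.from_buffer(shared)
reader = AppView.from_buffer(shared)

writer.sequence_number = 42
print(reader.sequence_number)   # 0, the reader looks at a different offset
print(reader.chunks_in_use[2])  # 42, the value lands in the wrong field
```

In a real process, the misplaced value could be interpreted as an index or pointer, which is one plausible way such a mismatch ends in a segmentation fault.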

MartinCornelis2 commented 7 months ago

My apologies for any confusion I may have caused!

Would it be possible to give me a quick rundown of what I need to do to use cyclonedds and iceoryx together in a ROS2 iron setup (including what needs to be built from source together)?

I'm asking this, because I already got quite confused by the fact that rmw_iceoryx exists, but there is also an rmw_cyclonedds that uses iceoryx for shared memory and then there is also apparently an iceoryx2.

I'm afraid I'll end up googling the wrong things or trying to smash blocks together that are not supposed to fit together.

elBoberido commented 7 months ago

Unfortunately, I don't know the details of how to build Cyclone DDS with the iceoryx integration, but you need to build at least all of iceoryx from source. If Cyclone DDS uses iceoryx as shared libs, you might get away with just replacing the libs. If it uses iceoryx as a static lib, you need to rebuild Cyclone DDS as well.

elBoberido commented 7 months ago

@MartinCornelis2 since we now know the root cause of the segfault, can we close this issue?