eclipse-ecal / ecal

📦 eCAL - enhanced Communication Abstraction Layer. A high-performance publish-subscribe, client-server, cross-platform middleware.
https://ecal.io
Apache License 2.0

Intermittently `/dev/shm` files are not created (eCAL 5.10 and docker) #815

Open · bsirang opened this issue 1 year ago

bsirang commented 1 year ago

Describe the bug

I am running into an issue where intermittently my local eCAL applications are not using shared memory. Restarting my applications generally resolves the issue. This is usually observed when my applications launch on system start-up, but it is intermittent.

I am running my eCAL applications in separate docker containers, and I am using --network host --ipc host --pid host.

Also, I am trying to force shared-memory-only usage via the following ecal.ini settings:

[publisher]
use_inproc                = 0
use_shm                   = 2
use_tcp                   = 0
use_udp_mc                = 0

When the /dev/shm files are missing, the pub/sub data still appears to be flowing, so it's not as if the connections are broken.

To Reproduce

  1. Boot the system
  2. Observe eCAL applications have launched
  3. Check /dev/shm and observe it does not have files created for the publishers and subscribers
  4. Restart the applications
  5. Observe that a bunch of /dev/shm files are then created

Expected behavior

I expect the /dev/shm files to always be created, without an application restart being necessary.


Additional context

At this point, I'm more interested in learning the details of how eCAL chooses which transport to use, so that I can better debug issues like this. Given that use_shm is the only non-zero option in the .ini file, does eCAL still try to determine which transport to use dynamically in this case? Can someone point me to this logic in the eCAL source code?

I am wondering if there's some kind of race condition at play here. I believe I encounter this issue mostly on a cold system boot.

FlorianReimold commented 1 year ago

I cannot really say why this happens to you; I think we need to dig a little deeper. My current guess is that your Docker containers start before the OS has fully initialized the network.

On startup, an eCAL node joins a well-known UDP multicast group and listens on a well-known port. By default, this is 239.0.0.1:14000. It can be changed in the ecal.ini, even though there usually isn't a reason to do that. We call this the monitoring layer, and if you start the eCAL Monitor you will see a graphical representation of the messages sent on that layer. If your interfaces aren't up yet, the multicast group cannot be joined, the monitoring layer will not be able to match publishers and subscribers, and no data will flow between your nodes. Keep in mind that this is necessary either way, even for pure-local SHM communication, as this lets the eCAL nodes discover each other before they even start sending data via SHM.
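
For reference, the group and port mentioned above live in the [network] section of ecal.ini; a sketch with the defaults (key names as in a stock eCAL 5 ecal.ini, please check your installed file for the exact spelling):

[network]
multicast_group           = 239.0.0.1
multicast_port            = 14000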

Debugging

If you see this happening, you can execute the following command:

netstat -g

This will print the multicast group memberships for all interfaces. I expect it will show something invalid or unwanted there.
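
If netstat is not available (it often isn't in slim container images), the iproute2 equivalent gives the same information; 239.0.0.1 should show up for the interface your eCAL nodes use:

ip maddr show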

Forcing SHM

To force SHM, you should switch the layer to 1 (-> ON). Your current setting is 2 (-> Auto), which will only publish data via SHM if there are local subscribers. When setting SHM to 1 you should immediately see some files in /dev/shm. This will however probably not help you, as the subscriber doesn't know about the filenames.

[publisher]
use_inproc                = 0
use_shm                   = 1
use_tcp                   = 0
use_udp_mc                = 0

Future

The next eCAL release (5.11) will bring two features related to this:

  1. A shared-memory-based monitoring layer that will not depend on network interfaces for local communication
  2. An option to join the multicast group on all interfaces, so if you switch interfaces later you will not have to restart eCAL nodes

KerstinKeller commented 1 year ago

@bsirang one question: when you are saying this is happening with 5.10, does it also mean it didn't happen with 5.9?

(We have internally encountered some problems which occurred after the 5.9 to 5.10 upgrade, and even though we did our best, we were not able to pinpoint them exactly. So we changed our strategy: leave things as they are for now and do some major internal cleanup and better testing for 5.11.)

Also, when you say that the data is flowing with no SHM files: @FlorianReimold, is that in any way related to the unlinking of SHM files? But I guess none of those commits are in any of the releases yet.

> At this point, I'm more interested in learning the details of how eCAL chooses which transport to use, so that I can better debug issues like this. Given that use_shm is the only non-zero option in the .ini file, does eCAL still try to determine which transport to use dynamically in this case? Can someone point me to this logic in the eCAL source code?

Also, our configuration is more than "suboptimal". Basically, it's possible to misconfigure your system so that no communication takes place at all, e.g. by setting udp = 2 and shm = 0: then you won't see any local communication unless there is an external subscriber on a different host, in which case a subscriber on the same host will also see the UDP traffic.
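
As a sketch with the [publisher] keys used earlier in this thread, the misconfiguration described above would look like this, leaving a local-only setup with no layer that local subscribers actually receive on:

[publisher]
use_shm                   = 0
use_udp_mc                = 2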

For eCAL 6 (and maybe earlier, if we find a good strategy for converting the ini files), we plan to do it differently. There would be config items per communication scope (intra-process, inter-process, ...), which you set to the appropriate layers (shm, udp, tcp, ...) and not vice versa.

bsirang commented 1 year ago

@FlorianReimold What's interesting is that in the cases where the /dev/shm files do not exist, communication still seemed functional, so it was as if UDP was being used rather than SHM. I didn't sniff packets to prove that though.

I should mention also that when my applications start up in this state, if I then try to run another tool that subscribes after the system is up, I get shm_open failures due to "file not found", which to me implies that the subscriber believes the SHM files should exist.

I was also thinking it smelled like a race condition related to networking bringup. I will make the config change to use_shm = 1 to see if my problem goes away.

@KerstinKeller It may or may not have happened in v5.9; I actually don't know, because I wasn't initially trying to force SHM when I was on v5.9 and hadn't done any analysis. Regarding the unlinking of SHM files, I am aware of that issue as well, as I have noticed SHM files don't get cleaned up. I did see the issue related to it, so I am also waiting on that fix.

bsirang commented 1 year ago

Is there a release date for v5.11? I am looking forward to the upgrade.

FlorianReimold commented 1 year ago

If /dev/shm contained no files, it probably fell back to UDP. You can check the layer with the eCAL Monitor (right-click the column headers to display the layer column):

(Screenshot: eCAL Monitor topic list with the Layer column enabled)

eCAL 5.11 is scheduled for November 2022: https://github.com/eclipse-ecal/ecal/milestone/3

Can you create an as-simple-as-possible example, maybe just using ecal_sample_person_snd + ecal_sample_person_rec, that shows the issue? Something that I can test in an Ubuntu VM to see if I can reproduce it?
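
For instance, a minimal sketch along those lines (the image name and the assumption that the eCAL samples are installed in it are placeholders, not something from this thread):

# start the sample publisher and subscriber in separate containers
docker run -d --network host --ipc host --pid host my-ecal-image ecal_sample_person_snd
docker run -d --network host --ipc host --pid host my-ecal-image ecal_sample_person_rec
# then check whether the publisher's memory file shows up on the host
ls -l /dev/shm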

bsirang commented 1 year ago

I have more interesting and confusing findings.

First, the switch to

[publisher]
use_inproc                = 0
use_shm                   = 1
use_tcp                   = 0
use_udp_mc                = 0

Didn't seem to help. I still encountered a scenario where my applications were up but /dev/shm looked like this:

$ ls -l /dev/shm/
total 32
-rw-rw-rw- 1 root root  104 Oct  6 11:21 ecal_shutdown_process_14324_evt
-rw-rw-rw- 1 root root  104 Oct  6 12:10 ecal_shutdown_process_16424_evt
-rw-rw-rw- 1 root root  104 Oct  6 12:22 ecal_shutdown_process_16897_evt
-rw-rw-rw- 1 root root  104 Oct  6 12:35 ecal_shutdown_process_17139_evt
-rw-rw-rw- 1 root root  104 Oct  6 12:43 ecal_shutdown_process_17372_evt
-rw-rw-rw- 1 root root  104 Oct  6 12:45 ecal_shutdown_process_17500_evt
-rw------- 1 root root 4096 Oct  6 11:20 PHS-7107
-rw----rw- 1 root root 4096 Oct  6 11:20 PHS-7640

I didn't have ecal_mon_gui readily available in this environment, but I have a CLI tool that can print the topics.proto data, and I found that the tlayer field was in fact set to tl_ecal_shm for all of my publishers.

{
 "rclock": 2,
 "hid": 0,
 "hname": "my-hostname",
 "pid": 12532,
 "pname": "/path/to/app/foobar",
 "uname": "foobar",
 "tid": "30917891712",
 "tname": "/battery/state",
 "direction": "publisher",
 "ttype": "proto:trellis.core.TimestampedMessage",
 "tdesc": "CtQBChlnb29nbGUvcHJvdG9idWYvYW55LnByb3RvEg9nb29nbGUucHJvdG9idWYiJgoDQW55EhAKCHR5cGVfdXJsGAEgASgJEg0KBXZhbHVlGAIgASgMQnYKE2NvbS5nb29nbGUucHJvdG9idWZCCEFueVByb3RvUAFaLGdvb2dsZS5nb2xhbmcub3JnL3Byb3RvYnVmL3R5cGVzL2tub3duL2FueXBiogIDR1BCqgIeR29vZ2xlLlByb3RvYnVmLldlbGxLbm93blR5cGVzYgZwcm90bzMK7wEKH2dvb2dsZS9wcm90b2J1Zi90aW1lc3RhbXAucHJvdG8SD2dvb2dsZS5wcm90b2J1ZiIrCglUaW1lc3RhbXASDwoHc2Vjb25kcxgBIAEoAxINCgVuYW5vcxgCIAEoBUKFAQoTY29tLmdvb2dsZS5wcm90b2J1ZkIOVGltZXN0YW1wUHJvdG9QAVoyZ29vZ2xlLmdvbGFuZy5vcmcvcHJvdG9idWYvdHlwZXMva25vd24vdGltZXN0YW1wcGL4AQGiAgNHUEKqAh5Hb29nbGUuUHJvdG9idWYuV2VsbEtub3duVHlwZXNiBnByb3RvMwrmAQomdHJlbGxpcy9jb3JlL3RpbWVzdGFtcGVkX21lc3NhZ2UucHJvdG8SDHRyZWxsaXMuY29yZRoZZ29vZ2xlL3Byb3RvYnVmL2FueS5wcm90bxofZ29vZ2xlL3Byb3RvYnVmL3RpbWVzdGFtcC5wcm90byJqChJUaW1lc3RhbXBlZE1lc3NhZ2USLQoJdGltZXN0YW1wGAEgASgLMhouZ29vZ2xlLnByb3RvYnVmLlRpbWVzdGFtcBIlCgdwYXlsb2FkGAIgASgLMhQuZ29vZ2xlLnByb3RvYnVmLkFueWIGcHJvdG8z",
 "tlayer": [
  {
   "type": "tl_ecal_shm",
   "version": 0,
   "confirmed": true,
   "parShm": ""
  }
 ],
 "tsize": 2499,
 "connectionsLoc": 3,
 "connectionsExt": 0,
 "messageDrops": 0,
 "did": "7018130021014966792",
 "dclock": "4487",
 "dfreq": 986,
 "attr": {}
}

However, when I run my CLI tool, which attempts to subscribe to the topic, I get this error

$ docker run -it -v /opt/deployment:/opt/deployment -v /opt/deployment/etc/ecal/ecal.ini:/etc/ecal/ecal.ini --network host --ipc host --pid=host ubuntu:22.04 /opt/deployment/bin/trellis-cli topic echo -w -r 1 -t /battery/state
shm_open failed : /_battery_state_30917984_shm errno: No such file or directory

My only thought at this point was that the Docker containers were actually not using the host's IPC namespace despite the --ipc host flag. However, I ran this command:

docker ps --quiet --all | xargs docker inspect --format '{{ .Name }}: IpcMode={{ .HostConfig.IpcMode }}'

This confirmed that the Docker containers were running with host IPC.

So I am puzzled how the /dev/shm files could be missing while the monitoring layer says shm is used, and while fields like connectionsLoc are showing a non-zero value.

Any thoughts?

bsirang commented 1 year ago

@FlorianReimold / @KerstinKeller I'm curious if you have any theories for how the /dev/shm files may be missing while, at the same time, the monitoring layer is reporting tl_ecal_shm?

FlorianReimold commented 1 year ago

Difficult to say; in theory, there could just be something that deletes the files. In order to make an educated guess, I will need more information.

If you could provide instructions on how to reproduce your setup (e.g. some Dockerfiles and shell commands), I could try to reproduce and debug it myself, and then maybe give a better explanation or even fix it.

FlorianReimold commented 1 year ago

> Do you have a method of using tcpdump or similar to sniff the UDP traffic?

You can use Wireshark to sniff all traffic on an interface.

> I wonder if this will just go away when we upgrade.

Honestly, I don't believe it will, but it's worth a try anyway. We did change a few things in the SHM code.

Maybe one important note on the SHM files: Those are actually not required for SHM communication. After the communication has been established (i.e. the SHM files have been mapped to process memory by Publisher and Subscriber), the named entries in /dev/shm can be deleted. This will not delete the actual shared memory, only the name that is needed for opening it.
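
To illustrate this with a minimal POSIX sketch (plain shm_open/mmap, not eCAL code; on older glibc versions you may need to link with -lrt):

// Minimal POSIX sketch showing that a mapping stays valid after the
// /dev/shm name has been unlinked.
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
  const char* name = "/demo_shm";  // shows up as /dev/shm/demo_shm
  int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
  if (fd < 0) { std::perror("shm_open"); return 1; }
  if (ftruncate(fd, 4096) != 0) { std::perror("ftruncate"); return 1; }

  void* mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }
  close(fd);  // the descriptor is no longer needed once the mapping exists

  shm_unlink(name);  // the /dev/shm entry disappears here ...
  std::strcpy(static_cast<char*>(mem), "still usable");  // ... but the memory remains usable
  std::printf("%s\n", static_cast<char*>(mem));

  munmap(mem, 4096);  // the memory itself is released once all mappings are gone
  return 0;
}

This mirrors what the thread later converges on: the name in /dev/shm can be removed while already-connected publishers and subscribers keep working, but any new subscriber that tries to shm_open the name afterwards fails with "No such file or directory".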

KerstinKeller commented 1 year ago

I wonder if it makes sense that we can have a call / screenshare to take a deeper look.

We're not using ecal in conjunction with docker extensively, so some behaviors might have changed without us noticing.

There is some (too much?) logic in eCAL for figuring out how an eCAL publisher will communicate with interested subscribers. Of course it wants to send only on layers where other processes are listening.

When dealing with docker containers, it might not work as reliably, as the heuristics to figure out which layer to apply might fail. I recall we did make some changes in the logic between 5.9 and 5.10.

bsirang commented 1 year ago

@FlorianReimold good point about the fact that the shared memory files can disappear and the memory mapped region can still be active and in use.

I will instrument some additional logging in ecal_memfile_linux.cpp and will see if I can reproduce the issue.

bsirang commented 1 year ago

I reproduced the issue after adding more instrumentation. I believe what's happening is that the files are created, the subscribers open them successfully, and then the files are later deleted. Since the subscribers already had them open, they continue to work.

I believe they are getting deleted in my system environment due to this: https://superuser.com/questions/1117764/why-are-the-contents-of-dev-shm-is-being-removed-automatically

I will do more testing to confirm that the systemd RemoveIPC option is the culprit.

FlorianReimold commented 1 year ago

Ah, that's interesting! I didn't know about that. So basically your Docker containers are started before the user is logged in?

bsirang commented 1 year ago

@FlorianReimold yes, we run our eCAL services via systemd, and they launch automatically on boot. They run under a user account with UID 1000, which is in the non-system UID range, and thus systemd does IPC garbage collection on it (which I didn't know was a feature).

I have not yet seen the problem after changing that configuration option, so I believe this issue can be closed.

It may be worth a call-out in the documentation somewhere that some system services, such as systemd, may clean up /dev/shm files, which can cause problems. This was not related to Docker after all.
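
For reference, the systemd option in question is RemoveIPC= in logind.conf (in many distributions it defaults to yes for non-system users). A sketch of the workaround, assuming the stock config path:

# /etc/systemd/logind.conf
[Login]
RemoveIPC=no

systemd-logind then needs to be restarted (or the machine rebooted) for the change to take effect.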