airo-ugent / airo-mono

Python packages for robotic manipulation @ IDLab AI & Robotics Lab - UGent - imec
https://airo.ugent.be
MIT License

Multiprocess: Shared Memory not always cleaned up when process crashed #98

Closed: Victorlouisdg closed this issue 5 months ago

Victorlouisdg commented 11 months ago

Describe the bug The MultiprocessRGBPublisher creates several shared memory files in order to operate. These are prefixed with what we call a namespace, e.g. "zed_top" or "camera". They can be seen by running:

ls -l /dev/shm

Output:

total 9024
-rw------- 1 victor victor 3686400 Dec  8 10:11 camera_depth
-rw------- 1 victor victor 2764800 Dec  8 10:11 camera_depth_image
-rw------- 1 victor victor      24 Dec  8 10:11 camera_depth_image_shape
-rw------- 1 victor victor      16 Dec  8 10:11 camera_depth_shape
-rw------- 1 victor victor       8 Dec  8 10:11 camera_fps
-rw------- 1 victor victor      72 Dec  8 10:11 camera_intrinsics
-rw------- 1 victor victor 2764800 Dec  8 10:11 camera_rgb
-rw------- 1 victor victor      24 Dec  8 10:11 camera_rgb_shape
-rw------- 1 victor victor       8 Dec  8 10:11 camera_timestamp

I've noticed during development that when the publisher process is terminated incorrectly, these files are not released/removed. This is a problem when you try to restart the publisher, because you get:

FileExistsError: [Errno 17] File exists: '/camera_rgb'
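
For reference, the failure can be reproduced with plain multiprocessing.shared_memory (a minimal sketch; the name and size below just mirror the listing above and are not the publisher's actual code):

from multiprocessing import shared_memory

# A crashed publisher leaves /dev/shm/camera_rgb behind because it never
# got to call shm.close() and shm.unlink().
shm = shared_memory.SharedMemory(name="camera_rgb", create=True, size=2764800)

# Restarting the publisher then tries to create the same block again and fails:
#   FileExistsError: [Errno 17] File exists: '/camera_rgb'
shm = shared_memory.SharedMemory(name="camera_rgb", create=True, size=2764800)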

Possible solution We could check whether these SHM files exist on creation of a new publisher and close and unlink them. However, this could lead to accidentally removing another camera's shared memory while it's still publishing. I'm not sure whether this is an issue or not.
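
A rough sketch of that check-and-reclaim idea (the helper name is hypothetical, not anything in airo-mono):

from multiprocessing import shared_memory

def create_or_reclaim(name: str, size: int) -> shared_memory.SharedMemory:
    # Try to create the block; if one with this name already exists, assume it
    # is a stale leftover, unlink it and try again. The caveat above applies:
    # we cannot tell a stale block apart from one owned by a live publisher.
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size)
    except FileExistsError:
        leftover = shared_memory.SharedMemory(name=name)
        leftover.close()
        leftover.unlink()
        return shared_memory.SharedMemory(name=name, create=True, size=size)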

Victorlouisdg commented 9 months ago

One solution would be to allow publishers to .unlink() any already existing shared memory in their namespace. However, this seems to cause processes that read from that memory to freeze. So maybe the best solution is: detect -> unlink -> raise exception. Then everything should be cleaned up for when you start new publishers/receivers.
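
Roughly, the detect -> unlink -> raise flow could look like this (a sketch; the helper name and error message are illustrative):

from multiprocessing import shared_memory

def create_shared_memory_or_fail(name: str, size: int) -> shared_memory.SharedMemory:
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size)
    except FileExistsError:
        # Detected a leftover block: unlink it so /dev/shm is clean again,
        # then raise so the user restarts publishers/receivers explicitly
        # instead of silently reusing stale memory.
        stale = shared_memory.SharedMemory(name=name)
        stale.close()
        stale.unlink()
        raise RuntimeError(
            f"Shared memory '{name}' already existed (probably left over from a "
            "crashed publisher). It has been unlinked, please start again."
        )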

m-decoster commented 6 months ago

One solution would be to allow publishers to .unlink() any already existing shared memory in their namespace. However, this seems to cause processes that read from that memory to freeze. So maybe the best solution is: detect -> unlink -> raise exception. Then everything should be cleaned up for when you start new publishers/receivers.

For now, I have implemented this solution on the shared_memory_file_exists branch for centrifuge.

Another solution is to catch the exception and set create=False when making the shared memory object.
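
Roughly (a hypothetical sketch; the name and size again just mirror the listing above):

from multiprocessing import shared_memory

try:
    shm = shared_memory.SharedMemory(name="camera_rgb", create=True, size=2764800)
except FileExistsError:
    # Attach to the leftover block instead of failing. Note that its size may
    # not match what this publisher expects, so a size check is probably still needed.
    shm = shared_memory.SharedMemory(name="camera_rgb", create=False)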

Will look into this more when we get some time to look at mp in depth.