bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval
https://github.com/bihealth/sodar-server
MIT License

Improve handling of iRODS/taskflow failures with connection timeouts #2028

Open mikkonie opened 1 month ago

mikkonie commented 1 month ago

Something I noticed on our production server: in one project, multiple landing zones have failed to delete with the error "Failed to remove collection". The iRODS log displays no errors during this time period.

Looking at the Docker Compose logs, it seems that connections from celeryd to iRODS have been timing out during this period:

sodar-celeryd-default_1  | [2024-10-13 16:11:55,944: WARNING/ForkPoolWorker-1] Exception ignored in: <function Connection.__del__ at 0x7fa18c02e8b0>
sodar-celeryd-default_1  | Traceback (most recent call last):
sodar-celeryd-default_1  |   File "/usr/local/lib/python3.8/site-packages/irods/connection.py", line 90, in __del__
sodar-celeryd-default_1  |     self.disconnect()
sodar-celeryd-default_1  |   File "/usr/local/lib/python3.8/site-packages/irods/connection.py", line 306, in disconnect
sodar-celeryd-default_1  |     self.socket = self.socket.unwrap()
sodar-celeryd-default_1  |   File "/usr/local/lib/python3.8/ssl.py", line 1285, in unwrap
sodar-celeryd-default_1  |     s = self._sslobj.shutdown()
sodar-celeryd-default_1  | socket.timeout: The read operation timed out
sodar-celeryd-default_1  | 2024-10-13 16:11:55,990 [ERROR] taskflowbackend.flows: Exception in run_flow(): Failed to remove collection
sodar-celeryd-default_1  | 2024-10-13 16:11:55,995 [ERROR] taskflowbackend.api: Error running flow: Failed to remove collection

This is not the first time I've seen something like this, and I'd like us to handle these failures better. At the very least we should report the timeout instead of the generic "Failed to remove collection" message, if at all possible. Catching the timeout exception and reporting it back in the timeline/zone status would be a start; see the sketch below.
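A minimal sketch of what that could look like, assuming the removal goes through python-irodsclient; set_status() is a hypothetical callback standing in for whatever updates the timeline/zone status, not an actual SODAR API:

# Sketch only, not SODAR's actual flow code: wrap the collection removal so a
# connection timeout is reported explicitly instead of only the generic message.
import socket

from irods.exception import NetworkException


def remove_collection_reporting_timeouts(irods_session, coll_path, set_status):
    try:
        irods_session.collections.remove(coll_path, recurse=True, force=True)
    except (socket.timeout, NetworkException) as ex:
        # Surface the underlying cause in the timeline/zone status
        set_status(
            'FAILED',
            'iRODS connection timed out while removing collection: {}'.format(ex),
        )
        raise
    except Exception as ex:
        set_status('FAILED', 'Failed to remove collection: {}'.format(ex))
        raise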

As for the failure itself, it seems to be some kind of temporary network error. The iRODS server itself appears to have been up and running just fine at this point, and afterwards everything appears to have recovered without changes. The servers run as Docker containers in the same Docker Compose network, but each server is accessed by its FQDN. Could this just be a temporary DNS glitch? If so, retrying the operation might be enough to recover; see the sketch below.
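A rough sketch of retrying on transient timeouts, purely as an illustration; the retry count and delay are placeholders, not SODAR settings:

# Sketch: retry an iRODS operation on connection timeouts before giving up,
# on the assumption that the failure is a short-lived network/DNS hiccup.
import socket
import time

from irods.exception import NetworkException


def run_with_retry(func, *args, retries=3, delay=5, **kwargs):
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except (socket.timeout, NetworkException):
            if attempt == retries:
                raise  # Give up and let the caller report the timeout
            time.sleep(delay * attempt)  # Simple linear backoff between attempts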

Ideas are welcome.