File potentially lost if iRODS connection times out during landing zone moving

This is something I've observed happen in production a few times recently.

IF the landing_zone_move flow is in the MOVING state
AND our iRODS connection times out with "Could not receive server response"
THEN a single file may afterwards exist neither in the zone nor in the project sample data
- Presumably, this is the file that was being moved at the time of the timeout

This requires the user to re-upload the file into the zone before retrying the zone move. Not the end of the world, but definitely inconvenient.

In practice, the iRODS timeouts we've observed are almost always caused by a problem with the file system. However, technically a timeout could also happen if there was a network problem between the iCAT server and the resource server. In our production setup, the iCAT server runs on the same host as SODAR, but the resource server doesn't.

We should think of a solution to at least try to mitigate this on the SODAR side.

One possible approach would be to copy the files in landing_zone_move instead of moving them. The flow would then delete the original files only once successful copying has been confirmed. This temporarily takes more disk space, but that should not be a major issue. However, if a copy can be terminated in the middle of a file, leaving an unfinished copy in the project sample data, the flow needs to work around that somehow, because the user has no access to modify files in the sample data.
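To make the copy-then-delete idea concrete, here is a minimal sketch. It uses the local file system as a stand-in for iRODS, so it does not reflect how the actual flow or the iRODS client works. The function names and the `.partial` suffix are hypothetical. The idea is to copy into a temporary name, verify the checksum, rename into place, and only then delete the original, so that an interrupted copy never masquerades as a finished file.

```python
import hashlib
import shutil
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_then_delete(src: Path, dest: Path) -> None:
    """Move src to dest so that at least one complete copy always exists."""
    if not src.exists() and dest.exists():
        return  # an earlier attempt already completed this file
    tmp = dest.with_name(dest.name + ".partial")  # hypothetical suffix
    # A leftover temporary file means an earlier attempt was interrupted
    tmp.unlink(missing_ok=True)
    src_sum = sha256sum(src)
    # Skip the copy if a verified destination file is already in place
    if not (dest.exists() and sha256sum(dest) == src_sum):
        shutil.copyfile(src, tmp)
        if sha256sum(tmp) != src_sum:
            tmp.unlink()
            raise IOError(f"Checksum mismatch when copying {src}")
        tmp.replace(dest)  # rename within the same directory
    # Only now is it safe to remove the original
    src.unlink()
```

Because the cleanup of unfinished copies happens inside the function itself, simply retrying after a timeout would be enough in this sketch, with no manual intervention in the sample data. Whether the equivalent rename and checksum operations behave the same way on our iRODS setup is an assumption that would need to be verified.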

I'll have to think about this. Ideas and suggestions are of course welcome. A major problem in developing a solution is that we'd somehow have to simulate these Ceph problems in the dev/test environment.
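One option that avoids touching Ceph at all might be to inject the failure at the level where the flow calls the storage operation. The sketch below is only an illustration under that assumption: `FlakyCopier` and `simulate_interrupted_move` are hypothetical helpers, and the local file system again stands in for iRODS. It raises a timeout on a chosen call and leaves a truncated file behind, which is the situation a fix would have to survive.

```python
import shutil
import tempfile
from pathlib import Path


class FlakyCopier:
    """Copy files, but raise TimeoutError on the Nth call (hypothetical helper)."""

    def __init__(self, fail_on_call: int):
        self.fail_on_call = fail_on_call
        self.calls = 0

    def copy(self, src: Path, dest: Path) -> None:
        self.calls += 1
        if self.calls == self.fail_on_call:
            # Leave a truncated file behind, as an interrupted transfer might
            dest.write_bytes(src.read_bytes()[:1])
            raise TimeoutError("Could not receive server response")
        shutil.copyfile(src, dest)


def simulate_interrupted_move(file_count: int, fail_on_call: int) -> dict:
    """Run a copy loop with an injected timeout and report what survived."""
    with tempfile.TemporaryDirectory() as tmp:
        zone, coll = Path(tmp, "zone"), Path(tmp, "sample_data")
        zone.mkdir()
        coll.mkdir()
        for i in range(file_count):
            (zone / f"file{i}.txt").write_text(f"content {i}")
        copier = FlakyCopier(fail_on_call)
        timed_out = False
        try:
            for src in sorted(zone.iterdir()):
                copier.copy(src, coll / src.name)
        except TimeoutError:
            timed_out = True
        return {
            "timed_out": timed_out,
            "in_zone": sorted(p.name for p in zone.iterdir()),
            "in_sample_data": sorted(p.name for p in coll.iterdir()),
        }
```

This only reproduces the symptom seen by the flow, not the underlying file system problem, so it would complement rather than replace testing against a real iRODS instance.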

bihealth/sodar-server, issue #1893