bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval
https://github.com/bihealth/sodar-server
MIT License

File potentially lost if iRODS connection times out during landing zone moving #1893

Open mikkonie opened 5 months ago

mikkonie commented 5 months ago

I've observed this happen in production a few times recently: if the iRODS connection times out while a landing zone is being moved, a file may be lost.

When this happens, the user has to re-upload the file into the zone before retrying the move. Not the end of the world, but definitely inconvenient.

In practice, the iRODS timeouts we have observed are almost always due to a problem with the file system. However, I guess technically the same could also happen if there were a network problem between the iCAT and the resource server. In our production setup, the iCAT server runs on the same host as SODAR, but the resource server doesn't.

We should think of a solution to at least try to mitigate this on the SODAR side.

One possible approach might be to copy the files in landing_zone_move instead of moving them. The flow could then delete the original files afterwards, once successful copying is confirmed. This temporarily takes more disk space, but that should not be a major issue. However, if a copy can be terminated in the middle of a file, leaving an unfinished copy in the project sample data, the flow needs to somehow work around that, because the user has no access to modify files in the sample data.
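To make the copy-then-delete idea concrete, here is a minimal sketch. It is not SODAR or iRODS code: it uses local directories as a stand-in for the landing zone and the sample data collection, and the names `copy_then_delete`, `sha256sum` and the `.part` suffix are hypothetical. Each file is copied under a temporary name, verified by checksum, and renamed. Originals are only deleted once every copy is verified, and anything written during a failed run is removed again.

```python
import hashlib
import shutil
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_then_delete(zone_dir: Path, sample_dir: Path, copy=shutil.copyfile) -> None:
    """Copy all files from zone_dir into sample_dir, then delete the originals.

    Local-filesystem stand-in for the flow (assumption): `copy` is injectable
    so that a failing copy can be simulated in tests.
    """
    sources = sorted(p for p in zone_dir.rglob("*") if p.is_file())
    created = []  # everything written by this run, removed again on failure
    try:
        for src in sources:
            dest = sample_dir / src.relative_to(zone_dir)
            if dest.exists():
                raise FileExistsError(f"Refusing to overwrite {dest}")
            dest.parent.mkdir(parents=True, exist_ok=True)
            tmp = dest.with_name(dest.name + ".part")
            created.append(tmp)
            copy(src, tmp)
            # Confirm the copy is complete before giving it its final name
            if sha256sum(src) != sha256sum(tmp):
                raise IOError(f"Checksum mismatch for {src}")
            tmp.replace(dest)
            created[-1] = dest
    except Exception:
        # Roll back: no partial or unverified copy stays in the sample data
        for path in created:
            path.unlink(missing_ok=True)
        raise
    # Every copy is verified, so removing the originals is now safe
    for src in sources:
        src.unlink()
```

If the copy of any file fails, the exception propagates, the landing zone still contains all original files, and the move can simply be retried. Whether the equivalent cleanup is reliable against iRODS while the storage itself is misbehaving is an open question this sketch does not answer.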

I'll have to think about this. Ideas and suggestions are of course welcome. A major problem in developing a solution is that we'd somehow have to simulate these Ceph problems in the dev/test environment.
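One way to approximate such failures without a misbehaving storage backend might be fault injection at the Python level. The sketch below is only an illustration under that assumption: `FlakyReader` and `copy_with_timeout` are hypothetical names, and it simulates a timeout on a plain file object rather than reproducing real Ceph or iRODS behaviour. It shows that a copy interrupted mid-file leaves a truncated destination behind, which is the case the flow would have to handle.

```python
import io
import shutil


class FlakyReader:
    """Wrap a binary file object and raise TimeoutError on the first read
    attempted after `limit` bytes have been served (hypothetical test helper)."""

    def __init__(self, raw, limit):
        self.raw = raw
        self.limit = limit
        self.served = 0

    def read(self, size=-1):
        if self.served >= self.limit:
            raise TimeoutError("simulated storage timeout")
        remaining = self.limit - self.served
        # Never hand out more than the remaining byte budget
        if size is None or size < 0 or size > remaining:
            size = remaining
        data = self.raw.read(size)
        self.served += len(data)
        return data


def copy_with_timeout(payload: bytes, limit: int) -> bytes:
    """Copy payload through a FlakyReader and return what reached the destination."""
    dest = io.BytesIO()
    try:
        shutil.copyfileobj(FlakyReader(io.BytesIO(payload), limit), dest)
    except TimeoutError:
        pass  # the destination now holds an unfinished copy
    return dest.getvalue()


# A 10-byte "file" whose copy times out after 4 bytes is left truncated
print(copy_with_timeout(b"0123456789", 4))
```

A test built on this kind of wrapper could assert that the flow cleans up the truncated copy and leaves the original in the landing zone. It would not cover failures that only appear at the network or resource server level.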