Landing zone status not always updated with iRODS connection errors

bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval

https://github.com/bihealth/sodar-server

MIT License

Landing zone status not always updated with iRODS connection errors #1458

Open mikkonie opened 2 years ago

mikkonie commented 2 years ago

@ericblanc20 reported a case where a landing zone transfer failed with the following error message:

Error running async flow: Could not receive server response

UPDATE 2024-02: This has been observed when, for example, the file system on an iRODS resource server is slow, unresponsive or offline. The iRODS connection eventually times out, leading to this error. The failure is recorded in the timeline event, but the zone status remains locked in a transitional state such as MOVING. We should be able to update the zone status at the point where we catch the exception for the timeline event.
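To illustrate the idea of updating the zone status where the exception is caught, here is a minimal sketch. It does not use SODAR's actual code: `Zone`, `TimelineEvent`, `set_status()` and `run_async_flow()` are hypothetical stand-ins, and only the status names and the error message prefix are taken from this issue.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Hypothetical stand-in for a landing zone (not the SODAR model)."""
    status: str = 'MOVING'
    status_info: str = ''


@dataclass
class TimelineEvent:
    """Hypothetical stand-in for a timeline event."""
    statuses: list = field(default_factory=list)

    def set_status(self, status, desc=''):
        # Record the status change as a (status, description) tuple
        self.statuses.append((status, desc))


def run_async_flow(flow, zone, event):
    """Run flow() and report a failure to both the timeline and the zone."""
    try:
        flow()
    except Exception as ex:
        msg = 'Error running async flow: {}'.format(ex)
        event.set_status('FAILED', msg)
        # Assumption: also unlocking the zone here lets the user retry later
        zone.status = 'FAILED'
        zone.status_info = msg
        return False
    event.set_status('OK', 'Flow finished')
    return True


def failing_flow():
    # Simulates the iRODS connection timing out
    raise ConnectionError('Could not receive server response')


zone = Zone()
event = TimelineEvent()
ok = run_async_flow(failing_flow, zone, event)
print(ok, zone.status, zone.status_info)
```

With this approach the zone no longer stays in MOVING after the timeout, because the same `except` block that writes the timeline failure also writes the zone status.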

mikkonie commented 2 years ago

I received further reports of this happening about a week ago. Interestingly, all of these failed after validation had succeeded and just as moving the files was about to begin.

Maybe the iRODS connection is for some reason getting shut down or timing out between these actions within the taskflow? I need to look into this further.

mikkonie commented 2 years ago

Even if this is a network error, it should be reported as a proper error in the landing zone status. Currently the failure is only reported in the timeline. If the landing zone were correctly set to FAILED, the user could retry later. I will look into implementing this once I'm done with all the Taskflow updates.

Furthermore, the user's write access to the landing zone remains disabled. This is not a huge issue, but it prevents the user from e.g. adding new files even after the zone status is reset by an admin.

mikkonie commented 7 months ago

This has recently been happening again, so I updated the original description and will try to look into it for v0.14.2.

mikkonie commented 7 months ago

I found one possible point where this may happen. If for some reason the flow has not reverted correctly and we reach the end of run_flow(), the zone status is not updated there, even though every other step for finishing the flow execution is performed. I added a forced status update at that point if the status has not already been set to FAILED or NOT CREATED.
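The fallback described above could look roughly like the sketch below. This is not the actual change made in SODAR: `finish_flow()` and the dict-based zone are hypothetical, and setting the status to FAILED is an assumption about what the forced update writes.

```python
# Statuses named in this comment; a zone in either one is assumed to have
# had its failure recorded already
HANDLED_STATUSES = ('FAILED', 'NOT CREATED')


def finish_flow(zone, flow_ok, error_msg=''):
    """
    Hypothetical cleanup step at the end of run_flow().

    If the flow failed but the revert did not update the zone, force the
    status so the zone does not stay locked in e.g. MOVING.
    """
    if not flow_ok and zone['status'] not in HANDLED_STATUSES:
        zone['status'] = 'FAILED'  # Assumption: FAILED is the forced value
        zone['status_info'] = error_msg or 'Flow failed, status not updated'
    return zone


# Revert did not run properly: zone is still MOVING, so it gets forced
stuck = finish_flow({'status': 'MOVING', 'status_info': ''}, False, 'Timeout')
# Revert already handled the failure: zone is left untouched
handled = finish_flow({'status': 'NOT CREATED', 'status_info': 'x'}, False)
print(stuck['status'], handled['status'])
```

Checking the current status first keeps the fallback from overwriting a more specific status message that a successful revert has already stored.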

I'm still not 100% certain why this could happen. Maybe an uncaught iRODS exception in a flow or task causes us not to revert properly? Or is there a bug in the "set status to FAILED on revert" task itself?

I'll have to monitor whether this change does the trick or whether something fancier is still needed.

mikkonie commented 7 months ago

I'll leave this issue as ongoing for now and monitor whether these problems still occur anywhere.