bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval
https://github.com/bihealth/sodar-server
MIT License

Add retrying for BatchCalculateChecksumTask exceptions #1941

Closed by mikkonie 5 months ago

mikkonie commented 7 months ago

The storage system we use for iRODS in production is experiencing a lot of performance issues. This results in checksum calculation errors as iRODS becomes unable to read files as required. SODAR gets the blame for that, which is factually incorrect but understandable, as it's the part of the system most visible to the user.

Taskflowbackend is able to recover from these crashes and continue the operation, so failing to calculate a single checksum will not stop the landing_zone_move flow execution. Alas, once we get to the actual validation part, the execution of the flow will naturally fail, as not all checksums have been computed correctly.

Restarting the flow does often (albeit not reliably) help, as the storage system may have recovered from its issues in the meantime.

Hence, we could try retrying the checksum calculation up to N times in case it fails due to a temporary server failure.
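
A minimal sketch of what such a retry could look like, for illustration only. This is not the actual BatchCalculateChecksumTask code: `calculate_with_retry`, `calc_checksum`, `retries` and `delay` are hypothetical names, and `calc_checksum` stands in for whatever call asks iRODS for the checksum of a single file.

```python
import time


def calculate_with_retry(
    calc_checksum, path, retries=3, delay=1.0, sleep=time.sleep
):
    """
    Call calc_checksum(path), retrying up to `retries` more times on failure.

    calc_checksum is a hypothetical callable standing in for the real
    checksum request. The final exception is re-raised if all attempts fail.
    """
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return calc_checksum(path)
        # Assumption: any exception is treated as transient here. A real
        # implementation would probably catch only specific error types.
        except Exception as ex:
            last_exc = ex
            if attempt < retries:
                # Give the storage backend a moment to recover
                sleep(delay)
    raise last_exc
```

With `retries=3`, a file gets at most four attempts in total. A fixed delay keeps the sketch simple; whether a longer or increasing wait would suit the storage issues better is an open question.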

This is, obviously, a workaround and a hack. The proper solution involves improving the storage backend. But if this does end up helping with the case of failed validations, it could be an acceptable temporary solution with a simple implementation. Might as well give it a shot.

mikkonie commented 5 months ago

Done. It remains to be seen if this actually helps in production. This is one of those things which is not exactly trivial to test in dev.