The storage system we use for iRODS in production is experiencing a lot of performance issues. This results in checksum calculation errors as iRODS becomes unable to read files as required. SODAR gets the blame for that, which is factually incorrect but understandable, as it's the part of the system most visible to the user.
Taskflowbackend is able to recover from these errors and continue the operation, so failing to calculate a single checksum will not stop the landing_zone_move flow execution. Alas, once we get to the actual validation part, the execution of the flow will naturally fail, as not all of the checksums have been successfully computed.
Restarting the flow often (albeit not reliably) helps, as the storage system may have recovered from its issues in the meantime.
Hence, we could try retrying the checksum calculation up to N times when it fails due to a temporary server failure.
This is, obviously, a workaround and a hack. The proper solution involves improving the storage backend. But if this does end up helping with the case of failed validations, it could be an acceptable temporary solution with a simple implementation. Might as well give it a shot.
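The retry idea could look roughly like the sketch below. This is only an illustration and not the actual taskflowbackend code: `checksum_with_retry`, `TransientChecksumError` and the `calc_checksum` callable are hypothetical names, and which iRODS exceptions actually count as temporary would still need to be determined.

```python
import time


class TransientChecksumError(Exception):
    """Hypothetical stand-in for a temporary server-side failure."""


def checksum_with_retry(calc_checksum, path, retries=5, delay=2.0, sleep=time.sleep):
    """
    Call calc_checksum(path), making at most `retries` attempts.

    calc_checksum is assumed to raise TransientChecksumError when the
    storage is temporarily unreadable. Any other exception propagates
    immediately, as does the last transient error once attempts run out.
    """
    for attempt in range(1, retries + 1):
        try:
            return calc_checksum(path)
        except TransientChecksumError:
            if attempt == retries:
                raise  # Out of attempts, let the flow fail as before
            sleep(delay)  # Give the storage some time to recover
```

A fixed delay keeps things simple. If the storage tends to stay unavailable for longer periods, increasing the delay between attempts might work better, at the cost of slower flows.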