I noticed this in one of the projects on our production server: multiple landing zones failed to delete with the error "failed to remove collection". The iRODS log displays no errors for this time period.
Looking into the Docker Compose logs, it seems that connections from celeryd to iRODS were timing out during this period:
It's not the first time I've seen something like this, and I'd like us to handle these cases better. At the very least we should report the timeout instead of a generic "failed to remove collection", if at all possible. Catching the timeout exception and reporting it back in the timeline and zone status would be a start.
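As a starting point for discussion, here is a minimal sketch of catching the timeout separately from other failures. `delete_zone_collection`, `remove_collection`, `report_status` and the status strings are all hypothetical stand-ins, not existing functions in our code base or in the iRODS client. It also assumes the timeout surfaces as a standard `TimeoutError`; the client library may wrap it in its own exception type, in which case that type would need to be added to the tuple.

```python
import socket
from typing import Callable

# socket.timeout is an alias of TimeoutError since Python 3.10; listing both
# keeps the check working on older interpreters as well.
TIMEOUT_ERRORS = (TimeoutError, socket.timeout)


def delete_zone_collection(
    remove_collection: Callable[[str], None],
    report_status: Callable[[str, str], None],
    path: str,
) -> bool:
    """Remove a collection and report a timeout separately from other errors.

    All three arguments are hypothetical stand-ins: remove_collection wraps
    the actual iRODS call, report_status writes to timeline/zone status.
    """
    try:
        remove_collection(path)
    except TIMEOUT_ERRORS as ex:
        # Specific message so the user sees a connection problem, not a
        # generic removal failure.
        report_status(
            'FAILED',
            f'Timed out connecting to iRODS while removing "{path}": {ex}',
        )
        return False
    except Exception as ex:
        # Anything else keeps the existing generic wording.
        report_status('FAILED', f'Failed to remove collection "{path}": {ex}')
        return False
    report_status('DELETED', f'Removed collection "{path}"')
    return True
```

Passing the iRODS call and the status reporter in as callables is only there to keep the sketch self-contained; in practice this logic would live wherever the zone deletion task already handles its exceptions.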
As for the failure itself, it looks like some kind of temporary network error. The iRODS server itself appears to have been up and running just fine at the time, and afterwards everything recovered without any changes on our side. The servers run as Docker containers in the same Docker Compose network, but each server is accessed by its FQDN. Could this just be a temporary DNS glitch?
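To test the DNS theory, something like the following could be run periodically from inside the celeryd container, logging the result each time. This is only a sketch: `check_dns` is a hypothetical helper, the port defaults to 1247 (the iRODS default, which may not match our setup), and the `resolver` parameter exists only so the function can be exercised without network access.

```python
import socket
import time


def check_dns(hostname, port=1247, resolver=socket.getaddrinfo):
    """Resolve hostname once and return (ok, elapsed_seconds, detail).

    On success, detail is a comma-separated list of resolved addresses.
    On failure, detail is the resolver's error message.
    """
    start = time.monotonic()
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as ex:
        return False, time.monotonic() - start, str(ex)
    # Each entry is (family, type, proto, canonname, sockaddr); the address
    # is the first element of sockaddr for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in infos})
    return True, time.monotonic() - start, ', '.join(addresses)
```

If lookups fail or get noticeably slow around the time of the timeouts, that would point towards DNS; if they stay fast and stable, the problem is more likely elsewhere in the network path.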
Ideas are welcome.