bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval
https://github.com/bihealth/sodar-server
MIT License

Timeline status update may fail on landing zone deletion #1894

Open mikkonie opened 9 months ago

mikkonie commented 9 months ago

This is a very random bug I've observed in production a couple of times (twice at the time of writing, to be exact): the landing zone deletion itself succeeds, but the associated timeline event remains stuck in the SUBMIT state.

I'll try to figure out how to reproduce this. Setting the zone status is the last task in the landing_zone_delete flow. A possible explanation would be an error/crash in run_flow() after the flow has returned. I'll have to look into the server logs to see if any errors have been raised.
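
To make the hypothesis concrete, here's a minimal sketch of the suspected failure window. This is purely illustrative and not the actual taskflow code; the names and structure are placeholders:

# Illustrative sketch only, not actual SODAR code
def run_flow_sketch(flow, tl_event, async_mode):
    flow_result = flow.run()  # landing_zone_delete sets the zone status as its last task
    # Anything raising here, after the flow has returned but before the
    # timeline update below completes, would leave the event in SUBMIT
    if flow_result and tl_event and async_mode:
        tl_event.set_status('OK', 'Async submit OK')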

One possible cause might be an iRODS timeout similar to #1458 and #1893. On the other hand, today I witnessed this happen in production at a time when iRODS connections were otherwise working without issues.

mikkonie commented 9 months ago

Still looking for a way to reproduce this. An interesting thing I observed is that in an affected project, other landing zone operations have been successfully triggered after the zone deletion. This means we do reach the project unlock code in run_flow(), which is placed right after the timeline status update.

Hence my interest is now targeted at this bit in run_flow():

# Flow completion
if flow_result and tl_event and async_mode:
    tl_event.set_status('OK', 'Async submit OK')
# Exception/failure
elif not flow_result and not ex_msg:
    ex_msg = UNKNOWN_RUN_ERROR

There is no error in the log, so it looks like neither of these conditions is met. ex_msg is definitely not set at this point, as we never raise an error. async_mode is always set when this flow is triggered. Likewise, tl_event should always be there, as we'd get a crash if the tl_uuid query failed.
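
For completeness, the silent fall-through combinations can be enumerated mechanically. This standalone snippet (booleans standing in for the real values) confirms that the only ways to skip both branches are a truthy flow_result with a falsy tl_event or async_mode, or a falsy flow_result with ex_msg already set, all of which seem ruled out above:

from itertools import product

# List all combinations where neither branch of the quoted block fires,
# i.e. no timeline status update and no error message set
for flow_result, tl_event, async_mode, ex_msg in product([True, False], repeat=4):
    ok_branch = flow_result and tl_event and async_mode
    err_branch = not flow_result and not ex_msg
    if not ok_branch and not err_branch:
        print(flow_result, tl_event, async_mode, ex_msg)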

This is a weird one. I could add some debug logging at the aforementioned point in run_flow() to see if there is some unforeseen way these conditions can fail. Or maybe I'm looking at the wrong thing entirely: maybe we never get this far in the execution of run_flow() and the lock simply times out? But it definitely shouldn't time out that fast.
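
If I go the logging route, it could look something like the sketch below. The else branch catches exactly the silent fall-through; the logger setup and placement are assumptions on my part, not committed code:

import logging
logger = logging.getLogger(__name__)

# Flow completion
if flow_result and tl_event and async_mode:
    tl_event.set_status('OK', 'Async submit OK')
# Exception/failure
elif not flow_result and not ex_msg:
    ex_msg = UNKNOWN_RUN_ERROR
# Debug logging for the silent fall-through observed in this issue
else:
    logger.debug(
        'Flow completion fall-through: flow_result=%s, tl_event=%s, '
        'async_mode=%s, ex_msg=%s', flow_result, tl_event, async_mode, ex_msg
    )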

mikkonie commented 9 months ago

I queried our production database. Out of 901 deletions, 15 have their timeline event stuck in the SUBMIT state. Timewise, these range from November 2022 to the present day. Interestingly, 6 of them happened in the same project on the same day (2023-10-25).

I'll try to come up with and test some variables which could affect this, but if that leads nowhere, I'll leave the issue ongoing for now and see if I can observe more of this in production.

Mental note: here's a two-liner to grab the affected events in the Django shell.

from timeline.models import ProjectEvent
events = [e for e in ProjectEvent.objects.filter(event_name='zone_delete') if e.get_status().status_type == 'SUBMIT']
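
Building on that, the stuck events can also be grouped by project and day to spot clusters like the six mentioned above. The project title and status timestamp field names here are assumptions, so adjust as needed:

from collections import Counter

# Group stuck zone_delete events by project and day; 'events' comes from
# the two-liner above
clusters = Counter(
    (e.project.title, e.get_status().timestamp.date()) for e in events
)
for (project, day), count in clusters.most_common():
    print(project, day, count)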

One thing I noticed is that at least some of the zones in question (maybe all, I need to check) had previously failed with "Failed to remove collection". That doesn't exactly explain why we wouldn't hit that bit of code, though.

mikkonie commented 9 months ago

I'll keep an eye out for further occurrences of this issue. It might simply be caused by an admin setting the deleted status in the shell and neglecting to update or delete the related timeline status. If that is the case, the proper fix would be to provide actual admin tools; for that, see #1892.
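
For reference, such a manual intervention would look roughly like this in the Django shell. This is a hypothetical sketch with assumed model and field names; the point is that the zone status gets set while the related timeline event is never updated, producing exactly the state observed here:

# Hypothetical manual intervention; model/field names are assumptions
from landingzones.models import LandingZone
zone = LandingZone.objects.get(sodar_uuid='...')  # placeholder UUID
zone.status = 'DELETED'
zone.status_info = 'Deleted manually in the shell'
zone.save()
# Missing step: the related timeline event status is never set to OK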

This may well be the same issue as #1798, so I'm setting that one to ongoing as well.