e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Pipeline stuck for user because `TRIP_SEGMENTATION` did not mark 'completed' nor 'failed' #1075

Open · JGreenlee opened this issue 5 months ago

JGreenlee commented 5 months ago

The user has a 2-week backlog of draft trips, going back to May 7. I searched CloudWatch for error messages for this user with the filter expression `%ERROR.*uuidOfTheUser%`

I found this error, which occurs repeatedly (presumably on every pipeline run):

    ERROR:139822526199616:Found error curr_state.curr_run_ts = 1715124566.4760673 while processing pipeline for user █████, skipping

The error occurs shortly after `INFO:139822526199616:**********UUID █████: segmenting into trips**********`. The first time the error occurred was 2024-05-08T01:49:41.377Z.

My understanding is that there is a `curr_run_ts` left over from the previous time the `TRIP_SEGMENTATION` stage ran, which means that run exited improperly without being marked as 'completed' or 'failed'.

I checked logs of the previous run of the pipeline, but I couldn't find anything out of the ordinary.
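For anyone unfamiliar with the pipeline state handling, here is a minimal sketch of the behavior I'm describing, with hypothetical names (not the actual e-mission code): a stage records `curr_run_ts` when it starts and clears it when it is marked 'completed' or 'failed', so a run that dies mid-stage leaves the marker set, and every later run logs the error above and skips the user.

```python
# Minimal sketch with hypothetical names (not the actual e-mission code):
# how a leftover curr_run_ts causes every subsequent run to skip the user.
import logging
import time


class PipelineState:
    def __init__(self):
        # set when a stage starts, cleared when it is marked completed/failed
        self.curr_run_ts = None


def run_trip_segmentation_guarded(user_id, curr_state, run_segmentation):
    if curr_state.curr_run_ts is not None:
        # A previous run set curr_run_ts but exited without clearing it
        # (e.g. the task was killed), so the user is skipped on every run
        # until the state is reset.
        logging.error("Found error curr_state.curr_run_ts = %s while processing "
                      "pipeline for user %s, skipping",
                      curr_state.curr_run_ts, user_id)
        return

    curr_state.curr_run_ts = time.time()  # mark the stage as in progress
    try:
        run_segmentation(user_id)
        curr_state.curr_run_ts = None     # 'completed': clear the marker
    except Exception:
        curr_state.curr_run_ts = None     # 'failed': also clears the marker
        raise
```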

shankari commented 5 months ago

This can also sometimes happen if the pipeline takes a long time to run and a new instance is launched before the first one is complete. At least on AWS, new scheduled tasks should be launched only when the previous one completes, but apparently this doesn't always happen.

I typically use `bin/monitor/find_invalid_pipeline_state.py` and `bin/monitor/reset_invalid_pipeline_states.py`. If you let me know which program the user is a part of, I can run those scripts and fix the issue.
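Roughly speaking (this is only a sketch, assuming the pipeline states live in a MongoDB collection with a `curr_run_ts` field; the actual scripts may differ), finding and resetting invalid states boils down to:

```python
# Sketch only: assumes pipeline states are stored in a MongoDB collection
# (called Stage_pipeline_state here) where curr_run_ts is non-null for stages
# that started but were never marked 'completed' or 'failed'. The real
# find_invalid_pipeline_state.py / reset_invalid_pipeline_states.py may differ.
import pymongo


def find_invalid_pipeline_states(db):
    # Stages still marked "in progress" from an earlier run
    return list(db.Stage_pipeline_state.find({"curr_run_ts": {"$ne": None}}))


def reset_invalid_pipeline_states(db):
    # Clearing curr_run_ts lets the next scheduled run pick these stages up again
    return db.Stage_pipeline_state.update_many(
        {"curr_run_ts": {"$ne": None}}, {"$set": {"curr_run_ts": None}})


if __name__ == "__main__":
    db = pymongo.MongoClient("localhost").Stage_database
    stuck = find_invalid_pipeline_states(db)
    print("found %d stuck pipeline state(s)" % len(stuck))
```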

shankari commented 5 months ago

I reset the pipeline for this user, and it looked like it failed again during TRIP_SEGMENTATION.

shankari commented 5 months ago

We checked, and the pipeline for this user is still running, stuck in trip segmentation. It is currently processing data from the 18th.

    2024-05-22 15:33:51,351:DEBUG:140473223255872:sample activities are [{'fmt_time': '2024-05-18T14:15:09.670000-04:00'}]

In parallel, even if another run is launched (perhaps because AWS thought this one was "stuck"), that run will fail in the trip segmentation stage, but the `curr_run_ts` will be cleared when the current run ends, so it should be OK.

Some potential fixes:

shankari commented 5 months ago

Looking more carefully, the trip segmentation just stopped at

    2024-05-07T23:35:25.757Z    2024-05-07 23:35:25,757:DEBUG:139645930166080:finished querying values for ['background/motion_activity'], count = 1
    2024-05-07T23:35:25.758Z    2024-05-07 23:35:25,758:DEBUG:139645930166080:finished querying values for [], count = 0
    2024-05-07T23:35:25.759Z    2024-05-07 23:35:25,758:DEBUG:139645930166080:orig_ts_db_matches = 1, analysis_ts_db_matches = 0

There are no additional logs (from any user) in this log file.

And the next log file starts on May 8 at 00:25.

    2024-05-08T00:25:40.737Z    Found configuration, overriding...
    2024-05-08T00:25:40.740Z    
    2024-05-08T00:25:40.775Z    {
    2024-05-08T00:25:40.775Z    "timeseries": {
    2024-05-08T00:25:40.775Z    "url": 
    2024-05-08T00:25:40.775Z    "result_limit": 250000
    2024-05-08T00:25:40.775Z    }
    2024-05-08T00:25:40.775Z    }
    2024-05-08T00:25:40.775Z    0.0.0.0
    2024-05-08T00:25:40.779Z    {

My best guess is that the task was killed because it was taking up too much memory and then just re-launched at its usual time, which appears to be around :25 after the hour.

[Screenshot: 2024-05-22 at 9:07:08 AM]

We can mitigate some of this by reading only a limited number of entries at a time. It may be that one of the other users whose pipeline was running in parallel read a huge amount of data.
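As a rough illustration (hypothetical helper names, not the current implementation), "reading only a limited number of entries at a time" could mean paginating the timeseries query by timestamp instead of relying on a single large read bounded only by the `result_limit` of 250000 seen in the config above:

```python
# Sketch with hypothetical names: page through the timeseries by timestamp so
# that one user with a huge backlog cannot pull the entire range into memory.
def iter_entries_in_batches(fetch_entries, start_ts, end_ts, batch_size=10000):
    # fetch_entries(after_ts, end_ts, limit) is assumed to return at most
    # `limit` entries with data.ts > after_ts, sorted by data.ts ascending.
    after_ts = start_ts
    while True:
        batch = fetch_entries(after_ts, end_ts, batch_size)
        if not batch:
            break
        for entry in batch:
            yield entry
        if len(batch) < batch_size:
            break  # last (partial) page
        after_ts = batch[-1]["data"]["ts"]  # resume after the last entry seen
```

The segmentation code would then consume entries from an iterator like this one page at a time instead of materializing the whole range in memory.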

So to mitigate: