e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Pipeline stuck for user because `TRIP_SEGMENTATION` did not mark 'completed' nor 'failed' #1075

Open · JGreenlee opened this issue 5 months ago

JGreenlee commented 5 months ago

The user has a 2-week backlog of draft trips, going back to May 7. I searched CloudWatch for error messages for this user with the filter expression `%ERROR.*uuidOfTheUser%`

I found this error, which occurs repeatedly (presumably on every pipeline run):

    ERROR:139822526199616:Found error curr_state.curr_run_ts = 1715124566.4760673 while processing pipeline for user █████, skipping

The error occurs shortly after `INFO:139822526199616:**********UUID █████: segmenting into trips**********`. The first time the error occurred was 2024-05-08T01:49:41.377Z.

My understanding is that there is a `curr_run_ts` left over from the previous time the `TRIP_SEGMENTATION` stage ran, which means that run exited improperly without being marked as 'completed' or 'failed'.

I checked logs of the previous run of the pipeline, but I couldn't find anything out of the ordinary.
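For anyone unfamiliar with the pipeline state handling, here is a minimal sketch of the behavior I'm describing, with hypothetical names (not the actual e-mission code): a stage records `curr_run_ts` when it starts and clears it when it is marked 'completed' or 'failed', so a run that dies mid-stage leaves the marker set, and every later run logs the error above and skips the user.

```python
# Minimal sketch with hypothetical names (not the actual e-mission code):
# how a leftover curr_run_ts causes every subsequent run to skip the user.
import logging
import time


class PipelineState:
    def __init__(self):
        # set when a stage starts, cleared when it is marked completed/failed
        self.curr_run_ts = None


def run_trip_segmentation_guarded(user_id, curr_state, run_segmentation):
    if curr_state.curr_run_ts is not None:
        # A previous run set curr_run_ts but exited without clearing it
        # (e.g. the task was killed), so the user is skipped on every run
        # until the state is reset.
        logging.error("Found error curr_state.curr_run_ts = %s while processing "
                      "pipeline for user %s, skipping",
                      curr_state.curr_run_ts, user_id)
        return

    curr_state.curr_run_ts = time.time()  # mark the stage as in progress
    try:
        run_segmentation(user_id)
        curr_state.curr_run_ts = None     # 'completed': clear the marker
    except Exception:
        curr_state.curr_run_ts = None     # 'failed': also clears the marker
        raise
```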

shankari commented 5 months ago

This can also sometimes happen if the pipeline takes a long time to run and a new instance is launched before the first one is complete. At least on AWS, new scheduled tasks should be launched only when the previous one completes, but apparently this doesn't always happen.

I typically use `bin/monitor/find_invalid_pipeline_state.py` and `bin/monitor/reset_invalid_pipeline_states.py`. If you let me know which program the user is a part of, I can run those scripts and fix the issue.
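Roughly speaking (this is only a sketch, assuming the pipeline states live in a MongoDB collection with a `curr_run_ts` field; the actual scripts may differ), finding and resetting invalid states boils down to:

```python
# Sketch only: assumes pipeline states are stored in a MongoDB collection
# (called Stage_pipeline_state here) where curr_run_ts is non-null for stages
# that started but were never marked 'completed' or 'failed'. The real
# find_invalid_pipeline_state.py / reset_invalid_pipeline_states.py may differ.
import pymongo


def find_invalid_pipeline_states(db):
    # Stages still marked "in progress" from an earlier run
    return list(db.Stage_pipeline_state.find({"curr_run_ts": {"$ne": None}}))


def reset_invalid_pipeline_states(db):
    # Clearing curr_run_ts lets the next scheduled run pick these stages up again
    return db.Stage_pipeline_state.update_many(
        {"curr_run_ts": {"$ne": None}}, {"$set": {"curr_run_ts": None}})


if __name__ == "__main__":
    db = pymongo.MongoClient("localhost").Stage_database
    stuck = find_invalid_pipeline_states(db)
    print("found %d stuck pipeline state(s)" % len(stuck))
```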

shankari commented 5 months ago

I reset the pipeline for this user, and it looked like it failed again during TRIP_SEGMENTATION.

shankari commented 5 months ago

We checked, and the pipeline for this user is still running, stuck in trip segmentation. It is currently processing data from the 18th.

    2024-05-22 15:33:51,351:DEBUG:140473223255872:sample activities are [{'fmt_time': '2024-05-18T14:15:09.670000-04:00'}]

In parallel, even if another run is launched (perhaps because AWS thought this one was "stuck"), that run will fail in the trip segmentation stage, but the `curr_run_ts` will be cleared when the current run ends, so it should be OK.

Some potential fixes:

shankari commented 5 months ago

Looking more carefully, the trip segmentation just stopped at

    2024-05-07T23:35:25.757Z    2024-05-07 23:35:25,757:DEBUG:139645930166080:finished querying values for ['background/motion_activity'], count = 1
    2024-05-07T23:35:25.758Z    2024-05-07 23:35:25,758:DEBUG:139645930166080:finished querying values for [], count = 0
    2024-05-07T23:35:25.759Z    2024-05-07 23:35:25,758:DEBUG:139645930166080:orig_ts_db_matches = 1, analysis_ts_db_matches = 0

There are no additional logs (from any user) in this log file.

And the next log file starts on May 8 at 00:25.

    2024-05-08T00:25:40.737Z    Found configuration, overriding...
    2024-05-08T00:25:40.740Z    
    2024-05-08T00:25:40.775Z    {
    2024-05-08T00:25:40.775Z    "timeseries": {
    2024-05-08T00:25:40.775Z    "url": 
    2024-05-08T00:25:40.775Z    "result_limit": 250000
    2024-05-08T00:25:40.775Z    }
    2024-05-08T00:25:40.775Z    }
    2024-05-08T00:25:40.775Z    0.0.0.0
    2024-05-08T00:25:40.779Z    {

My best guess is that the task was killed because it was taking up too much memory and then just re-launched at its usual time, which appears to be around :25 after the hour.

[Screenshot: 2024-05-22 at 9:07:08 AM]

We can mitigate some of this by reading only a limited number of entries at a time. It may be that one of the other users whose pipeline was running in parallel read a huge amount of data.
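As a rough illustration (hypothetical helper names, not the current implementation), "reading only a limited number of entries at a time" could mean paginating the timeseries query by timestamp instead of relying on a single large read bounded only by the `result_limit` of 250000 seen in the config above:

```python
# Sketch with hypothetical names: page through the timeseries by timestamp so
# that one user with a huge backlog cannot pull the entire range into memory.
def iter_entries_in_batches(fetch_entries, start_ts, end_ts, batch_size=10000):
    # fetch_entries(after_ts, end_ts, limit) is assumed to return at most
    # `limit` entries with data.ts > after_ts, sorted by data.ts ascending.
    after_ts = start_ts
    while True:
        batch = fetch_entries(after_ts, end_ts, batch_size)
        if not batch:
            break
        for entry in batch:
            yield entry
        if len(batch) < batch_size:
            break  # last (partial) page
        after_ts = batch[-1]["data"]["ts"]  # resume after the last entry seen
```

The segmentation code would then consume entries from an iterator like this one page at a time instead of materializing the whole range in memory.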

So to mitigate: