Open JGreenlee opened 5 months ago
This can also sometimes happen if the pipeline takes a long time to run and a new instance is launched before the first is complete. At least on AWS, new scheduled tasks should be launched only when the previous one completes, but apparently sometimes this doesn't happen.
I typically use bin/monitor/find_invalid_pipeline_state.py and bin/monitor/reset_invalid_pipeline_states.py. If you let me know which program the user is a part of, I can run those scripts and fix the issue.
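For context, the gist of what those two scripts do is roughly the following (a minimal sketch only; the collection name, field names, and connection settings are my assumptions, not copied from the actual scripts):

```python
import pymongo

# Assumed layout: pipeline state lives in a collection where curr_run_ts is
# non-null only while a stage is actively running (or died without cleanup).
client = pymongo.MongoClient("localhost")
state_db = client.Stage_database.Stage_pipeline_state

def find_invalid_states():
    return list(state_db.find({"curr_run_ts": {"$ne": None}}))

def reset_invalid_states(dry_run=True):
    for state in find_invalid_states():
        print(f"user {state['user_id']} stage {state['pipeline_stage']} "
              f"curr_run_ts {state['curr_run_ts']}")
    if not dry_run:
        # Clearing the marker lets the next scheduled run pick the stage up again.
        state_db.update_many({"curr_run_ts": {"$ne": None}},
                             {"$set": {"curr_run_ts": None}})
```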
I reset the pipeline for this user, and it looked like it failed again during TRIP_SEGMENTATION.
We checked, and the pipeline for this user is still running, stuck in trip segmentation. It is currently processing data from the 18th.
2024-05-22 15:33:51,351:DEBUG:140473223255872:sample activities are [{'fmt_time': '2024-05-18T14:15:09.670000-04:00'}]
In parallel, even if another run is launched, maybe because AWS thought that this one was "stuck", that run will fail in the trip segmentation stage, but the curr_run_ts will be cleared when the current run ends, so it should be OK.
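To make the "it should be OK" part concrete, the guard around each stage behaves roughly like this (a sketch of the pattern, not the actual pipeline code; the function and field names are assumptions):

```python
import time

def run_stage(state_db, user_id, stage, stage_fn):
    """Sketch of the per-stage guard: refuse to start if a previous marker is
    still set, otherwise set it, do the work, and clear it on the way out."""
    query = {"user_id": user_id, "pipeline_stage": stage}
    state = state_db.find_one(query) or {}
    if state.get("curr_run_ts") is not None:
        # Matches the symptom in the logs: a leftover curr_run_ts makes every
        # subsequent run skip this user until the marker is cleared or reset.
        raise RuntimeError(
            f"Found error curr_state.curr_run_ts = {state['curr_run_ts']} "
            f"while processing pipeline for user {user_id}, skipping")
    state_db.update_one(query, {"$set": {"curr_run_ts": time.time()}}, upsert=True)
    try:
        stage_fn(user_id)  # the actual stage work, e.g. trip segmentation
        state_db.update_one(query, {"$set": {"last_ts_run": time.time()}})
    finally:
        # Any run that reaches cleanup clears its marker, even on failure.
        state_db.update_one(query, {"$set": {"curr_run_ts": None}})
```

So the only way to be left with a stale curr_run_ts is for the process to die before the cleanup runs, e.g. an OOM kill or the instance being terminated.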
Some potential fixes:
- Run the reset script (bin/monitor/reset_invalid_pipeline_states.py) automatically. I currently run it manually and periodically, but we could run it as a scheduled task, maybe once a day. I have thought about that, but have been concerned about getting into an infinite loop with long-running tasks that take more than a day: at the end of the day we would think the run is stuck and reset it, so it never gets a chance to complete. We may be able to fix this by improving the script and adding additional fail-safes.
  The script checks whether curr_run_ts was set more than X time ago; if it was, it assumes the pipeline is stuck and rolls back to before that point by deleting any entries written after it, since they are expected to be partially done. A sketch of that check, with one possible extra fail-safe, is below.
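Roughly what I have in mind for that extra fail-safe (the 12-hour threshold, the collection handles, and the "still making progress" heuristic are illustrative assumptions, not the current script):

```python
import time

STALE_AFTER_SECS = 12 * 60 * 60  # stand-in for "X time ago"; purely illustrative

def reset_stale_states(state_db, analysis_db, now=None):
    """Sketch: reset stuck pipeline states, but only if they look truly dead."""
    now = now or time.time()
    for state in state_db.find({"curr_run_ts": {"$ne": None}}):
        if now - state["curr_run_ts"] < STALE_AFTER_SECS:
            continue  # could be a legitimately long run; leave it alone
        # Fail-safe against the "reset it every day so it never finishes" loop:
        # if the run has written analysis output recently, assume it is alive.
        recently_written = analysis_db.count_documents(
            {"user_id": state["user_id"],
             "metadata.write_ts": {"$gte": now - 60 * 60}})
        if recently_written > 0:
            continue
        # Roll back partially-done output written after the run started,
        # then clear the marker so the next scheduled run can proceed.
        analysis_db.delete_many(
            {"user_id": state["user_id"],
             "metadata.write_ts": {"$gte": state["curr_run_ts"]}})
        state_db.update_one({"_id": state["_id"]},
                            {"$set": {"curr_run_ts": None}})
```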
Looking more carefully, the trip segmentation just stopped at:
2024-05-07T23:35:25.757Z 2024-05-07 23:35:25,757:DEBUG:139645930166080:finished querying values for ['background/motion_activity'], count = 1
2024-05-07T23:35:25.758Z 2024-05-07 23:35:25,758:DEBUG:139645930166080:finished querying values for [], count = 0
2024-05-07T23:35:25.759Z 2024-05-07 23:35:25,758:DEBUG:139645930166080:orig_ts_db_matches = 1, analysis_ts_db_matches = 0
There are no additional logs (from any user) in this log file. The next log file starts at May 8th 00:25:
2024-05-08T00:25:40.737Z Found configuration, overriding...
2024-05-08T00:25:40.740Z
2024-05-08T00:25:40.775Z {
2024-05-08T00:25:40.775Z "timeseries": {
2024-05-08T00:25:40.775Z "url":
2024-05-08T00:25:40.775Z "result_limit": 250000
2024-05-08T00:25:40.775Z }
2024-05-08T00:25:40.775Z }
2024-05-08T00:25:40.775Z 0.0.0.0
2024-05-08T00:25:40.779Z {
My best guess is that the task was killed because it was taking up too much memory and then just re-launched at its usual time, which appears to be around :25 after the hour.
We can mitigate some of this by reading only a limited number of entries at a time. It may be that the pipeline read a huge amount of data for one of the other users being processed in parallel.
So, to mitigate: cap the number of entries read per query and page through the data instead of reading it all at once.
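Roughly like this (a sketch of chunked reads, not the existing query code; the field names and collection handle are assumptions, and the default chunk size just mirrors the result_limit in the config above):

```python
def read_in_chunks(timeseries_db, user_id, key, chunk_size=250_000):
    """Yield matching entries in bounded batches instead of materializing the
    whole backlog at once, so one user cannot blow up the task's memory."""
    last_ts = 0
    while True:
        batch = list(
            timeseries_db.find({"user_id": user_id,
                                "metadata.key": key,
                                "metadata.write_ts": {"$gt": last_ts}})
                         .sort("metadata.write_ts", 1)
                         .limit(chunk_size))
        if not batch:
            return
        yield batch
        # Note: entries sharing the exact boundary write_ts could be skipped;
        # a real implementation would page on a unique, monotonic field.
        last_ts = batch[-1]["metadata"]["write_ts"]
```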
The user has a 2-week backlog of draft trips, since May 7. I searched for error messages for the user in CloudWatch with the filter expression %ERROR.*uuidOfTheUser% and found this error, which occurs repeatedly (presumably on every pipeline run):
ERROR:139822526199616:Found error curr_state.curr_run_ts = 1715124566.4760673 while processing pipeline for user █████, skipping
The error occurs shortly after
INFO:139822526199616:**********UUID █████: segmenting into trips**********
The first time the error occurred was 2024-05-08T01:49:41.377Z.
My understanding of this is that there's a curr_run_ts left over from the previous time the TRIP_SEGMENTATION stage ran, which means it exited improperly without being marked as 'completed' or 'failed'. I checked the logs of the previous run of the pipeline, but I couldn't find anything out of the ordinary.
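For reproducibility, the CloudWatch search above can also be scripted rather than run from the console; a sketch with boto3 (the function name and log group are placeholders, and this assumes the log group accepts the regex-style %...% filter pattern used above):

```python
import boto3

def find_pipeline_errors(log_group, uuid, start_ms, end_ms):
    """Sketch: pull every ERROR line mentioning the user's UUID, the same
    search as the console filter expression %ERROR.*uuidOfTheUser%."""
    logs = boto3.client("logs")
    paginator = logs.get_paginator("filter_log_events")
    for page in paginator.paginate(logGroupName=log_group,
                                   filterPattern=f"%ERROR.*{uuid}%",
                                   startTime=start_ms,
                                   endTime=end_ms):
        for event in page["events"]:
            print(event["timestamp"], event["message"])
```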