Investigate pipeline stage updates and FUZZ factor in detail

From https://github.com/e-mission/e-mission-docs/issues/806#issuecomment-1269297364

when we delete, we get

2022-10-05 14:29:18,761:DEBUG:4358970880:last_place_enter_ts = 1469231473
2022-10-05 14:29:18,761:DEBUG:4358970880:reset_ts = 1469231473

>>> arrow.get(1469231473).to("America/Los_Angeles")
<Arrow [2016-07-22T16:51:13-07:00]>

and we delete a bunch of entries

2022-10-05 14:29:18,853:INFO:4358970880:About to delete 108 entries
2022-10-05 14:29:18,947:INFO:4358970880:About to delete entries with keys ['analysis/cleaned_place', 'analysis/cleaned_section', 'analysis/cleaned_trip', 'analysis/cleaned_untracked', 'analysis/confirmed_trip', 'analysis/expected_trip', 'analysis/inferred_labels', 'analysis/inferred_trip', 'analysis/recreated_location', 'inference/labels', 'segmentation/raw_place', 'segmentation/raw_section', 'segmentation/raw_trip', 'segmentation/raw_untracked']

Before the reset, the trips were

2022-10-05 14:29:17,939:DEBUG:4358970880:Comparing 2016-07-22T09:26:08-07:00 -> 2016-07-22T11:44:50-07:00 with 2016-07-22T09:26:08-07:00 -> 2016-07-22T11:44:50-07:00
2022-10-05 14:29:17,939:DEBUG:4358970880:Comparing 2016-07-22T16:38:06.693176-07:00 -> 2016-07-22T16:51:13-07:00 with 2016-07-22T16:38:06.693176-07:00 -> 2016-07-22T16:51:13-07:00
2022-10-05 14:29:17,939:DEBUG:4358970880:Comparing 2016-07-22T16:59:46.998718-07:00 -> 2016-07-22T17:05:52.818000-07:00 with 2016-07-22T16:59:46.998718-07:00 -> 2016-07-22T17:05:52.818000-07:00
2022-10-05 14:29:17,939:DEBUG:4358970880:Comparing 2016-07-22T17:12:19.019962-07:00 -> 2016-07-22T17:21:36-07:00 with 2016-07-22T17:12:19.019962-07:00 -> 2016-07-22T17:21:36-07:00

So 16:51 is an eminently fine place to end.

We will then retain this trip because the start is not greater than the reset. And then, because of the lte, the trip is returned along with the newly created ones, and we get our ever-increasing list.

I can even understand why this is not a problem for raw trips, but not sure why it is not for cleaned trips

2022-10-05 14:29:17,939:DEBUG:4358970880:Comparing 2016-07-22T16:38:06.693176-07:00 -> 2016-07-22T16:51:13-07:00 with 2016-07-22T16:38:06.693176-07:00 -> 2016-07-22T16:51:13-07:00

Query to get raw trips while cleaning

2022-10-05 14:29:22,447:DEBUG:4358970880:curr_query = {'user_id': UUID('75403cf7-d128-4349-8834-47b9c5ec375c'), '$or': [{'metadata.key': 'segmentation/raw_trip'}, {'metadata.key': 'segmentation/raw_untracked'}], 'data.start_ts': {'$lte': 1665005357.4372952, '$gte': 1469231473}}, sort_key = data.start_ts

Query to get cleaned trips for label inference

2022-10-05 14:29:23,777:DEBUG:4358970880:curr_query = {'user_id': UUID('75403cf7-d128-4349-8834-47b9c5ec375c'), '$or': [{'metadata.key': 'analysis/cleaned_trip'}], 'data.end_ts': {'$lte': 1665005358.773799, '$gte': 1469231473}}, sort_key = data.end_ts
2022-10-05 14:29:23,779:DEBUG:4358970880:finished querying values for ['analysis/cleaned_trip'], count = 5

So I actually commented out the latter part of the test and manually printed out the raw_trip, cleaned_trip and confirmed_trip and they all have the same

                     start_fmt_time               end_fmt_time
0         2016-07-22T09:26:08-07:00  2016-07-22T11:44:50-07:00
1  2016-07-22T16:40:23.024000-07:00  2016-07-22T16:51:13-07:00

                     start_fmt_time               end_fmt_time
0         2016-07-22T09:26:08-07:00  2016-07-22T11:44:50-07:00
1  2016-07-22T16:38:06.693176-07:00  2016-07-22T16:51:13-07:00

                     start_fmt_time               end_fmt_time
0         2016-07-22T09:26:08-07:00  2016-07-22T11:44:50-07:00
1  2016-07-22T16:38:06.693176-07:00  2016-07-22T16:51:13-07:00

So why do we not get duplicate cleaned trips after the reset?
how does all this interact with the FUZZ factor from https://github.com/e-mission/e-mission-docs/issues/191#issuecomment-312917119?

Have worked around this for now in https://github.com/e-mission/e-mission-server/pull/879, but need a more principled investigation.

One more pipeline investigation + potential fix. When we reset the pipeline, we set the raw_places array in the last cleaned place to empty, with the comment:

Note that we need to reset the raw_place array since it will be repopulated with new squished places when the timeline after the entry to this place is reconstructed

But is that actually correct? We do in fact still have the last raw place retained.

    reset_ts, last_cleaned_place, last_raw_place = get_reset_ts(user_id, last_cleaned_place, is_dry_run)

    # clear all analysis results after it
    del_objects_after(user_id, reset_ts, is_dry_run)

    # open the raw and cleaned places
    reset_last_place(last_cleaned_place, is_dry_run)
    reset_last_place(last_raw_place, is_dry_run)

Should we maintain at least that link?

Note that if we don't maintain that link, then running back to back resets fails because there are no raw places. And we have to implement a hack where we find that raw place and fill it in temporarily in memory.

While running with that fix, encountered another error

Traceback (most recent call last):
  File "bin/monitor/reset_invalid_pipeline_states.py", line 25, in <module>
    reset_all_invalid_state(args)
  File "bin/monitor/reset_invalid_pipeline_states.py", line 15, in reset_all_invalid_state
    epr.auto_reset(args.dry_run, args.only_calc)
  File "/Users/kshankar/e-mission/nrel-db-connect/emission/pipeline/reset.py", line 357, in auto_reset
    reset_user_to_ts(invalid_state['user_id'], invalid_state['reset_ts'], dry_run)
  File "/Users/kshankar/e-mission/nrel-db-connect/emission/pipeline/reset.py", line 73, in reset_user_to_ts
    reset_ts, last_cleaned_place, last_raw_place = get_reset_ts(user_id, last_cleaned_place, is_dry_run)
  File "/Users/kshankar/e-mission/nrel-db-connect/emission/pipeline/reset.py", line 99, in get_reset_ts
    ending_raw_trip = esda.get_entry(esda.RAW_TRIP_KEY, ending_trip["data"]["raw_trip"])
KeyError: 'raw_trip'

Not sure how we have a cleaned trip without a raw trip; maybe there was an error while saving the cleaned trip? We can work around by just resetting to an earlier time; but let's investigate a bit more before doing that.

Another issue with the pipeline: if sectioning steps fail, but the subsequent clean_and_resample passes, then we will end up with unknown and uncleaned trips. It didn't actually happen in this case, but it is a concern??!

Concretely:

segmenting trips completes (last_processed_ts = (say) 2022-10-01)
segmenting sections errors out (last_processed_ts = (say) 2022-09-30)
clean and resample reads raw trips and finds them but does not find any sections. Let's say that it completes successfully. At this point, if we reset the pipeline, we will do so only to the last cleaned place. When we run the section segmentation, it will complete successfully, but the new sections will not be incorporated in the cleaned trips because they are done.

e-mission / e-mission-docs

Investigate pipeline stage updates and FUZZ factor in detail #808