e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

No processed trips for a few days; buildup of draft trips #942

Closed JGreenlee closed 1 year ago

JGreenlee commented 1 year ago

I had a busy day of travel on Saturday (3 days ago).

My last trip showing up as processed was 3:34-3:44 PM. All trips from that point onward (approx. 30) are draft trips.

Processing should happen once every hour and we should never see a buildup of unprocessed travel this long. This could be a sync issue on the phone, or it could be a processing issue on the server.

JGreenlee commented 1 year ago

Local logs indicate an error is being thrown in the sync plugin: ServerSyncAdapter : IO Error java.io.IOException while posting converted trips to JSON, which is coming from https://github.com/e-mission/cordova-server-sync/blob/01c4e4cb7cd9bfd1b86f451d184ae187139ab7e2/src/android/ServerSyncAdapter.java#L148

This could be the cause; however, other recent logs indicate that sync is still happening successfully.

JGreenlee commented 1 year ago

I'll also note that there was significant travel in the gap between 3:44 and 8:42 PM. I remember my phone battery was low most of the day, so battery saver was probably on and off at different parts of the day.

shankari commented 1 year ago

I remember my phone battery was low most of the day, so battery saver was probably on and off at different parts of the day.

It would be good to see if this is in fact the issue. I have been testing with battery saver off and on (#940) and the trajectories may be a bit worse, but I don't see any big gaps in sensing.

shankari commented 1 year ago

To clarify, this was because the servers had been very unstable all last week and were crashing multiple times. One of the crashes must have occurred while the pipeline for this user was running. Once it crashed mid-run, all subsequent runs would crash because curr_run_ts was still set.

I reset the pipeline and the user was able to see their trips.

shankari commented 1 year ago

This occurred again yesterday. The servers crashed at around 4:18pm and then again at around 5:14pm.

Since this is the second time this has happened, I wrote a script to iterate through all the active production deployments and check how many invalid pipeline states we had.

import os
import re
import copy
import json

# PROD_LIST: list of active production deployment names, defined elsewhere in
# the environment where this runs
storage_conf_template = json.load(open("conf/storage/db.conf"))
print(storage_conf_template.keys())

for prod in PROD_LIST:
    prod_db_name = prod.replace("-", "_")
    print(f"Resetting pipeline for {prod=} with {prod_db_name=}")
    # point the storage config at this deployment's database
    curr_storage_conf = copy.deepcopy(storage_conf_template)
    curr_storage_conf['timeseries']['url'] = curr_storage_conf['timeseries']['url'].replace("REPLACEME", prod_db_name)
    print(f"After resetting, {storage_conf_template=}, {curr_storage_conf=}")
    json.dump(curr_storage_conf, open("conf/storage/db.conf", "w"))
    os.system("echo conf/storage.db.conf")
    # check how many invalid pipeline states this deployment has
    check_reset_cmd = "./e-mission-py.bash bin/monitor/reset_invalid_pipeline_states.py -c"
    os.system(check_reset_cmd)

# restore the original template so the config does not stay pointed at the last prod DB
json.dump(storage_conf_template, open("conf/storage/db.conf", "w"))
shankari commented 1 year ago

The number of users that need to be reset varies from ~1 to ~5. Starting with those, in an attempt to spread out the load. I wish it were possible to check the CPU load directly...

shankari commented 1 year ago

Cloud Services indicated that the issue was related to high CPU load. Let us look at some of the long tasks that they flagged. One of the long tasks was from denver_casr.

Looking at the logs for one hour (attached), we notice that:

$ wc -l /tmp/one_hour_of_denver_casr
     126 /tmp/one_hour_of_denver_casr

However, 94 of them did not process any data:

$ grep unchanged /tmp/one_hour_of_denver_casr | wc -l
      94

and only 31 actually had data to process:

$ grep "last_ts_processed = " /tmp/one_hour_of_denver_casr | wc -l
      31

Even the ones that did not have data to process took about 30 seconds each:

"2023-08-10 23:41:35,463:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:42:23,039:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:42:58,393:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:43:19,324:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:43:52,610:INFO:140488176322368:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:44:05,465:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"

But that means that it must have taken ~ 47 minutes just to process and skip those users with no data:

>>> (30 * 94) / 60
47

A simple fix seems to be that, if the user has no incoming data (e.g. nothing moved to long_term), we can skip the rest of the pipeline.

The problem with that approach is that if there are downstream pipeline stages that have not fully processed - e.g. because of a crash followed by a pipeline reset - then those stages will not run again until new data comes in. And if there is no new data coming in (e.g. the user has uninstalled the app), then we will never reprocess the data.

However, it seems like a fairly simple fix is to pass in a flag indicating whether to bail out early or not. We can set it to true (bail out early) on production for now, and consider running with the flag turned off (don't bail early) on a separate "catch up" AWS instance. Or turn the flag off after we discuss other options for scalability.... Or...
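A minimal sketch of that flag, with placeholder names for the stages (this is not the actual intake pipeline code, just the shape of the early bail-out):

import logging

def run_intake_pipeline_for_user(user_id, stages, count_new_entries, skip_if_no_new_data=True):
    # count_new_entries(user_id) stands in for the USERCACHE stage and returns
    # how many entries were moved to long_term; stages is a list of the
    # remaining per-user stage functions. Both are placeholders for the real
    # intake pipeline structure.
    moved_count = count_new_entries(user_id)
    if skip_if_no_new_data and moved_count == 0:
        logging.info("No new data for %s, skipping rest of pipeline" % user_id)
        return
    for stage in stages:
        stage(user_id)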

one_hour_of_denver_casr.gz

shankari commented 1 year ago

This is a super simple fix, which I plan on pushing to staging tomorrow.

shankari commented 1 year ago

While we are looking at early returns from the pipeline, let's see if we can do something similar for the trip model as well. If we look at the Durham logs, for example, the pipeline ran for 30 mins even though there was no actual data coming in.

2023-08-10 02:00:52,228
2023-08-10 02:24:41,112

The current trip model pipeline is not incremental, so we cannot look to see if there are "new" trips. However, it is also run once a day, so if the last trip is more than a day old (or ~ 3 days old to add in some margin) we can do an early return.

shankari commented 1 year ago

The logs for Durham are:

2023-08-10 02:00:52,228:INFO:building model for user
2023-08-10 02:00:52,228:DEBUG:getting key model_type in config
2023-08-10 02:00:52,228:DEBUG:getting key model_storage in config
2023-08-10 02:00:52,228:DEBUG:getting key minimum_trips in config
2023-08-10 02:00:52,228:DEBUG:getting key model_parameters.greedy in config
2023-08-10 02:00:52,228:DEBUG:GreedySimilarityBinning loaded model config from file
2023-08-10 02:00:54,309:DEBUG:retrieved latest trip model recorded at timestamp 1691546460.3530114
2023-08-10 02:00:54,369:INFO:For stage PipelineStages.TRIP_MODEL, start_ts = 2022-11-01T23:29:05.438816
2023-08-10 02:00:54,389:DEBUG:time query for training data collection: None
2023-08-10 02:00:54,993:DEBUG:found 38 labeled trips for user
2023-08-10 02:00:56,036:DEBUG:upsert_doc called with key inference/trip_model
2023-08-10 02:00:56,292:DEBUG:GREEDY_SIMILARITY_BINNING label prediction model built for user with timestamp 1667345340.438816
2023-08-10 02:00:56,311:INFO:building model for user ....

So the run took ~ 4 seconds and ~ 2 seconds were spent in loading the model. Unfortunately, we load the model before loading the trips.

So we have two choices:

  1. add a new check before reading the model to get the last recorded trip (by using ts.get_first_value_for_field)
  2. move the trip load before the model load

The first feels more hacky, but is also likely to be more performant, especially for longer data collection durations, since we are not currently using an incremental model. Once we switch to an incremental model, we might want to implement (2) since it is more principled and less hacky.
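A sketch of option (1), assuming we gate the daily model build on the age of the most recent confirmed trip; the ~3 day cutoff and the "no trips" sentinel handling are assumptions:

import time
import pymongo
import emission.storage.timeseries.abstract_timeseries as esta

MAX_MODEL_DATA_AGE = 3 * 24 * 60 * 60  # ~3 days, to add some margin

def has_recent_trips(user_id):
    # Look up only the end_ts of the most recent confirmed trip; this avoids
    # loading the (potentially large) trip model when there is nothing new.
    ts = esta.TimeSeries.get_time_series(user_id)
    last_trip_end = ts.get_first_value_for_field("analysis/confirmed_trip",
                                                 "data.end_ts", pymongo.DESCENDING)
    if last_trip_end is None or last_trip_end == -1:  # sentinel for "no trips" may differ
        return False
    return (time.time() - last_trip_end) < MAX_MODEL_DATA_AGE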

@humbleOldSage for visibility and to keep in mind while switching models in the future.

shankari commented 1 year ago

Cloud services bumped up the server size, so I reset everything and reran. As of yesterday, there were just some slow-running pipelines for masscec and open access.

But today, we ran into a new entry in ccebikes which was stuck at the USER_INPUT stage.

On checking, it really was taking a lot of time to process the user inputs (I guess the user labeled a lot of trips at once?). I was monitoring the run and suddenly noticed that the task seemed to hang/terminate, so I assumed that the tasks were terminating randomly.

However, there were a few more logs when I came back to take a look at it, and it looked like the saves were taking ~ 20 minutes each:

1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:update called"
1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06524fc6a934b4ab56230'), 'metadata': {'key': 'analysis/confirmed_trip',

1691861981828,"2023-08-12 17:39:41,825:DEBUG:139623946450752:update called"
1691861981828,"2023-08-12 17:39:41,828:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06598fc6a934b4ab56234'), 'metadata': {'key': 'analysis/composite_trip',

1691861982575,"2023-08-12 17:39:42,575:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...

The really weird thing is that the confirmed trip saves were taking 20 mins, but the subsequent composite trip saves were taking almost no time. What are we doing between those two calls?

1691863387241,"2023-08-12 18:03:07,240:DEBUG:139623946450752:update called"
1691863387241,"2023-08-12 18:03:07,240:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64ce9c41618d03e1e6a41e95'), 'user_id': UUID('1355e661-e7fa-4450-aae8-dd828fca35e6'), 'metadata': {'key': 'analysis/confirmed_trip',

1691864075195,"2023-08-12 18:14:35,195:DEBUG:139623946450752:update called"
1691864075197,"2023-08-12 18:14:35,197:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64ce9dac618d03e1e6a41e98'), 'metadata': {'key': 'analysis/composite_trip',

1691864075717,"2023-08-12 18:14:35,717:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...
shankari commented 1 year ago

So the update and save calls happen together:

        logging.debug("update called")
        ts = esta.TimeSeries.get_time_series(entry.user_id)
        logging.debug("Saving entry %s into timeseries" % entry)
        edb.save(ts.get_timeseries_db(entry.metadata.key), entry)

I assume that the save is not slow in general, since the composite trip call is fast, although the composite trip is a bigger object. We need to see what (if anything) we are doing between the two saves.

shankari commented 1 year ago

Bingo! Here's the code.

We first update the confirmed object, and then we find the composite trip by querying for the trip whose confirmed_trip field points to this confirmed object:

    estbt.BuiltinTimeSeries.update(confirmed_obj)
    # we update confirmed object in the pipeline, but they are no longer the terminal data structures
    # that we display. So when the confirmed objects change, we need to find the corresponding
    # terminal objects (composite trips) and update them as well
    # if we don't find a matching composite trip, we don't need to do anything
    # since it has not been created yet and will be created with updated values when we get to that stage
    if confirmed_obj["metadata"]["key"] in [esda.CONFIRMED_TRIP_KEY, esda.CONFIRMED_UNTRACKED_KEY]:
        composite_trip = edb.get_analysis_timeseries_db().find_one({"data.confirmed_trip": confirmed_obj.get_id()})
        if composite_trip is not None:
            # copy over all the fields other than the end_confimed_place
            EXCLUDED_FIELDS = ["end_confirmed_place"]
            for k in confirmed_obj["data"].keys():
                if k not in EXCLUDED_FIELDS:
                    composite_trip["data"][k] = confirmed_obj["data"][k]
            estbt.BuiltinTimeSeries.update(ecwe.Entry(composite_trip))
        else:
            logging.debug("No composite trip matching confirmed trip %s, nothing to update" % confirmed_obj["_id"])

But there is no index on the field data.confirmed_trip!! So that query is taking ~20 mins each on some of the databases. For reference, the existing analysis indices are:

def _create_analysis_result_indices(tscoll):
    tscoll.create_index([("metadata.key", pymongo.ASCENDING)])

    # trips and sections
    tscoll.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
    tscoll.create_index([("data.end_ts", pymongo.DESCENDING)], sparse=True)
    _migrate_sparse_to_dense(tscoll, "data.start_loc_2dsphere")
    _migrate_sparse_to_dense(tscoll, "data.end_loc_2dsphere")
    tscoll.create_index([("data.start_loc", pymongo.GEOSPHERE)])
    tscoll.create_index([("data.end_loc", pymongo.GEOSPHERE)])
    _create_local_dt_indices(tscoll, "data.start_local_dt")
    _create_local_dt_indices(tscoll, "data.end_local_dt")

@JGreenlee for visibility

shankari commented 1 year ago

That stage finally completed and the thread has moved on. It doesn't show up as needing a reset at this point.

shankari commented 1 year ago

There are a couple of potential fixes for the above:

  1. We can add an index for the data.confirmed_trip field (a sketch follows this list).
  2. We can load all composite trips for the specified time range, and then just match in memory.

(1) seems easier at first glance, but I think that (2) is actually better because we should generally optimize timeseries accesses over field-based links.
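For reference, option (1) would be a one-line addition to _create_analysis_result_indices above (sketch only; not necessarily the fix we will go with):

    # option (1): sparse, since only composite trips carry a data.confirmed_trip field
    tscoll.create_index([("data.confirmed_trip", pymongo.ASCENDING)], sparse=True)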
shankari commented 1 year ago

I also noticed that several of the users are stuck in stage 2 (section segmentation), not just trip segmentation. Checking the logs, it looks like several calls which involve indices are taking multiple minutes to run:

2023-08-13 16:28:15,503:DEBUG:139939849852736:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID(...), '$or': [{'metadata.key': 'background/motion_activity'}], 'data.ts': {'$lte': 1671926016.0008862, '$gte': 1671925742.2325292}}, sort_key = data.ts

2023-08-13 16:31:20,377:DEBUG:139939849852736:curr activity = Motionactivity({'unknown': False, 'cycling': False, 'automotive': False, 'ts': 1671925801.0545607, 'confidence': 100, 'stationary': False, 'walking': False, 'running': False, 'type': 9, 'confidence_level': 'high', 'fmt_time': '2022-12-24T18:50:01.054561-05:00', 'local_dt_year': 2022, 'local_dt_month': 12, 'local_dt_day': 24, 'local_dt_hour': 18, 'local_dt_minute': 50, 'local_dt_second': 1, 'local_dt_weekday': 5, 'local_dt_timezone': 'America/New_York', '_id': ObjectId('63a7910680ea0c211d6aeb81'), 'user_id': ..., 'metadata_write_ts': 1671925801.0545607}), returning False

Because of this, for a particular masscec user, although the trip segmentation is at July 2023, the section segmentation is still stuck at September 2022.

Even looking up one entry by the timestamp, even though data.ts is indexed, is taking ~30 seconds:

2023-08-13 16:54:14,216:DEBUG:140481331697472:get_entry_at_ts query = {'user_id': UUID('...'), 'metadata.key': 'background/filtered_location', 'data.ts': 1663716286.000039}

2023-08-13 16:54:35,776:DEBUG:140481331697472:get_entry_at_ts result = {'_id': ObjectId('63321b30aa2dde12ed3db589'), 'user_id': UUID('...'), 'metadata': {'time_zone': 'America/New_York', 'plugin': 'none', 'platform': 'ios', 'read_ts': 0, 'key': 'background/filtered_location', 'type': 'sensor-data',...
shankari commented 1 year ago

Documenting design decisions: for https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1674108246, if there are no new entries, we don't modify the pipeline state at all. That seems fairly consistent and neat: if the pipeline state didn't exist before, it won't be created, and the last_run_ts won't change.

shankari commented 1 year ago

wrt

We can also load all composite trips for the specified time range, and then just match in memory. (1) seems easier at first glance, but I think that (2) is actually better because we should generally optimize timeseries accesses over field-based links

from https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676108597, this is a bit tricky because the range of affected confirmed/composite trips can be anything - people can go back and label trips for a year ago today.

So we can fix this by first determining the confirmed trips that are affected, and then pulling all the composite trips in that time range and matching them up. This can still have bad performance for a user who (a) has a lot of trips, and (b) labels trips across a wide range (e.g. one trip a year ago and one trip from a week ago).

An alternative is to find the composite trip using the start_ts and not the id - the start_ts for the composite trip is guaranteed to be the same as the start_ts for the confirmed trip.

Let's go with the second approach. It does not have the problem with widely separated trips, and it does not change the code structure. It simply replaces a call on a non-indexed field with a call on an indexed field.

shankari commented 1 year ago

Digression to investigate some scalability improvements from https://github.com/e-mission/e-mission-docs/issues/948

Notably, we frequently take ~ 10 seconds for the first call to complete_ts:

1691900798386,"2023-08-13 04:26:38,385:DEBUG:139761210832448:START POST /pipeline/get_range_ts"
1691900798386,"2023-08-13 04:26:38,386:DEBUG:139761210832448:methodName = skip, returning <class 'emission.net.auth.skip.SkipMethod'>"
1691900798386,"2023-08-13 04:26:38,386:DEBUG:139761210832448:Using the skip method to verify id token ... of length 36"
1691900798394,"2023-08-13 04:26:38,394:DEBUG:139761210832448:retUUID = ...

1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:Returning complete_ts = 1691889354.233"
1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:Returning range (1659322488.216, 1691889349.233)"
1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:END POST /pipeline/get_range_ts ... 10.11975884437561 "

although subsequent calls are faster.

1691900834916,"2023-08-13 04:27:14,916:DEBUG:139761200342592:START POST /pipeline/get_range_ts"

1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:Returning complete_ts = 1691889354.233"
1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:Returning range (1659322488.216, 1691889349.233)"
1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:END POST /pipeline/get_range_ts"

This is super important since this is the first call for every retrieval and having this take 10 seconds is not really acceptable.

I had originally assumed that this was from the complete_ts call, which made no sense, since it pulls that data directly from the pipeline states.

complete_ts = esp.get_complete_ts(user_id)

However, I now see that we first get the range from the trips and then get the complete_ts, so it could be either, and it is far more likely to be the first one:

    ts = esta.TimeSeries.get_time_series(user_id)
    start_ts = ts.get_first_value_for_field("analysis/composite_trip", "data.start_ts", pymongo.ASCENDING)
...
    end_ts = ts.get_first_value_for_field("analysis/composite_trip", "data.end_ts", pymongo.DESCENDING)
...
    complete_ts = get_complete_ts(user_id)

If this is causing the slowdown, and the DocumentDB team is not able to fix it, we should probably fix it by pre-caching the pipeline range at the end of the pipeline run.

@JGreenlee for visibility

shankari commented 1 year ago

On a related note, and looking at https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676410882, where we were taking 30 seconds for a single lookup by an indexed field (data.ts), I wrote a simple script to print out some basic statistics about DB sizes and access times (attached).

I then ran it against all the production databases by hacking/expanding https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1674063739.

The resulting CSV is below, and also attached.

program raw_timeseries_size analysed_timeseries_size len_entries_in_range retrieve_location_range_time retrieve_single_entry_with_id_time retrieve_single_entry_with_ts_time
0 open_access 30634701 1841576 42362 315.825 0.017324 11.5839
1 nrel_commute 9115634 507854 6754 98.9754 0.0117631 27.0977
2 mm_masscec 22649423 966169 70780 221.948 0.0120975 152.7
3 denver_casr 10273935 1063449 259431 255.146 0.163944 1.2389
4 smart_commute_ebike 2860865 264346 38506 51.0785 0.132923 1.36704
5 uprm_nicr 5272631 264194 31674 93.4192 0.151773 11.4744
6 ebikes_for_essentials 2471371 268972 18151 35.7896 0.0820957 2.86416
7 cortezebikes 1643523 215805 18645 32.3265 0.0976193 3.7595
8 ccebikes 16812728 1226622 325829 281.777 0.0258702 0.816994
9 ebikegj 5614952 437830 153270 107.633 0.0172236 2.79284
10 choose_your_ride 1730103 201746 42847 25.191 0.0126362 0.524229
11 ebikethere_garfield_county 4587180 422702 117008 68.4168 0.173241 1.49514
12 ca_ebike 563138 57894 17752 8.41705 0.00962978 0.582538
13 fortmorgan 521101 75260 11502 5.21246 0.0204601 0.780257
14 usaid_laos_ev 186668 14350 3360 1.3242 0.0217177 1.44233

scalability_results_round_1.csv program_size_and_query_time.py.gz

shankari commented 1 year ago

Here's the same result in graph form. We can see that:

Image

shankari commented 1 year ago

Following up on https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1677846253, I added in the time to find the first trip and the last trip for each program.

New results are:

program raw_timeseries_size analysed_timeseries_size len_entries_in_range retrieve_location_range_time retrieve_single_entry_with_id_time retrieve_single_entry_with_ts_time retrieve_first_trip_for_user_time retrieve_last_trip_for_user_time retrieve_complete_ts_for_user_time
0 open_access 30640974 1860401 42362 332.034 0.0159355 3.60054 4.22115 0.0400267 0.0332904
1 nrel_commute 9119362 508326 6754 106.926 0.0932374 21.7889 9.92843 0.0943718 0.0137558
2 mm_masscec 22659958 966607 70780 271.97 0.0239395 24.9989 3.01057 0.0608934 0.0427971
3 denver_casr 10298147 1066292 259431 128.11 0.0913577 0.845236 1.60049 0.0179492 0.0594237
4 smart_commute_ebike 2862420 264853 38506 33.4999 0.223669 6.98352 7.37743 0.0492139 0.033423
5 uprm_nicr 5273248 264194 31674 49.5853 0.108283 1.48983 2.37829 0.0183116 0.037815
6 ebikes_for_essentials 2482810 269674 18151 28.2285 0.051417 1.06774 1.12528 0.015482 0.0478389
7 cortezebikes 1648025 216873 18645 22.0812 0.0126098 0.971451 4.94458 0.0265731 0.0451303
8 ccebikes 16840737 1228313 325829 227.473 0.0124022 1.46591 3.92774 0.0801908 0.0451017
9 ebikegj 5628333 440100 153270 87.0848 0.0145204 2.19927 1.78379 0.0193664 0.04728
10 choose_your_ride 1744706 203427 42847 19.7551 0.0862026 0.310989 0.86218 0.0516913 0.0433836
11 ebikethere_garfield_county 4595259 423523 117008 56.8704 0.114763 0.547563 4.08938 0.0420983 0.046549
12 ca_ebike 563410 57894 17752 6.69122 0.0187379 1.55731 1.56463 0.0285704 0.0411629
13 fortmorgan 522200 75545 11502 4.46128 0.0150901 2.60688 2.51811 0.0274105 0.0387707
14 usaid_laos_ev 186902 14350 3360 0.894163 0.0133026 0.362137 0.448559 0.00895064 0.014204

scalability_results_round_2.csv

shankari commented 1 year ago

New results are:

Notably:

Image

We use the pipeline range_ts even for the current production version, so it is critical to get this out into production ASAP.

shankari commented 1 year ago

Given that retrieve_single_entry_with_ts is sometimes taking ~ 150 secs ≈ 2.5 mins, I wondered if it was worthwhile to replace the implementation. Then I checked, and it appears to take ~10 minutes without the index. Ok, so we do need it 😄 https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676052328

1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:update called"
1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06524fc6a934b4ab56230'), 'metadata': {'key': 'analysis/confirmed_trip',

1691861981828,"2023-08-12 17:39:41,825:DEBUG:139623946450752:update called"
1691861981828,"2023-08-12 17:39:41,828:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06598fc6a934b4ab56234'), 'metadata': {'key': 'analysis/composite_trip',

1691861982575,"2023-08-12 17:39:42,575:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...
shankari commented 1 year ago

So this change https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1677776874 broke the unit tests, since we copy over all fields (including the start and end timestamps) when updating the composite trip with the confirmed trip. So if the confirmed trip start_ts has been modified, then the composite trip will no longer match.

        composite_trip = ts.get_entry_at_ts("analysis/composite_trip", "data.start_ts", confirmed_obj.data.start_ts)
        if composite_trip is not None:
            # copy over all the fields other than the end_confimed_place
            EXCLUDED_FIELDS = ["end_confirmed_place"]
            for k in confirmed_obj["data"].keys():
                if k not in EXCLUDED_FIELDS:
                    composite_trip["data"][k] = confirmed_obj["data"][k]
            estbt.BuiltinTimeSeries.update(ecwe.Entry(composite_trip))

There are two potential fixes:

As far as I can make out, grep -r update_confirmed_and_composite emission only returns two call sites, both of which are in emission/analysis/userinput/matcher.py, and one of them is purely focused on the last_trip. In the other, we fill in the single and multiple inputs for the trip. So why do we need to copy all the data over? Why not just the user inputs?

    for ui in toMatchInputs:
        confirmed_obj = esdt.get_confirmed_obj_for_user_input_obj(ts, ui)
        if confirmed_obj is not None:
            if ui.metadata.key in single_most_recent_match_key_list:
                handle_single_most_recent_match(confirmed_obj, ui)
            elif ui.metadata.key in multi_non_deleted_key_list:
                handle_multi_non_deleted_match(confirmed_obj, ui)
            else:
                assert False, "Found weird key {ui.metadata.key} that was not in the search list"
            update_confirmed_and_composite(ts, confirmed_obj)
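A sketch of the narrower copy, reusing the aliases from the snippets above (ts, estbt, ecwe); the field names "user_input" and "additions" are assumed from the discussion, not checked against the schema:

def copy_user_inputs_to_composite(ts, confirmed_obj):
    # Drop-in variant of the copy loop above: propagate only the user-input
    # related fields instead of the whole confirmed trip data dict, so that
    # start_ts/end_ts (and other trip fields) are never overwritten.
    USER_INPUT_FIELDS = ["user_input", "additions"]  # assumed field names
    composite_trip = ts.get_entry_at_ts("analysis/composite_trip",
                                        "data.start_ts", confirmed_obj.data.start_ts)
    if composite_trip is not None:
        for k in USER_INPUT_FIELDS:
            if k in confirmed_obj["data"]:
                composite_trip["data"][k] = confirmed_obj["data"][k]
        estbt.BuiltinTimeSeries.update(ecwe.Entry(composite_trip))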
shankari commented 1 year ago

I looked up the commit, and it was indeed designed to only fill in the user inputs and additions https://github.com/e-mission/e-mission-server/commit/149025b12c66e3b9708dfd505bc670a08d06fd69

@JGreenlee can you think of any other fields that need to be updated? If not, I will change the unit test, finish this change, and move on to the pipeline range fix.

JGreenlee commented 1 year ago

As far as I know, those are the only things that can be changed. Unless we plan to support trip editing anytime soon ;)

JGreenlee commented 1 year ago

Unless inferences can change?

shankari commented 1 year ago

Where do we store the pipeline range?

The obvious option (1) would be to store it in the pipeline_states collection, because it represents the range of the pipeline states. But it won't actually be in the format of the PipelineStage with curr_run_ts etc.; mongo doesn't care about this, but future databases, and the export, might.

What we really want to do is to store one pair per user and update it as the pipeline runs. Where do we store the per-user updateable information? In the profile! So an alternate option (2) would be to store it in the profile.

I vastly prefer (2) because of consistency. We can still support the fallbacks (composite -> confirmed -> cleaned) in that case by updating the profile as each stage of the pipeline completes.
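A minimal sketch of option (2), assuming the per-user profile is reachable via edb.get_profile_db() and that a helper along these lines runs at the end of the pipeline; the names and the stored document shape are illustrative:

import pymongo
import emission.core.get_database as edb
import emission.storage.timeseries.abstract_timeseries as esta

def store_pipeline_range_in_profile(user_id):
    # Recompute the composite trip range once, after the pipeline run for this
    # user, and cache it in the per-user profile so /pipeline/get_range_ts
    # becomes a cheap profile lookup instead of two timeseries scans.
    ts = esta.TimeSeries.get_time_series(user_id)
    start_ts = ts.get_first_value_for_field("analysis/composite_trip",
                                            "data.start_ts", pymongo.ASCENDING)
    end_ts = ts.get_first_value_for_field("analysis/composite_trip",
                                          "data.end_ts", pymongo.DESCENDING)
    edb.get_profile_db().update_one(
        {"user_id": user_id},
        {"$set": {"pipeline_range": {"start_ts": start_ts, "end_ts": end_ts}}},
        upsert=True)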

shankari commented 1 year ago

Unless inferences can change?

We make one pass through the pipeline, filling in inferences. So once we have filled in inferences, we will not change them in general. If we changed inferences now, we would not account for them, since we are not calling update_confirmed_and_composite after any changes.

I did note, while adding the section summaries, that there is a current hole in which:

But when we deal with that by handling the downstream propagation to the confirmed trips, we can also deal with further propagation to composite trips - it will just be one more set of entries to copy.

shankari commented 1 year ago

@JGreenlee final question on pipeline range - we currently have a backup solution, where if we don't find composite trips, we fall back to confirmed trips and if we don't find those, we fall back to cleaned trips.

https://github.com/e-mission/e-mission-server/blob/8761f4afce635bc4cc7ff1732d9f00956cb5c4ad/emission/net/api/pipeline.py#L21

I added this code in https://github.com/e-mission/e-mission-server/commit/014df2dbe905feb18766d379442427ff26d93132 and I basically had it in mind as a backwards compat solution - if any community members have upgraded the client but not upgraded the server, it would be helpful to support them as well. And we weren't really doing anything super fancy with confirmed trips at the time, so cleaned trip would display in the UI just fine.

But with the new version of the client, we only support composite trips, so they have to upgrade the server as well. So I'm planning to remove the fallbacks.

  1. It will simplify the code dramatically, and
  2. it will fix a hole in which the cleaning stage (say) works and has processed up to t5, but there is an error in the confirmed trip creation at trip t3, so we only have confirmed and composite trips up to t2 (from a previous run). With the current implementation, our pipeline range will appear to end at t5, so we will only query for draft trips after t5. But we don't use cleaned trips in the display, so it will look like trips t3, t4 and t5 have vanished mysteriously.

Thoughts?

JGreenlee commented 1 year ago

I am fine with that. At least on the phone, it would be pretty burdensome to support cleaned or confirmed objects. And for type safety, I think it's better to be more or less deterministic as to what kind of objects are being received from the server.

I think it's ok to simplify this and make maintenance easier. If anyone in the community still wants to use an old version of the server, they can do so with the pre-React phone code (or their own pre-React fork).

shankari commented 1 year ago

Also removing the check against the complete_ts, since it was a sanity check added along with the new method, and it has not been invoked even once in ~ 1.5 years of the open_access and nrel_commute programs. (I am not sure where the search through the logs currently is, but it has been running for a few hours and we haven't found any matches.)

Image Image

shankari commented 1 year ago

Finally, there are other places where we look up entries by fields that are not indexed. Notably, in the newly added get_section_summary, we call emission/storage/decorations/trip_queries.py's get_sections_for_trip, which searches by trip_id, which is not indexed:

def get_sections_for_trip(key, user_id, trip_id):
    # type: (UUID, object_id) -> list(sections)
    """
    Get the set of sections that are children of this trip.
    """
    query = {"user_id": user_id, "data.trip_id": trip_id,
             "metadata.key": key}
    logging.debug("About to execute query %s with sort_key %s" % (query, "data.start_ts"))
    section_doc_cursor = edb.get_analysis_timeseries_db().find(query).sort(
        "data.start_ts", pymongo.ASCENDING)
    return [ecwe.Entry(doc) for doc in section_doc_cursor]

However, we don't have any logs to determine whether this is slow, and to quantify the improvement (if any) of the change. So I am adding logs for now, and will change it later if it turns out to be an issue, especially for people with lots of trips.
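For reference, a sketch of the instrumented version; the timing log is the only addition to the function above:

import time
import logging
import pymongo
import emission.core.get_database as edb
import emission.core.wrapper.entry as ecwe

def get_sections_for_trip(key, user_id, trip_id):
    """
    Get the set of sections that are children of this trip.
    """
    query = {"user_id": user_id, "data.trip_id": trip_id,
             "metadata.key": key}
    start_query = time.time()
    section_doc_cursor = edb.get_analysis_timeseries_db().find(query).sort(
        "data.start_ts", pymongo.ASCENDING)
    sections = [ecwe.Entry(doc) for doc in section_doc_cursor]
    # log the count and elapsed time so we can see how slow the non-indexed
    # data.trip_id lookup is for users with many trips
    logging.debug("get_sections_for_trip: %d sections for trip %s took %.2f secs" %
                  (len(sections), trip_id, time.time() - start_query))
    return sections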

shankari commented 1 year ago

Finally, steps to deploy the changes.

Now, I'm getting an error while creating composite trips for 13 of the users.

There are two possible fixes:

The second option seems better to avoid the gap, but we do still need to decide what to do about the weird composite trip errors.

shankari commented 1 year ago

We have an existing script called

./bin/debug/delete_composite_objects_and_state.py

which takes in a list of uuids

Let's call that on the 13 users and re-run the pipeline

If that works, we can run the pipeline on all production systems overnight and upgrade tomorrow.

shankari commented 1 year ago

Re-running the pipeline didn't help, since the error is that the starting_trip field is set, but the related object is not found - it is a cleaned trip. There is also apparently a confirmed place which does not have a duration:

Error while matching incoming user inputs, timestamp is unchanged
Traceback (most recent call last):
  File "/home/kshankar/e-mission-server/emission/analysis/userinput/matcher.py", line 23, in match_incoming_user_inputs
    last_user_input_done = match_incoming_inputs(user_id, time_query)
  File "/home/kshankar/e-mission-server/emission/analysis/userinput/matcher.py", line 45, in match_incoming_inputs
    confirmed_obj = esdt.get_confirmed_obj_for_user_input_obj(ts, ui)
  File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 293, in get_confirmed_obj_for_user_input_obj
    return final_candidate(valid_timeline_entry(ts, ui_obj), potential_candidates)
  File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 177, in final_candidate
    extra_filtered_potential_candidates = list(filter(filter_fn, potential_candidate_objects))
  File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 262, in curried
    return valid_user_input_for_timeline_entry(ts, confirmed_obj, user_input)
  File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 166, in valid_user_input_for_timeline_entry
    (overlapDuration, tl_entry.data.duration, (overlapDuration / tl_entry.data.duration)));
  File "/home/kshankar/e-mission-server/emission/core/wrapper/wrapperbase.py", line 71, in __getattr__
    raise AttributeError(
AttributeError: 'Confirmedplace' instance has no attribute 'duration'

Writing a similar function for confirmed objects so that we can delete and recreate only them.

shankari commented 1 year ago

Wait a minute - for the forward migration, we don't need to run the full, non-skipped pipeline, which can take several hours even on staging. We just need a migration script that calls _get_and_store_range for every user.
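A sketch of the shape of that migration; the import location and exact signature of _get_and_store_range are assumptions:

import emission.core.get_database as edb
# the module that actually defines _get_and_store_range is assumed here
from emission.analysis.userinput.matcher import _get_and_store_range

# pre-cache the pipeline range for every registered user, instead of
# re-running the whole pipeline
for ue in edb.get_uuid_db().find():
    _get_and_store_range(ue["uuid"], "analysis/composite_trip")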

shankari commented 1 year ago

That was easy! The pipeline seems to have other broken Confirmed objects, so I will update the reset scripts and fix those separately. But at least with this, we can get the pipeline ranges fixed and move on.

shankari commented 1 year ago

For the broken confirmed objects, we could simply reset the pipeline, but that will reset everything, since we have no/broken confirmed places. And running the trip and section segmentation currently takes a Very Long Time.

We also currently have several users "stuck" at stage 2. When we upgrade, those pipelines will be killed. Instead of re-running stage 1, it would be good if we could just reset stage 2

Let's modify ./bin/debug/delete_composite_objects_and_state.py to be more general.

shankari commented 1 year ago

Migration script commits are:

shankari commented 1 year ago

After the scalability improvements were deployed, the load on the database has dropped significantly

Image

And so has utilization of the compute cluster

Image

Couple of caveats:

shankari commented 1 year ago

But the pipeline problems (e.g. being stuck with draft trips) are not an issue any more, so I declare this closed.

shankari commented 1 year ago

After a few days:

Image

Image