e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

No processed trips for a few days; buildup of draft trips #942

Closed JGreenlee closed 1 year ago

JGreenlee commented 1 year ago

I had a busy day of travel on Saturday (3 days ago).

My last trip showing up as processed was 3:34-3:44 PM. All trips from that point onward (approx. 30) are draft trips.

Processing should happen once every hour and we should never see a buildup of unprocessed travel this long. This could be a sync issue on the phone, or it could be a processing issue on the server.

JGreenlee commented 1 year ago

Local logs indicate an error is being thrown in the sync plugin: ServerSyncAdapter : IO Error java.io.IOException while posting converted trips to JSON, which is coming from https://github.com/e-mission/cordova-server-sync/blob/01c4e4cb7cd9bfd1b86f451d184ae187139ab7e2/src/android/ServerSyncAdapter.java#L148

This could be the cause; however, other recent logs indicate that sync is still happening successfully.

JGreenlee commented 1 year ago

I'll also note that there was significant travel in the gap between 3:44 and 8:42 PM. I remember my phone battery was low most of the day, so battery saver was probably on and off at different parts of the day.

shankari commented 1 year ago

I remember my phone battery was low most of the day, so battery saver was probably on and off at different parts of the day.

It would be good to see if this is in fact the issue. I have been testing with battery saver off and on (#940) and the trajectories may be a bit worse, but I don't see any big gaps in sensing.

shankari commented 1 year ago

To clarify, this was because the servers had been very unstable all last week and were crashing multiple times. One of the crashes must have occurred while the pipeline for this user was running. Once it crashed mid-run, all subsequent runs would crash because curr_run_ts was still set.

I reset the pipeline and the user was able to see their trips.

shankari commented 1 year ago

This occurred again yesterday. The servers crashed at around 4:18pm and then again at around 5:14pm.

Since this is the second time this has happened, I wrote a script to iterate through all the active production deployments and check how many invalid pipeline states we had.

import os
import re
import copy
import json

# PROD_LIST: list of active production deployment names, defined elsewhere in
# the environment where this runs
storage_conf_template = json.load(open("conf/storage/db.conf"))
print(storage_conf_template.keys())

for prod in PROD_LIST:
    prod_db_name = prod.replace("-", "_")
    print(f"Resetting pipeline for {prod=} with {prod_db_name=}")
    # point the storage config at this deployment's database
    curr_storage_conf = copy.deepcopy(storage_conf_template)
    curr_storage_conf['timeseries']['url'] = curr_storage_conf['timeseries']['url'].replace("REPLACEME", prod_db_name)
    print(f"After resetting, {storage_conf_template=}, {curr_storage_conf=}")
    json.dump(curr_storage_conf, open("conf/storage/db.conf", "w"))
    os.system("echo conf/storage.db.conf")
    # check how many invalid pipeline states this deployment has
    check_reset_cmd = "./e-mission-py.bash bin/monitor/reset_invalid_pipeline_states.py -c"
    os.system(check_reset_cmd)

# restore the original template so the config does not stay pointed at the last prod DB
json.dump(storage_conf_template, open("conf/storage/db.conf", "w"))
shankari commented 1 year ago

The number of users that need to be reset varies from ~1 to ~5. Starting with those, in an attempt to spread out the load. I wish it were possible to check the CPU load directly...

shankari commented 1 year ago

Cloud Services indicated that the issue was related to high CPU load. Let us look at some of the long tasks that they flagged. One of the long tasks was from denver_casr.

Looking at the logs for one hour (attached), we notice that:

$ wc -l /tmp/one_hour_of_denver_casr
     126 /tmp/one_hour_of_denver_casr

However, 94 of them did not process any data:

$ grep unchanged /tmp/one_hour_of_denver_casr | wc -l
      94

and only 31 actually had data to process:

$ grep "last_ts_processed = " /tmp/one_hour_of_denver_casr | wc -l
      31

Even the ones that did not have data to process took about 30 seconds each:

"2023-08-10 23:41:35,463:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:42:23,039:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:42:58,393:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:43:19,324:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:43:52,610:INFO:140488176322368:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:44:05,465:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"

But that means that it must have taken ~ 47 minutes just to process and skip those users with no data:

>>> (30 * 94) / 60
47

A simple fix seems to be that, if the user has no incoming data (e.g. nothing moved to long_term), we can skip the rest of the pipeline.

The problem with that approach is that if there are downstream pipeline stages that have not fully processed - e.g. because of a crash followed by a pipeline reset - then those stages will not run again until new data comes in. And if there is no new data coming in (e.g. the user has uninstalled the app), then we will never reprocess the data.

However, it seems like a fairly simple fix is to pass in a flag indicating whether to bail out early or not. We can set it to true (bail out early) on production for now, and consider running with the flag turned off (don't bail early) on a separate "catch up" AWS instance. Or turn the flag off after we discuss other options for scalability.... Or...
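A minimal sketch of that flag, with placeholder names for the stages (this is not the actual intake pipeline code, just the shape of the early bail-out):

import logging

def run_intake_pipeline_for_user(user_id, stages, count_new_entries, skip_if_no_new_data=True):
    # count_new_entries(user_id) stands in for the USERCACHE stage and returns
    # how many entries were moved to long_term; stages is a list of the
    # remaining per-user stage functions. Both are placeholders for the real
    # intake pipeline structure.
    moved_count = count_new_entries(user_id)
    if skip_if_no_new_data and moved_count == 0:
        logging.info("No new data for %s, skipping rest of pipeline" % user_id)
        return
    for stage in stages:
        stage(user_id)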

one_hour_of_denver_casr.gz

shankari commented 1 year ago

This is a super simple fix, which I plan on pushing to staging tomorrow.

shankari commented 1 year ago

While we are looking at early returns from the pipeline, let's see if we can do something similar for the trip model as well. If we look at the Durham logs, for example, the pipeline ran for 30 mins even though there was no actual data coming in.

2023-08-10 02:00:52,228
2023-08-10 02:24:41,112

The current trip model pipeline is not incremental, so we cannot look to see if there are "new" trips. However, it is also run once a day, so if the last trip is more than a day old (or ~ 3 days old to add in some margin) we can do an early return.

shankari commented 1 year ago

The logs for Durham are:

2023-08-10 02:00:52,228:INFO:building model for user
2023-08-10 02:00:52,228:DEBUG:getting key model_type in config
2023-08-10 02:00:52,228:DEBUG:getting key model_storage in config
2023-08-10 02:00:52,228:DEBUG:getting key minimum_trips in config
2023-08-10 02:00:52,228:DEBUG:getting key model_parameters.greedy in config
2023-08-10 02:00:52,228:DEBUG:GreedySimilarityBinning loaded model config from file
2023-08-10 02:00:54,309:DEBUG:retrieved latest trip model recorded at timestamp 1691546460.3530114
2023-08-10 02:00:54,369:INFO:For stage PipelineStages.TRIP_MODEL, start_ts = 2022-11-01T23:29:05.438816
2023-08-10 02:00:54,389:DEBUG:time query for training data collection: None
2023-08-10 02:00:54,993:DEBUG:found 38 labeled trips for user
2023-08-10 02:00:56,036:DEBUG:upsert_doc called with key inference/trip_model
2023-08-10 02:00:56,292:DEBUG:GREEDY_SIMILARITY_BINNING label prediction model built for user with timestamp 1667345340.438816
2023-08-10 02:00:56,311:INFO:building model for user ....

So the run took ~ 4 seconds and ~ 2 seconds were spent in loading the model. Unfortunately, we load the model before loading the trips.

So we have two choices:

  1. add a new check before reading the model to get the last recorded trip (by using ts.get_first_value_for_field)
  2. move the trip load before the model load

The first feels more hacky, but is also likely to be more performant, especially for longer data collection durations, since we are not currently using an incremental model. Once we switch to an incremental model, we might want to implement (2) since it is more principled and less hacky.
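A sketch of option (1), assuming we gate the daily model build on the age of the most recent confirmed trip; the ~3 day cutoff and the "no trips" sentinel handling are assumptions:

import time
import pymongo
import emission.storage.timeseries.abstract_timeseries as esta

MAX_MODEL_DATA_AGE = 3 * 24 * 60 * 60  # ~3 days, to add some margin

def has_recent_trips(user_id):
    # Look up only the end_ts of the most recent confirmed trip; this avoids
    # loading the (potentially large) trip model when there is nothing new.
    ts = esta.TimeSeries.get_time_series(user_id)
    last_trip_end = ts.get_first_value_for_field("analysis/confirmed_trip",
                                                 "data.end_ts", pymongo.DESCENDING)
    if last_trip_end is None or last_trip_end == -1:  # sentinel for "no trips" may differ
        return False
    return (time.time() - last_trip_end) < MAX_MODEL_DATA_AGE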

@humbleOldSage for visibility and to keep in mind while switching models in the future.

shankari commented 1 year ago

Cloud services bumped up the server size, so I reset everything and reran. As of yesterday, there were just some slow-running pipelines for masscec and open access.

But today, we ran into a new entry in ccebikes which was stuck at the USER_INPUT stage.

On checking, it really was taking a lot of time to process the user inputs (I guess the user labeled a lot of trips at once?). I was monitoring the run and suddenly noticed that the task seemed to hang/terminate, so I assumed that the tasks were terminating randomly.

However, there were a few more logs when I came back to take a look at it, and it looked like the saves were taking ~ 20 minutes each:

1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:update called"
1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06524fc6a934b4ab56230'), 'metadata': {'key': 'analysis/confirmed_trip',

1691861981828,"2023-08-12 17:39:41,825:DEBUG:139623946450752:update called"
1691861981828,"2023-08-12 17:39:41,828:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06598fc6a934b4ab56234'), 'metadata': {'key': 'analysis/composite_trip',

1691861982575,"2023-08-12 17:39:42,575:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...

The really weird thing is that the confirmed trip saves were taking 20 mins, but the subsequent composite trip saves were taking almost no time. What are we doing between those two calls?

1691863387241,"2023-08-12 18:03:07,240:DEBUG:139623946450752:update called"
1691863387241,"2023-08-12 18:03:07,240:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64ce9c41618d03e1e6a41e95'), 'user_id': UUID('1355e661-e7fa-4450-aae8-dd828fca35e6'), 'metadata': {'key': 'analysis/confirmed_trip',

1691864075195,"2023-08-12 18:14:35,195:DEBUG:139623946450752:update called"
1691864075197,"2023-08-12 18:14:35,197:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64ce9dac618d03e1e6a41e98'), 'metadata': {'key': 'analysis/composite_trip',

1691864075717,"2023-08-12 18:14:35,717:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...
shankari commented 1 year ago

So the update and save calls happen together:

        logging.debug("update called")
        ts = esta.TimeSeries.get_time_series(entry.user_id)
        logging.debug("Saving entry %s into timeseries" % entry)
        edb.save(ts.get_timeseries_db(entry.metadata.key), entry)

I assume that the save is not slow in general, since the composite trip call is fast, although the composite trip is a bigger object. We need to see what (if anything) we are doing between the two saves.

shankari commented 1 year ago

Bingo! Here's the code.

We first update the confirmed object, and then we find the composite trip by querying for the trip whose confirmed_trip field points to this confirmed object:

    estbt.BuiltinTimeSeries.update(confirmed_obj)
    # we update confirmed object in the pipeline, but they are no longer the terminal data structures
    # that we display. So when the confirmed objects change, we need to find the corresponding
    # terminal objects (composite trips) and update them as well
    # if we don't find a matching composite trip, we don't need to do anything
    # since it has not been created yet and will be created with updated values when we get to that stage
    if confirmed_obj["metadata"]["key"] in [esda.CONFIRMED_TRIP_KEY, esda.CONFIRMED_UNTRACKED_KEY]:
        composite_trip = edb.get_analysis_timeseries_db().find_one({"data.confirmed_trip": confirmed_obj.get_id()})
        if composite_trip is not None:
            # copy over all the fields other than the end_confimed_place
            EXCLUDED_FIELDS = ["end_confirmed_place"]
            for k in confirmed_obj["data"].keys():
                if k not in EXCLUDED_FIELDS:
                    composite_trip["data"][k] = confirmed_obj["data"][k]
            estbt.BuiltinTimeSeries.update(ecwe.Entry(composite_trip))
        else:
            logging.debug("No composite trip matching confirmed trip %s, nothing to update" % confirmed_obj["_id"])

But there is no index on the field data.confirmed_trip!! So that query is taking ~20 mins each on some of the databases. For reference, the existing analysis indices are:

def _create_analysis_result_indices(tscoll):
    tscoll.create_index([("metadata.key", pymongo.ASCENDING)])

    # trips and sections
    tscoll.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
    tscoll.create_index([("data.end_ts", pymongo.DESCENDING)], sparse=True)
    _migrate_sparse_to_dense(tscoll, "data.start_loc_2dsphere")
    _migrate_sparse_to_dense(tscoll, "data.end_loc_2dsphere")
    tscoll.create_index([("data.start_loc", pymongo.GEOSPHERE)])
    tscoll.create_index([("data.end_loc", pymongo.GEOSPHERE)])
    _create_local_dt_indices(tscoll, "data.start_local_dt")
    _create_local_dt_indices(tscoll, "data.end_local_dt")

@JGreenlee for visibility

shankari commented 1 year ago

That stage finally completed and the thread has moved on. It doesn't show up as needing a reset at this point.

shankari commented 1 year ago

There are a couple of potential fixes for the above:

  1. We can add an index for the data.confirmed_trip field (a sketch follows this list).
  2. We can load all composite trips for the specified time range, and then just match in memory.

(1) seems easier at first glance, but I think that (2) is actually better because we should generally optimize timeseries accesses over field-based links.
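For reference, option (1) would be a one-line addition to _create_analysis_result_indices above (sketch only; not necessarily the fix we will go with):

    # option (1): sparse, since only composite trips carry a data.confirmed_trip field
    tscoll.create_index([("data.confirmed_trip", pymongo.ASCENDING)], sparse=True)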
shankari commented 1 year ago

I also noticed that several of the users are stuck in stage 2 (section segmentation), not just trip segmentation. Checking the logs, it looks like several calls which involve indices are taking multiple minutes to run:

2023-08-13 16:28:15,503:DEBUG:139939849852736:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID(...), '$or': [{'metadata.key': 'background/motion_activity'}], 'data.ts': {'$lte': 1671926016.0008862, '$gte': 1671925742.2325292}}, sort_key = data.ts

2023-08-13 16:31:20,377:DEBUG:139939849852736:curr activity = Motionactivity({'unknown': False, 'cycling': False, 'automotive': False, 'ts': 1671925801.0545607, 'confidence': 100, 'stationary': False, 'walking': False, 'running': False, 'type': 9, 'confidence_level': 'high', 'fmt_time': '2022-12-24T18:50:01.054561-05:00', 'local_dt_year': 2022, 'local_dt_month': 12, 'local_dt_day': 24, 'local_dt_hour': 18, 'local_dt_minute': 50, 'local_dt_second': 1, 'local_dt_weekday': 5, 'local_dt_timezone': 'America/New_York', '_id': ObjectId('63a7910680ea0c211d6aeb81'), 'user_id': ..., 'metadata_write_ts': 1671925801.0545607}), returning False

Because of this, for a particular masscec user, although the trip segmentation is at July 2023, the section segmentation is still stuck at September 2022.

Even looking up one entry by the timestamp, even though data.ts is indexed, is taking ~30 seconds:

2023-08-13 16:54:14,216:DEBUG:140481331697472:get_entry_at_ts query = {'user_id': UUID('...'), 'metadata.key': 'background/filtered_location', 'data.ts': 1663716286.000039}

2023-08-13 16:54:35,776:DEBUG:140481331697472:get_entry_at_ts result = {'_id': ObjectId('63321b30aa2dde12ed3db589'), 'user_id': UUID('...'), 'metadata': {'time_zone': 'America/New_York', 'plugin': 'none', 'platform': 'ios', 'read_ts': 0, 'key': 'background/filtered_location', 'type': 'sensor-data',...
shankari commented 1 year ago

Documenting design decisions: for https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1674108246, if there are no new entries, we don't modify the pipeline state at all. That seems fairly consistent and neat: if the pipeline state didn't exist before, it won't be created, and the last_run_ts won't change.

shankari commented 1 year ago

wrt

We can also load all composite trips for the specified time range, and then just match in memory. (1) seems easier at first glance, but I think that (2) is actually better because we should generally optimize timeseries accesses over field-based links

from https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676108597, this is a bit tricky because the range of affected confirmed/composite trips can be anything - people can go back and label trips for a year ago today.

So we can fix this by first determining the confirmed trips that are affected, and then pulling all the composite trips in that time range and matching them up. This can still have bad performance for a user who (a) has a lot of trips, and (b) labels trips across a wide range (e.g. one trip a year ago and one trip from a week ago).

An alternative is to find the composite trip using the start_ts and not the id - the start_ts for the composite trip is guaranteed to be the same as the start_ts for the confirmed trip.

Let's go with the second approach. It does not have the problem with widely separated trips, and it does not change the code structure. It simply replaces a call on a non-indexed field with a call on an indexed field.

shankari commented 1 year ago

Digression to investigate some scalability improvements from https://github.com/e-mission/e-mission-docs/issues/948

Notably, we frequently take ~ 10 seconds for the first call to complete_ts:

1691900798386,"2023-08-13 04:26:38,385:DEBUG:139761210832448:START POST /pipeline/get_range_ts"
1691900798386,"2023-08-13 04:26:38,386:DEBUG:139761210832448:methodName = skip, returning <class 'emission.net.auth.skip.SkipMethod'>"
1691900798386,"2023-08-13 04:26:38,386:DEBUG:139761210832448:Using the skip method to verify id token ... of length 36"
1691900798394,"2023-08-13 04:26:38,394:DEBUG:139761210832448:retUUID = ...

1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:Returning complete_ts = 1691889354.233"
1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:Returning range (1659322488.216, 1691889349.233)"
1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:END POST /pipeline/get_range_ts ... 10.11975884437561 "

although subsequent calls are faster.

1691900834916,"2023-08-13 04:27:14,916:DEBUG:139761200342592:START POST /pipeline/get_range_ts"

1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:Returning complete_ts = 1691889354.233"
1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:Returning range (1659322488.216, 1691889349.233)"
1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:END POST /pipeline/get_range_ts"

This is super important since this is the first call for every retrieval and having this take 10 seconds is not really acceptable.

I had originally assumed that this was from the complete_ts call, which made no sense, since it pulls that data directly from the pipeline states.

complete_ts = esp.get_complete_ts(user_id)

However, I now see that we first get the range from the trips and then get the complete_ts, so it could be either, and it is far more likely to be the first one:

    ts = esta.TimeSeries.get_time_series(user_id)
    start_ts = ts.get_first_value_for_field("analysis/composite_trip", "data.start_ts", pymongo.ASCENDING)
...
    end_ts = ts.get_first_value_for_field("analysis/composite_trip", "data.end_ts", pymongo.DESCENDING)
...
    complete_ts = get_complete_ts(user_id)

If this is causing the slowdown, and the DocumentDB team is not able to fix it, we should probably fix it by pre-caching the pipeline range at the end of the pipeline run.

@JGreenlee for visibility

shankari commented 1 year ago

On a related note, and looking at https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676410882, where we were taking 30 seconds for a single lookup by an indexed field (data.ts), I wrote a simple script to print out some basic statistics about DB sizes and access times (attached).

I then ran it against all the production databases by hacking/expanding https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1674063739.

The resulting CSV is below, and also attached.

program raw_timeseries_size analysed_timeseries_size len_entries_in_range retrieve_location_range_time retrieve_single_entry_with_id_time retrieve_single_entry_with_ts_time
0 open_access 30634701 1841576 42362 315.825 0.017324 11.5839
1 nrel_commute 9115634 507854 6754 98.9754 0.0117631 27.0977
2 mm_masscec 22649423 966169 70780 221.948 0.0120975 152.7
3 denver_casr 10273935 1063449 259431 255.146 0.163944 1.2389
4 smart_commute_ebike 2860865 264346 38506 51.0785 0.132923 1.36704
5 uprm_nicr 5272631 264194 31674 93.4192 0.151773 11.4744
6 ebikes_for_essentials 2471371 268972 18151 35.7896 0.0820957 2.86416
7 cortezebikes 1643523 215805 18645 32.3265 0.0976193 3.7595
8 ccebikes 16812728 1226622 325829 281.777 0.0258702 0.816994
9 ebikegj 5614952 437830 153270 107.633 0.0172236 2.79284
10 choose_your_ride 1730103 201746 42847 25.191 0.0126362 0.524229
11 ebikethere_garfield_county 4587180 422702 117008 68.4168 0.173241 1.49514
12 ca_ebike 563138 57894 17752 8.41705 0.00962978 0.582538
13 fortmorgan 521101 75260 11502 5.21246 0.0204601 0.780257
14 usaid_laos_ev 186668 14350 3360 1.3242 0.0217177 1.44233

scalability_results_round_1.csv program_size_and_query_time.py.gz

shankari commented 1 year ago

Here's the same result in graph form. We can see that:

Image

shankari commented 1 year ago

Following up on https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1677846253, I added in the time to find the first trip and the last trip for each program.

New results are:

program raw_timeseries_size analysed_timeseries_size len_entries_in_range retrieve_location_range_time retrieve_single_entry_with_id_time retrieve_single_entry_with_ts_time retrieve_first_trip_for_user_time retrieve_last_trip_for_user_time retrieve_complete_ts_for_user_time
0 open_access 30640974 1860401 42362 332.034 0.0159355 3.60054 4.22115 0.0400267 0.0332904
1 nrel_commute 9119362 508326 6754 106.926 0.0932374 21.7889 9.92843 0.0943718 0.0137558
2 mm_masscec 22659958 966607 70780 271.97 0.0239395 24.9989 3.01057 0.0608934 0.0427971
3 denver_casr 10298147 1066292 259431 128.11 0.0913577 0.845236 1.60049 0.0179492 0.0594237
4 smart_commute_ebike 2862420 264853 38506 33.4999 0.223669 6.98352 7.37743 0.0492139 0.033423
5 uprm_nicr 5273248 264194 31674 49.5853 0.108283 1.48983 2.37829 0.0183116 0.037815
6 ebikes_for_essentials 2482810 269674 18151 28.2285 0.051417 1.06774 1.12528 0.015482 0.0478389
7 cortezebikes 1648025 216873 18645 22.0812 0.0126098 0.971451 4.94458 0.0265731 0.0451303
8 ccebikes 16840737 1228313 325829 227.473 0.0124022 1.46591 3.92774 0.0801908 0.0451017
9 ebikegj 5628333 440100 153270 87.0848 0.0145204 2.19927 1.78379 0.0193664 0.04728
10 choose_your_ride 1744706 203427 42847 19.7551 0.0862026 0.310989 0.86218 0.0516913 0.0433836
11 ebikethere_garfield_county 4595259 423523 117008 56.8704 0.114763 0.547563 4.08938 0.0420983 0.046549
12 ca_ebike 563410 57894 17752 6.69122 0.0187379 1.55731 1.56463 0.0285704 0.0411629
13 fortmorgan 522200 75545 11502 4.46128 0.0150901 2.60688 2.51811 0.0274105 0.0387707
14 usaid_laos_ev 186902 14350 3360 0.894163 0.0133026 0.362137 0.448559 0.00895064 0.014204

scalability_results_round_2.csv

shankari commented 1 year ago

New results are:

Notably:

Image

We use the pipeline range_ts even for the current production version, so it is critical to get this out into production ASAP.

shankari commented 1 year ago

Given that retrieve_single_entry_with_ts is sometimes taking ~ 150 secs ≈ 2.5 mins, I wondered if it was worthwhile to replace the implementation. Then I checked, and it appears to take ~10 minutes without the index. Ok, so we do need it 😄 https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676052328

1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:update called"
1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06524fc6a934b4ab56230'), 'metadata': {'key': 'analysis/confirmed_trip',

1691861981828,"2023-08-12 17:39:41,825:DEBUG:139623946450752:update called"
1691861981828,"2023-08-12 17:39:41,828:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06598fc6a934b4ab56234'), 'metadata': {'key': 'analysis/composite_trip',

1691861982575,"2023-08-12 17:39:42,575:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...
shankari commented 1 year ago

So this change https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1677776874 broke the unit tests, since we copy over all fields (including the start and end timestamps) when updating the composite trip with the confirmed trip. So if the confirmed trip start_ts has been modified, then the composite trip will no longer match.

        composite_trip = ts.get_entry_at_ts("analysis/composite_trip", "data.start_ts", confirmed_obj.data.start_ts)
        if composite_trip is not None:
            # copy over all the fields other than the end_confimed_place
            EXCLUDED_FIELDS = ["end_confirmed_place"]
            for k in confirmed_obj["data"].keys():
                if k not in EXCLUDED_FIELDS:
                    composite_trip["data"][k] = confirmed_obj["data"][k]
            estbt.BuiltinTimeSeries.update(ecwe.Entry(composite_trip))

There are two potential fixes:

As far as I can make out, grep -r update_confirmed_and_composite emission only returns two call sites, both of which are in emission/analysis/userinput/matcher.py, and one of them is purely focused on the last_trip. In the other, we fill in the single and multiple inputs for the trip. So why do we need to copy all the data over? Why not just the user inputs?

    for ui in toMatchInputs:
        confirmed_obj = esdt.get_confirmed_obj_for_user_input_obj(ts, ui)
        if confirmed_obj is not None:
            if ui.metadata.key in single_most_recent_match_key_list:
                handle_single_most_recent_match(confirmed_obj, ui)
            elif ui.metadata.key in multi_non_deleted_key_list:
                handle_multi_non_deleted_match(confirmed_obj, ui)
            else:
                assert False, "Found weird key {ui.metadata.key} that was not in the search list"
            update_confirmed_and_composite(ts, confirmed_obj)
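A sketch of the narrower copy, reusing the aliases from the snippets above (ts, estbt, ecwe); the field names "user_input" and "additions" are assumed from the discussion, not checked against the schema:

def copy_user_inputs_to_composite(ts, confirmed_obj):
    # Drop-in variant of the copy loop above: propagate only the user-input
    # related fields instead of the whole confirmed trip data dict, so that
    # start_ts/end_ts (and other trip fields) are never overwritten.
    USER_INPUT_FIELDS = ["user_input", "additions"]  # assumed field names
    composite_trip = ts.get_entry_at_ts("analysis/composite_trip",
                                        "data.start_ts", confirmed_obj.data.start_ts)
    if composite_trip is not None:
        for k in USER_INPUT_FIELDS:
            if k in confirmed_obj["data"]:
                composite_trip["data"][k] = confirmed_obj["data"][k]
        estbt.BuiltinTimeSeries.update(ecwe.Entry(composite_trip))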
shankari commented 1 year ago

I looked up the commit, and it was indeed designed to only fill in the user inputs and additions https://github.com/e-mission/e-mission-server/commit/149025b12c66e3b9708dfd505bc670a08d06fd69

@JGreenlee can you think of any other fields that need to be updated? If not, I will change the unit test, finish this change, and move on to the pipeline range fix.

JGreenlee commented 1 year ago

As far as I know, those are the only things that can be changed. Unless we plan to support trip editing anytime soon ;)

JGreenlee commented 1 year ago

Unless inferences can change?

shankari commented 1 year ago

Where do we store the pipeline range?

The obvious option (1) would be to store it in the pipeline_states collection, because it represents the range of the pipeline states. But it won't actually be in the format of the PipelineStage with curr_run_ts etc.; mongo doesn't care about this, but future databases, and the export, might.

What we really want to do is to store one pair per user and update it as the pipeline runs. Where do we store the per-user updateable information? In the profile! So an alternate option (2) would be to store it in the profile.

I vastly prefer (2) because of consistency. We can still support the fallbacks (composite -> confirmed -> cleaned) in that case by updating the profile as each stage of the pipeline completes.
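A minimal sketch of option (2), assuming the per-user profile is reachable via edb.get_profile_db() and that a helper along these lines runs at the end of the pipeline; the names and the stored document shape are illustrative:

import pymongo
import emission.core.get_database as edb
import emission.storage.timeseries.abstract_timeseries as esta

def store_pipeline_range_in_profile(user_id):
    # Recompute the composite trip range once, after the pipeline run for this
    # user, and cache it in the per-user profile so /pipeline/get_range_ts
    # becomes a cheap profile lookup instead of two timeseries scans.
    ts = esta.TimeSeries.get_time_series(user_id)
    start_ts = ts.get_first_value_for_field("analysis/composite_trip",
                                            "data.start_ts", pymongo.ASCENDING)
    end_ts = ts.get_first_value_for_field("analysis/composite_trip",
                                          "data.end_ts", pymongo.DESCENDING)
    edb.get_profile_db().update_one(
        {"user_id": user_id},
        {"$set": {"pipeline_range": {"start_ts": start_ts, "end_ts": end_ts}}},
        upsert=True)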

shankari commented 1 year ago

Unless inferences can change?

We make one pass through the pipeline, filling in inferences. So once we have filled in inferences, we will not change them in general. If we changed inferences now, we would not account for them, since we are not calling update_confirmed_and_composite after any changes.

I did note, while adding the section summaries, that there is a current hole in which:

But when we deal with that by handling the downstream propagation to the confirmed trips, we can also deal with further propagation to composite trips - it will just be one more set of entries to copy.

shankari commented 1 year ago

@JGreenlee final question on pipeline range - we currently have a backup solution, where if we don't find composite trips, we fall back to confirmed trips and if we don't find those, we fall back to cleaned trips.

https://github.com/e-mission/e-mission-server/blob/8761f4afce635bc4cc7ff1732d9f00956cb5c4ad/emission/net/api/pipeline.py#L21

I added this code in https://github.com/e-mission/e-mission-server/commit/014df2dbe905feb18766d379442427ff26d93132 and I basically had it in mind as a backwards compat solution - if any community members have upgraded the client but not upgraded the server, it would be helpful to support them as well. And we weren't really doing anything super fancy with confirmed trips at the time, so cleaned trip would display in the UI just fine.

But with the new version of the client, we only support composite trips, so they have to upgrade the server as well. So I'm planning to remove the fallbacks.

  1. It will simplify the code dramatically, and
  2. it will fix a hole in which the cleaning stage (say) works and has processed up to t5, but there is an error in the confirmed trip creation at trip t3, so we only have confirmed and composite trips up to t2 (from a previous run). With the current implementation, our pipeline range will appear to end at t5, so we will only query for draft trips after t5. But we don't use cleaned trips in the display, so it will look like trips t3, t4 and t5 have vanished mysteriously.

Thoughts?

JGreenlee commented 1 year ago

I am fine with that. At least on the phone, it would be pretty burdensome to support cleaned or confirmed objects. And for type safety, I think it's better to be more or less deterministic as to what kind of objects are being received from the server.

I think it's ok to simplify this and make maintenance easier. If anyone in the community still wants to use an old version of the server, they can do so with the pre-React phone code (or their own pre-React fork).

shankari commented 1 year ago

Also removing the check against the complete_ts, since it was a sanity check added along with the new method, and it has not been invoked even once in ~ 1.5 years of the open_access and nrel_commute programs. (I am not sure where the search through the logs currently is, but it has been running for a few hours and we haven't found any matches.)

Image Image

shankari commented 1 year ago

Finally, there are other places where we look up entries by fields that are not indexed. Notably, in the newly added get_section_summary, we call emission/storage/decorations/trip_queries.py's get_sections_for_trip, which searches by trip_id, which is not indexed:

def get_sections_for_trip(key, user_id, trip_id):
    # type: (UUID, object_id) -> list(sections)
    """
    Get the set of sections that are children of this trip.
    """
    query = {"user_id": user_id, "data.trip_id": trip_id,
             "metadata.key": key}
    logging.debug("About to execute query %s with sort_key %s" % (query, "data.start_ts"))
    section_doc_cursor = edb.get_analysis_timeseries_db().find(query).sort(
        "data.start_ts", pymongo.ASCENDING)
    return [ecwe.Entry(doc) for doc in section_doc_cursor]

However, we don't have any logs to determine whether this is slow, and to quantify the improvement (if any) of the change. So I am adding logs for now, and will change it later if it turns out to be an issue, especially for people with lots of trips.
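For reference, a sketch of the instrumented version; the timing log is the only addition to the function above:

import time
import logging
import pymongo
import emission.core.get_database as edb
import emission.core.wrapper.entry as ecwe

def get_sections_for_trip(key, user_id, trip_id):
    """
    Get the set of sections that are children of this trip.
    """
    query = {"user_id": user_id, "data.trip_id": trip_id,
             "metadata.key": key}
    start_query = time.time()
    section_doc_cursor = edb.get_analysis_timeseries_db().find(query).sort(
        "data.start_ts", pymongo.ASCENDING)
    sections = [ecwe.Entry(doc) for doc in section_doc_cursor]
    # log the count and elapsed time so we can see how slow the non-indexed
    # data.trip_id lookup is for users with many trips
    logging.debug("get_sections_for_trip: %d sections for trip %s took %.2f secs" %
                  (len(sections), trip_id, time.time() - start_query))
    return sections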

shankari commented 1 year ago

Finally, steps to deploy the changes.

Now, I'm getting an error while creating composite trips for 13 of the users.

There are two possible fixes:

The second option seems better to avoid the gap, but we do still need to decide what to do about the weird composite trip errors.

shankari commented 1 year ago

We have an existing script called

./bin/debug/delete_composite_objects_and_state.py

which takes in a list of uuids

Let's call that on the 13 users and re-run the pipeline

If that works, we can run the pipeline on all production systems overnight and upgrade tomorrow.

shankari commented 1 year ago

Re-running the pipeline didn't help, since the error is that the starting_trip field is set, but the related object is not found - it is a cleaned trip. There is also apparently a confirmed place which does not have a duration:

Error while matching incoming user inputs, timestamp is unchanged
Traceback (most recent call last):
  File "/home/kshankar/e-mission-server/emission/analysis/userinput/matcher.py", line 23, in match_incoming_user_inputs
    last_user_input_done = match_incoming_inputs(user_id, time_query)
  File "/home/kshankar/e-mission-server/emission/analysis/userinput/matcher.py", line 45, in match_incoming_inputs
    confirmed_obj = esdt.get_confirmed_obj_for_user_input_obj(ts, ui)
  File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 293, in get_confirmed_obj_for_user_input_obj
    return final_candidate(valid_timeline_entry(ts, ui_obj), potential_candidates)
  File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 177, in final_candidate
    extra_filtered_potential_candidates = list(filter(filter_fn, potential_candidate_objects))
  File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 262, in curried
    return valid_user_input_for_timeline_entry(ts, confirmed_obj, user_input)
  File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 166, in valid_user_input_for_timeline_entry
    (overlapDuration, tl_entry.data.duration, (overlapDuration / tl_entry.data.duration)));
  File "/home/kshankar/e-mission-server/emission/core/wrapper/wrapperbase.py", line 71, in __getattr__
    raise AttributeError(
AttributeError: 'Confirmedplace' instance has no attribute 'duration'

Writing a similar function for confirmed objects so that we can delete and recreate only them.

shankari commented 1 year ago

Wait a minute - for the forward migration, we don't need to run the full, non-skipped pipeline, which can take several hours even on staging. We just need a migration script that calls _get_and_store_range for every user.
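A sketch of the shape of that migration; the import location and exact signature of _get_and_store_range are assumptions:

import emission.core.get_database as edb
# the module that actually defines _get_and_store_range is assumed here
from emission.analysis.userinput.matcher import _get_and_store_range

# pre-cache the pipeline range for every registered user, instead of
# re-running the whole pipeline
for ue in edb.get_uuid_db().find():
    _get_and_store_range(ue["uuid"], "analysis/composite_trip")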

shankari commented 1 year ago

That was easy! The pipeline seems to have other broken Confirmed objects, so I will update the reset scripts and fix those separately. But at least with this, we can get the pipeline ranges fixed and move on.

shankari commented 1 year ago

For the broken confirmed objects, we could simply reset the pipeline, but that will reset everything, since we have no/broken confirmed places. And running the trip and section segmentation currently takes a Very Long Time.

We also currently have several users "stuck" at stage 2. When we upgrade, those pipelines will be killed. Instead of re-running stage 1, it would be good if we could just reset stage 2

Let's modify ./bin/debug/delete_composite_objects_and_state.py to be more general.

shankari commented 1 year ago

Migration script commits are:

shankari commented 1 year ago

After the scalability improvements were deployed, the load on the database has dropped significantly

Image

And so has utilization of the compute cluster

Image

Couple of caveats:

shankari commented 1 year ago

But the pipeline problems (e.g. being stuck with draft trips) are not an issue any more, so I declare this closed.

shankari commented 1 year ago

After a few days:

Image

Image