Closed by JGreenlee 1 year ago
Local logs indicate an error is being thrown in the sync plugin:
`ServerSyncAdapter : IO Error java.io.IOException while posting converted trips to JSON`, which is coming from https://github.com/e-mission/cordova-server-sync/blob/01c4e4cb7cd9bfd1b86f451d184ae187139ab7e2/src/android/ServerSyncAdapter.java#L148
This could be the cause; however, other recent logs indicate that sync is still happening successfully.
I'll also note that there was significant travel in the gap between 3:44 and 8:42 PM. I remember my phone battery was low most of the day, so battery saver was probably on and off at different parts of the day.
It would be good to see if this is in fact the issue. I have been testing with battery saver off and on (#940) and the trajectories may be a bit worse, but I don't see any big gaps in sensing.
To clarify, this was because the servers had been very unstable all last week and were crashing multiple times. One of the crashes must have occurred while the pipeline for this user was running. Once it crashed while running, all subsequent runs would crash because `curr_run_ts` was set. I reset the pipeline and the user was able to see their trips.
This occurred again yesterday. The servers crashed at around 4:18pm and then again at around 5:14pm.
Since this is the second time this has happened, I wrote a script to iterate through all the active production deployments and check how many invalid pipeline states we had.
import os
import copy
import json

# PROD_LIST (the list of production deployment names) is assumed to be defined above
storage_conf_template = json.load(open("conf/storage/db.conf"))
print(storage_conf_template.keys())
for prod in PROD_LIST:
    prod_db_name = prod.replace("-", "_")
    print(f"Resetting pipeline for {prod=} with {prod_db_name=}")
    curr_storage_conf = copy.deepcopy(storage_conf_template)
    curr_storage_conf['timeseries']['url'] = curr_storage_conf['timeseries']['url'].replace("REPLACEME", prod_db_name)
    print(f"After resetting, {storage_conf_template=}, {curr_storage_conf=}")
    json.dump(curr_storage_conf, open("conf/storage/db.conf", "w"))
    os.system("cat conf/storage/db.conf")  # sanity check: show the config we just wrote
    check_reset_cmd = "./e-mission-py.bash bin/monitor/reset_invalid_pipeline_states.py -c"
    os.system(check_reset_cmd)
# restore the template so the next run starts from a clean config
json.dump(storage_conf_template, open("conf/storage/db.conf", "w"))
The number of users that need to be reset varies from ~1 to ~5 per deployment. Starting with some of them in an attempt to spread out the load. I wish it were possible to check the CPU load directly...
Cloud Services indicated that the issue was related to high CPU load. Let us look at some of the long tasks that they flagged. One of the long tasks was from denver_casr
Looking at the logs for one hour (attached), we notice that:
$ wc -l /tmp/one_hour_of_denver_casr
126 /tmp/one_hour_of_denver_casr
However, 94 of them did not process any data
$ grep unchanged /tmp/one_hour_of_denver_casr | wc -l
94
and only 31 actually had data to process
$ grep "last_ts_processed = " /tmp/one_hour_of_denver_casr | wc -l
31
Even the ones that did not have data to process took about 30 seconds each:
"2023-08-10 23:41:35,463:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:42:23,039:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:42:58,393:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:43:19,324:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:43:52,610:INFO:140488176322368:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
"2023-08-10 23:44:05,465:INFO:139958568953664:For stage PipelineStages.USERCACHE, last_ts_processed is unchanged"
But that means that it must have taken ~ 47 minutes just to process and skip those users with no data
>>> (30 * 94) / 60
47
A simple fix seems to be that, if the user has no incoming data (e.g. nothing moved to long_term), we can skip the rest of the pipeline.
The problem with that approach: if there are downstream pipeline stages that have not fully processed (e.g. due to a crash, after which we reset the pipeline), those stages will not run again until new data comes in. And if no new data is coming in (say, the user has uninstalled the app), we will never reprocess the data.
However, it seems like a fairly simple fix is to pass in a flag indicating whether to bail out early or not. We can set it to true (bail out early) on production for now, and consider running with the flag turned off (don't bail early) on a separate "catch up" AWS instance. Or turn the flag off after we discuss other options for scalability.... Or...
This is a super simple fix which I will plan on pushing to staging tomorrow
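A toy sketch of that flag, with a dict of pending entries standing in for the usercache and `run_stage` standing in for the real pipeline stages (all names here are illustrative, not the server's actual API):

```python
# Sketch of the proposed bail-out flag; usercache and run_stage are
# hypothetical stand-ins for the real usercache and pipeline stages.
def run_intake_pipeline_for_user(user_id, usercache, run_stage, skip_if_no_new_data=True):
    new_entries = usercache.pop(user_id, [])  # "move to long term"
    if skip_if_no_new_data and not new_entries:
        # nothing came in from the phone, so skip the remaining stages
        return 0
    stages_run = 0
    for stage in ["TRIP_SEGMENTATION", "SECTION_SEGMENTATION", "USER_INPUT", "OUTPUT_GEN"]:
        run_stage(user_id, stage, new_entries)
        stages_run += 1
    return stages_run
```

Running with the flag off (the "catch up" instance) still processes every stage even when no new data arrived, which preserves the reset-after-crash behavior described above.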
While we are looking at early returns from the pipeline, let's see if we can do something similar for the trip model as well. If we look at the Durham logs, for example, the pipeline ran for 30 mins even though there was no actual data coming in.
2023-08-10 02:00:52,228
2023-08-10 02:24:41,112
The current trip model pipeline is not incremental, so we cannot look to see if there are "new" trips. However, it is also run once a day, so if the last trip is more than a day old (or ~ 3 days old to add in some margin) we can do an early return.
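A minimal sketch of that early return, assuming we can cheaply fetch the timestamp of the user's most recent trip (the function name and margin constant are illustrative):

```python
import time

THREE_DAY_MARGIN = 3 * 24 * 60 * 60  # ~3 days of margin over the once-a-day schedule

def should_build_trip_model(last_trip_ts, now=None):
    """Return False (i.e. skip the build) if the most recent trip is older
    than the margin, or if the user has no trips at all."""
    if last_trip_ts is None:
        return False
    if now is None:
        now = time.time()
    return (now - last_trip_ts) <= THREE_DAY_MARGIN
```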
The logs for Durham are:
2023-08-10 02:00:52,228:INFO:building model for user
2023-08-10 02:00:52,228:DEBUG:getting key model_type in config
2023-08-10 02:00:52,228:DEBUG:getting key model_storage in config
2023-08-10 02:00:52,228:DEBUG:getting key minimum_trips in config
2023-08-10 02:00:52,228:DEBUG:getting key model_parameters.greedy in config
2023-08-10 02:00:52,228:DEBUG:GreedySimilarityBinning loaded model config from file
2023-08-10 02:00:54,309:DEBUG:retrieved latest trip model recorded at timestamp 1691546460.3530114
2023-08-10 02:00:54,369:INFO:For stage PipelineStages.TRIP_MODEL, start_ts = 2022-11-01T23:29:05.438816
2023-08-10 02:00:54,389:DEBUG:time query for training data collection: None
2023-08-10 02:00:54,993:DEBUG:found 38 labeled trips for user
2023-08-10 02:00:56,036:DEBUG:upsert_doc called with key inference/trip_model
2023-08-10 02:00:56,292:DEBUG:GREEDY_SIMILARITY_BINNING label prediction model built for user with timestamp 1667345340.438816
2023-08-10 02:00:56,311:INFO:building model for user ....
So the run took ~ 4 seconds and ~ 2 seconds were spent in loading the model. Unfortunately, we load the model before loading the trips.
So we have two choices:

1. check the timestamp of the last trip (e.g. with `ts.get_first_value_for_field`) and return early before loading the model, or
2. restructure the code to load the trips before the model and return early if there are no new trips.

The first feels more hacky, but is also likely to be more performant, especially for longer data collection durations, since we are not currently using an incremental model. Once we switch to an incremental model, we might want to implement (2) since it is more principled and less hacky.
@humbleOldSage for visibility and to keep in mind while switching models in the future.
Cloud services bumped up the server size, so I reset everything and reran. As of yesterday, there were just some slow-running pipelines for masscec and open access.
But today, we ran into a new entry in `ccebikes` which was stuck at the `USER_INPUT` stage.
On checking, it really was taking a lot of time to process the user inputs (I guess the user labeled a lot of trips at once?). I was monitoring the run, and noticed suddenly that the task seemed to hang or had terminated. So I assumed that the tasks were terminating randomly.
However, there were a few more logs when I came back to take a look at it, and it looked like the saves were taking ~ 20 minutes each
1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:update called"
1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06524fc6a934b4ab56230'), 'metadata': {'key': 'analysis/confirmed_trip',
1691861981828,"2023-08-12 17:39:41,825:DEBUG:139623946450752:update called"
1691861981828,"2023-08-12 17:39:41,828:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06598fc6a934b4ab56234'), 'metadata': {'key': 'analysis/composite_trip',
1691861982575,"2023-08-12 17:39:42,575:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...
The really weird thing is that the confirmed trip saves were taking 20 mins, but the subsequent composite trip saves were taking almost no time. What are we doing between those two calls?
1691863387241,"2023-08-12 18:03:07,240:DEBUG:139623946450752:update called"
1691863387241,"2023-08-12 18:03:07,240:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64ce9c41618d03e1e6a41e95'), 'user_id': UUID('1355e661-e7fa-4450-aae8-dd828fca35e6'), 'metadata': {'key': 'analysis/confirmed_trip',
1691864075195,"2023-08-12 18:14:35,195:DEBUG:139623946450752:update called"
1691864075197,"2023-08-12 18:14:35,197:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64ce9dac618d03e1e6a41e98'), 'metadata': {'key': 'analysis/composite_trip',
1691864075717,"2023-08-12 18:14:35,717:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...
So the update and saving calls happen together:
logging.debug("update called")
ts = esta.TimeSeries.get_time_series(entry.user_id)
logging.debug("Saving entry %s into timeseries" % entry)
edb.save(ts.get_timeseries_db(entry.metadata.key), entry)
I assume that the save is not slow in general, since the composite trip call is fast, although the composite trip is a bigger object. We need to see what (if anything) we are doing between the two saves.
Bingo! Here's the code.
We first update the confirmed object, and then we find the composite trip by finding the trip for which this is the confirmed_trip.
estbt.BuiltinTimeSeries.update(confirmed_obj)
# we update confirmed objects in the pipeline, but they are no longer the terminal data structures
# that we display. So when the confirmed objects change, we need to find the corresponding
# terminal objects (composite trips) and update them as well.
# if we don't find a matching composite trip, we don't need to do anything
# since it has not been created yet and will be created with updated values when we get to that stage
if confirmed_obj["metadata"]["key"] in [esda.CONFIRMED_TRIP_KEY, esda.CONFIRMED_UNTRACKED_KEY]:
    composite_trip = edb.get_analysis_timeseries_db().find_one({"data.confirmed_trip": confirmed_obj.get_id()})
    if composite_trip is not None:
        # copy over all the fields other than the end_confirmed_place
        EXCLUDED_FIELDS = ["end_confirmed_place"]
        for k in confirmed_obj["data"].keys():
            if k not in EXCLUDED_FIELDS:
                composite_trip["data"][k] = confirmed_obj["data"][k]
        estbt.BuiltinTimeSeries.update(ecwe.Entry(composite_trip))
    else:
        logging.debug("No composite trip matching confirmed trip %s, nothing to update" % confirmed_obj["_id"])
But there is no index on the field `data.confirmed_trip`!! So that query is taking ~20 mins each on some of the databases.
def _create_analysis_result_indices(tscoll):
    tscoll.create_index([("metadata.key", pymongo.ASCENDING)])

    # trips and sections
    tscoll.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
    tscoll.create_index([("data.end_ts", pymongo.DESCENDING)], sparse=True)
    _migrate_sparse_to_dense(tscoll, "data.start_loc_2dsphere")
    _migrate_sparse_to_dense(tscoll, "data.end_loc_2dsphere")
    tscoll.create_index([("data.start_loc", pymongo.GEOSPHERE)])
    tscoll.create_index([("data.end_loc", pymongo.GEOSPHERE)])
    _create_local_dt_indices(tscoll, "data.start_local_dt")
    _create_local_dt_indices(tscoll, "data.end_local_dt")
@JGreenlee for visibility
That stage finally completed and the thread has moved on. It doesn't show up as needing reset at this point.
There are a couple of potential fixes for the above; (1) is to add an index on `data.confirmed_trip`.
I noticed also that several of the users are stuck in stage 2 (section segmentation), not just trip segmentation. Checking the logs, it looks like several calls which involve indexed fields are taking multiple minutes to run:
2023-08-13 16:28:15,503:DEBUG:139939849852736:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID(...), '$or': [{'metadata.key': 'background/motion_activity'}], 'data.ts': {'$lte': 1671926016.0008862, '$gte': 1671925742.2325292}}, sort_key = data.ts
2023-08-13 16:31:20,377:DEBUG:139939849852736:curr activity = Motionactivity({'unknown': False, 'cycling': False, 'automotive': False, 'ts': 1671925801.0545607, 'confidence': 100, 'stationary': False, 'walking': False, 'running': False, 'type': 9, 'confidence_level': 'high', 'fmt_time': '2022-12-24T18:50:01.054561-05:00', 'local_dt_year': 2022, 'local_dt_month': 12, 'local_dt_day': 24, 'local_dt_hour': 18, 'local_dt_minute': 50, 'local_dt_second': 1, 'local_dt_weekday': 5, 'local_dt_timezone': 'America/New_York', '_id': ObjectId('63a7910680ea0c211d6aeb81'), 'user_id': ..., 'metadata_write_ts': 1671925801.0545607}), returning False
Because of this, for a particular masscec user, although the trip segmentation is at July 2023, the section segmentation is still stuck at September 2022.

Even looking up one entry by its timestamp, when `data.ts` is indexed, is taking 30 seconds:
2023-08-13 16:54:14,216:DEBUG:140481331697472:get_entry_at_ts query = {'user_id': UUID('...'), 'metadata.key': 'background/filtered_location', 'data.ts': 1663716286.000039}
2023-08-13 16:54:35,776:DEBUG:140481331697472:get_entry_at_ts result = {'_id': ObjectId('63321b30aa2dde12ed3db589'), 'user_id': UUID('...'), 'metadata': {'time_zone': 'America/New_York', 'plugin': 'none', 'platform': 'ios', 'read_ts': 0, 'key': 'background/filtered_location', 'type': 'sensor-data',...
Documenting design decisions:
For https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1674108246: if there are no new entries, we don't modify the pipeline state at all. That seems fairly consistent and neat. So if the pipeline state didn't exist before, it won't be created, and the `last_run_ts` won't change.
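A toy sketch of that contract, using a plain dict in place of the real pipeline-state collection (all names here are illustrative):

```python
import time

# state_db maps (user_id, stage) -> state dict; a stand-in for the real collection
def run_stage(state_db, user_id, stage, new_entries):
    if not new_entries:
        # no new data: don't create the state and don't touch last_run_ts
        return state_db.get((user_id, stage))
    state = state_db.setdefault((user_id, stage),
                                {"last_run_ts": None, "last_processed_ts": None})
    state["last_processed_ts"] = max(e["ts"] for e in new_entries)
    state["last_run_ts"] = time.time()
    return state
```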
wrt the fixes above: (2) is that we can also load all composite trips for the specified time range, and then just match in memory. (1) (the index) seems easier at first glance, but I think that (2) is actually better because we should generally optimize timeseries accesses over field-based links.
from https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676108597, this is a bit tricky because the range of affected confirmed/composite trips can be anything - people can go back and label trips for a year ago today.
So we can fix this by first determining the confirmed trips that are affected, and then pulling all the composite trips in that time range and matching them up. This can still have bad performance for a user who (a) has a lot of trips, and (b) labels trips across a wide range (e.g. one trip a year ago and one trip from a week ago)
An alternative is to find the composite trip using the `start_ts` and not the id. The `start_ts` for the composite trip is guaranteed to be the same as the `start_ts` for the confirmed trip.
Let's go with the second approach. It does not have the problem with widely separated trips, and it does not change the code structure. It simply replaces a call on a non-indexed field with a call on an indexed field.
Digression to investigate some scalability improvements from https://github.com/e-mission/e-mission-docs/issues/948
Notably, we frequently take ~10 seconds for the first call to `complete_ts`:
1691900798386,"2023-08-13 04:26:38,385:DEBUG:139761210832448:START POST /pipeline/get_range_ts"
1691900798386,"2023-08-13 04:26:38,386:DEBUG:139761210832448:methodName = skip, returning <class 'emission.net.auth.skip.SkipMethod'>"
1691900798386,"2023-08-13 04:26:38,386:DEBUG:139761210832448:Using the skip method to verify id token ... of length 36"
1691900798394,"2023-08-13 04:26:38,394:DEBUG:139761210832448:retUUID = ...
1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:Returning complete_ts = 1691889354.233"
1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:Returning range (1659322488.216, 1691889349.233)"
1691900808505,"2023-08-13 04:26:48,505:DEBUG:139761210832448:END POST /pipeline/get_range_ts ... 10.11975884437561 "
although subsequent calls are faster.
1691900834916,"2023-08-13 04:27:14,916:DEBUG:139761200342592:START POST /pipeline/get_range_ts"
1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:Returning complete_ts = 1691889354.233"
1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:Returning range (1659322488.216, 1691889349.233)"
1691900835215,"2023-08-13 04:27:15,215:DEBUG:139761200342592:END POST /pipeline/get_range_ts"
This is super important since this is the first call for every retrieval and having this take 10 seconds is not really acceptable.
I had originally assumed that this was from the `complete_ts` call, which made no sense, since it pulls that data directly from the pipeline states:
complete_ts = esp.get_complete_ts(user_id)
However, I now see that we first get the range from the trips and then get the complete_ts, so it could be either, and is far more likely to be the first one
ts = esta.TimeSeries.get_time_series(user_id)
start_ts = ts.get_first_value_for_field("analysis/composite_trip", "data.start_ts", pymongo.ASCENDING)
...
end_ts = ts.get_first_value_for_field("analysis/composite_trip", "data.end_ts", pymongo.DESCENDING)
...
complete_ts = get_complete_ts(user_id)
If this is causing the slowdown, and the DocumentDB team is not able to fix it, we should probably fix it by pre-caching the pipeline range at the end of the pipeline run.
@JGreenlee for visibility
On a related note, and looking at https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676410882, where we were taking 30 seconds for a single lookup by an indexed field (`data.ts`), I wrote a simple script to print out some basic statistics about DB sizes and access times (attached). I then ran it against all the production databases by hacking/expanding https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1674063739. The result csv is below, and attached:
 | program | raw_timeseries_size | analysed_timeseries_size | len_entries_in_range | retrieve_location_range_time | retrieve_single_entry_with_id_time | retrieve_single_entry_with_ts_time |
---|---|---|---|---|---|---|---
0 | open_access | 30634701 | 1841576 | 42362 | 315.825 | 0.017324 | 11.5839 |
1 | nrel_commute | 9115634 | 507854 | 6754 | 98.9754 | 0.0117631 | 27.0977 |
2 | mm_masscec | 22649423 | 966169 | 70780 | 221.948 | 0.0120975 | 152.7 |
3 | denver_casr | 10273935 | 1063449 | 259431 | 255.146 | 0.163944 | 1.2389 |
4 | smart_commute_ebike | 2860865 | 264346 | 38506 | 51.0785 | 0.132923 | 1.36704 |
5 | uprm_nicr | 5272631 | 264194 | 31674 | 93.4192 | 0.151773 | 11.4744 |
6 | ebikes_for_essentials | 2471371 | 268972 | 18151 | 35.7896 | 0.0820957 | 2.86416 |
7 | cortezebikes | 1643523 | 215805 | 18645 | 32.3265 | 0.0976193 | 3.7595 |
8 | ccebikes | 16812728 | 1226622 | 325829 | 281.777 | 0.0258702 | 0.816994 |
9 | ebikegj | 5614952 | 437830 | 153270 | 107.633 | 0.0172236 | 2.79284 |
10 | choose_your_ride | 1730103 | 201746 | 42847 | 25.191 | 0.0126362 | 0.524229 |
11 | ebikethere_garfield_county | 4587180 | 422702 | 117008 | 68.4168 | 0.173241 | 1.49514 |
12 | ca_ebike | 563138 | 57894 | 17752 | 8.41705 | 0.00962978 | 0.582538 |
13 | fortmorgan | 521101 | 75260 | 11502 | 5.21246 | 0.0204601 | 0.780257 |
14 | usaid_laos_ev | 186668 | 14350 | 3360 | 1.3242 | 0.0217177 | 1.44233 |
scalability_results_round_1.csv program_size_and_query_time.py.gz
Here's the same result in graph form. We can see that:

- retrieving a single entry with `data.ts` is slow even though it is an indexed key. So what's going on? Is the index not set up? Not working?
- For `masscec`, the time taken to retrieve a single entry with `data.ts` is 150 seconds (> 2 minutes!!) even though it has a lot fewer entries than (say) `open_access`. Note that to get the `retrieve_location_range_time`, I pick a random entry and try to re-find it by using its timestamp. But it does show the response time does not appear to be purely related to data size.

Following up on https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1677846253, I added in the time to find the first trip and the last trip for each program.
New results are:
 | program | raw_timeseries_size | analysed_timeseries_size | len_entries_in_range | retrieve_location_range_time | retrieve_single_entry_with_id_time | retrieve_single_entry_with_ts_time | retrieve_first_trip_for_user_time | retrieve_last_trip_for_user_time | retrieve_complete_ts_for_user_time |
---|---|---|---|---|---|---|---|---|---|---
0 | open_access | 30640974 | 1860401 | 42362 | 332.034 | 0.0159355 | 3.60054 | 4.22115 | 0.0400267 | 0.0332904 |
1 | nrel_commute | 9119362 | 508326 | 6754 | 106.926 | 0.0932374 | 21.7889 | 9.92843 | 0.0943718 | 0.0137558 |
2 | mm_masscec | 22659958 | 966607 | 70780 | 271.97 | 0.0239395 | 24.9989 | 3.01057 | 0.0608934 | 0.0427971 |
3 | denver_casr | 10298147 | 1066292 | 259431 | 128.11 | 0.0913577 | 0.845236 | 1.60049 | 0.0179492 | 0.0594237 |
4 | smart_commute_ebike | 2862420 | 264853 | 38506 | 33.4999 | 0.223669 | 6.98352 | 7.37743 | 0.0492139 | 0.033423 |
5 | uprm_nicr | 5273248 | 264194 | 31674 | 49.5853 | 0.108283 | 1.48983 | 2.37829 | 0.0183116 | 0.037815 |
6 | ebikes_for_essentials | 2482810 | 269674 | 18151 | 28.2285 | 0.051417 | 1.06774 | 1.12528 | 0.015482 | 0.0478389 |
7 | cortezebikes | 1648025 | 216873 | 18645 | 22.0812 | 0.0126098 | 0.971451 | 4.94458 | 0.0265731 | 0.0451303 |
8 | ccebikes | 16840737 | 1228313 | 325829 | 227.473 | 0.0124022 | 1.46591 | 3.92774 | 0.0801908 | 0.0451017 |
9 | ebikegj | 5628333 | 440100 | 153270 | 87.0848 | 0.0145204 | 2.19927 | 1.78379 | 0.0193664 | 0.04728 |
10 | choose_your_ride | 1744706 | 203427 | 42847 | 19.7551 | 0.0862026 | 0.310989 | 0.86218 | 0.0516913 | 0.0433836 |
11 | ebikethere_garfield_county | 4595259 | 423523 | 117008 | 56.8704 | 0.114763 | 0.547563 | 4.08938 | 0.0420983 | 0.046549 |
12 | ca_ebike | 563410 | 57894 | 17752 | 6.69122 | 0.0187379 | 1.55731 | 1.56463 | 0.0285704 | 0.0411629 |
13 | fortmorgan | 522200 | 75545 | 11502 | 4.46128 | 0.0150901 | 2.60688 | 2.51811 | 0.0274105 | 0.0387707 |
14 | usaid_laos_ev | 186902 | 14350 | 3360 | 0.894163 | 0.0133026 | 0.362137 | 0.448559 | 0.00895064 | 0.014204 |
Notably, retrieving the first trip for a user can still take ~10 seconds (e.g. `nrel_commute`). We use the pipeline range_ts even for the current production version, so it is critical to get this out into production ASAP.
Given that `retrieve_single_entry_with_ts` is taking ~150 secs (~2.5 mins) sometimes, I wondered if it is worthwhile to replace the implementation. Then I checked, and it appears to take 10 minutes without the index. OK, so we do need it 😄
https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1676052328
1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:update called"
1691861491149,"2023-08-12 17:31:31,149:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06524fc6a934b4ab56230'), 'metadata': {'key': 'analysis/confirmed_trip',
1691861981828,"2023-08-12 17:39:41,825:DEBUG:139623946450752:update called"
1691861981828,"2023-08-12 17:39:41,828:DEBUG:139623946450752:Saving entry Entry({'_id': ObjectId('64d06598fc6a934b4ab56234'), 'metadata': {'key': 'analysis/composite_trip',
1691861982575,"2023-08-12 17:39:42,575:DEBUG:139623946450752:curr_query = {'invalid': {'$exists': False}, ...
So this change https://github.com/e-mission/e-mission-docs/issues/942#issuecomment-1677776874 broke the unit tests, since we copy over all entries (including the start and end timestamps) when updating composite with confirmed. So if the confirmed trip `start_ts` has been modified, then the composite trip will no longer match.
composite_trip = ts.get_entry_at_ts("analysis/composite_trip", "data.start_ts", confirmed_obj.data.start_ts)
if composite_trip is not None:
    # copy over all the fields other than the end_confirmed_place
    EXCLUDED_FIELDS = ["end_confirmed_place"]
    for k in confirmed_obj["data"].keys():
        if k not in EXCLUDED_FIELDS:
            composite_trip["data"][k] = confirmed_obj["data"][k]
    estbt.BuiltinTimeSeries.update(ecwe.Entry(composite_trip))
There are two potential fixes. As far as I can make out, `grep -r update_confirmed_and_composite emission` only returns two call sites, both of which are in `emission/analysis/userinput/matcher.py`, and one of them is purely focused on the last_trip. In the other, we fill in the single and multiple inputs for the trip. So why do we need to copy all the data over? Why not just the user inputs?
for ui in toMatchInputs:
    confirmed_obj = esdt.get_confirmed_obj_for_user_input_obj(ts, ui)
    if confirmed_obj is not None:
        if ui.metadata.key in single_most_recent_match_key_list:
            handle_single_most_recent_match(confirmed_obj, ui)
        elif ui.metadata.key in multi_non_deleted_key_list:
            handle_multi_non_deleted_match(confirmed_obj, ui)
        else:
            assert False, f"Found weird key {ui.metadata.key} that was not in the search list"
        update_confirmed_and_composite(ts, confirmed_obj)
I looked up the commit, and it was indeed designed to only fill in the user inputs and additions https://github.com/e-mission/e-mission-server/commit/149025b12c66e3b9708dfd505bc670a08d06fd69
@JGreenlee can you think of any other fields that need to be updated? if not, I will change the unit test, finish this change and move on to the pipeline range fix
As far as I know, those are the only things that can be changed. Unless we plan to support trip editing anytime soon ;)
Unless inferences can change?
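The planned change, copying only the user-input fields instead of all of `data`, could be sketched as follows (the field names here are assumptions based on the commit message, not the repo's actual constants):

```python
# Hypothetical field list; the real keys for labels and trip additions may differ.
USER_INPUT_FIELDS = ["user_input", "additions"]

def copy_user_inputs(confirmed_data, composite_data):
    """Copy only the user-generated fields from the confirmed trip to the
    composite trip, leaving derived fields (start_ts, end_ts, ...) untouched."""
    for k in USER_INPUT_FIELDS:
        if k in confirmed_data:
            composite_data[k] = confirmed_data[k]
    return composite_data
```

This sidesteps the unit-test breakage above, since the composite trip's own `start_ts` is never overwritten.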
Where do we store the pipeline range?
The obvious option would be to store it in the pipeline_states collection, because it represents the range of the pipeline states. But it won't actually be in the format of the `PipelineStage` with `curr_run_ts` etc. Mongo doesn't care about this, but future databases, and the export, might.
What we really want to do is to store one pair per user and update it as the pipeline runs. Where do we store per-user updateable information? In the profile! So an alternate option would be to store it in the profile.
I vastly prefer (2) because of consistency. We can still support the fallbacks (composite -> confirmed -> cleaned) in that case by updating the profile as each stage of the pipeline completes.
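A toy sketch of the profile-based option, with a plain dict standing in for the profile collection (all names are illustrative):

```python
# Store one (start_ts, end_ts, complete_ts) triple per user and update it as
# the pipeline runs; reads then become an O(1) profile lookup instead of two
# timeseries scans per /pipeline/get_range_ts request.
def store_pipeline_range(profiles, user_id, start_ts, end_ts, complete_ts):
    profiles.setdefault(user_id, {})["pipeline_range"] = {
        "start_ts": start_ts, "end_ts": end_ts, "complete_ts": complete_ts}

def get_range_ts(profiles, user_id):
    cached = profiles.get(user_id, {}).get("pipeline_range")
    if cached is None:
        return None  # caller falls back to the slow timeseries scan
    return (cached["start_ts"], cached["end_ts"], cached["complete_ts"])
```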
Unless inferences can change?
We make one pass through the pipeline, filling in inferences. So once we have filled in inferences, we will not change them in general. If we changed inferences now, we would not account for them, since we are not calling `update_confirmed_and_composite` after any changes.
I did note, while adding the section summaries, that there is a current hole here. But when we deal with it by handling the downstream propagation to the confirmed trips, we can also deal with further propagation to composite trips; it will just be one more set of entries to copy.
@JGreenlee final question on pipeline range - we currently have a backup solution, where if we don't find composite trips, we fall back to confirmed trips and if we don't find those, we fall back to cleaned trips.
I added this code in https://github.com/e-mission/e-mission-server/commit/014df2dbe905feb18766d379442427ff26d93132 and I basically had it in mind as a backwards compat solution - if any community members have upgraded the client but not upgraded the server, it would be helpful to support them as well. And we weren't really doing anything super fancy with confirmed trips at the time, so cleaned trip would display in the UI just fine.
But with the new version of the client, we only support composite trips, so they have to upgrade the server as well. So I'm planning to remove the fallbacks.
Thoughts?
I am fine with that. At least on the phone, it would be pretty burdensome to support cleaned or confirmed objects. And for type safety, I think it's better to be more or less deterministic as to what kind of objects are being received from the server
I think it's ok to simplify this and make maintenance easier. If anyone in the community still wants to use an old version of the server, they can do so with a the pre-React phone code (or their own pre-React fork)
Also removing the check against the `complete_ts`, since it was a check added along with the new method, but the check has not been invoked even once in ~1.5 years of the `open_access` and `nrel_commute` programs (I am not sure where the search through the logs is, but it has been running for a few hours and we haven't found any matches).
Finally, there are other places where we look up entries that are not indexed. Notably, in the newly added `get_section_summary`, we call `emission/storage/decorations/trip_queries.py`'s `get_sections_for_trip`, which searches by `trip_id`, which is not indexed:
def get_sections_for_trip(key, user_id, trip_id):
    # type: (UUID, object_id) -> list(sections)
    """
    Get the set of sections that are children of this trip.
    """
    query = {"user_id": user_id, "data.trip_id": trip_id,
             "metadata.key": key}
    logging.debug("About to execute query %s with sort_key %s" % (query, "data.start_ts"))
    section_doc_cursor = edb.get_analysis_timeseries_db().find(query).sort(
        "data.start_ts", pymongo.ASCENDING)
    return [ecwe.Entry(doc) for doc in section_doc_cursor]
However, we don't have any logs to determine whether this is slow, and to quantify the improvement (if any) of the change. So I am adding logs for now, and will change it later if it turns out to be an issue, especially for people with lots of trips.
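The logging mentioned above could be as simple as a generic timing wrapper (a sketch, not the actual logging added to the repo):

```python
import logging
import time

def log_query_time(fn, *args, **kwargs):
    """Run fn, log how long it took, and return its result; useful for
    quantifying whether the unindexed trip_id lookup is actually slow."""
    start = time.time()
    result = fn(*args, **kwargs)
    logging.debug("%s took %.3f s", fn.__name__, time.time() - start)
    return result
```

Usage would look like `sections = log_query_time(get_sections_for_trip, key, user_id, trip_id)`.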
Finally, steps to deploy the changes.
- the `OUTPUT_GEN` stage was failing for some users, and so their pipeline range was not getting updated

Now, I'm getting an error while creating composite trips for 13 of the users.
There are two possible fixes:
The second option seems better to avoid the gap, but we do still need to decide what to do about the weird composite trip errors.
We have an existing script called `./bin/debug/delete_composite_objects_and_state.py` which takes in a list of uuids. Let's call that on the 13 users and re-run the pipeline.
If that works, we can run the pipeline on all production systems overnight and upgrade tomorrow.
Re-running the pipeline didn't help, since the error is that the `starting_trip` field is set, but the related object (a cleaned trip) is not found. There is also apparently a confirmed place which does not have a duration:
Error while matching incoming user inputs, timestamp is unchanged
Traceback (most recent call last):
File "/home/kshankar/e-mission-server/emission/analysis/userinput/matcher.py", line 23, in match_incoming_user_inputs
last_user_input_done = match_incoming_inputs(user_id, time_query)
File "/home/kshankar/e-mission-server/emission/analysis/userinput/matcher.py", line 45, in match_incoming_inputs
confirmed_obj = esdt.get_confirmed_obj_for_user_input_obj(ts, ui)
File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 293, in get_confirmed_obj_for_user_input_obj
return final_candidate(valid_timeline_entry(ts, ui_obj), potential_candidates)
File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 177, in final_candidate
extra_filtered_potential_candidates = list(filter(filter_fn, potential_candidate_objects))
File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 262, in curried
return valid_user_input_for_timeline_entry(ts, confirmed_obj, user_input)
File "/home/kshankar/e-mission-server/emission/storage/decorations/trip_queries.py", line 166, in valid_user_input_for_timeline_entry
(overlapDuration, tl_entry.data.duration, (overlapDuration / tl_entry.data.duration)));
File "/home/kshankar/e-mission-server/emission/core/wrapper/wrapperbase.py", line 71, in __getattr__
raise AttributeError(
AttributeError: 'Confirmedplace' instance has no attribute 'duration'
Writing a similar function for confirmed objects so that we can delete and recreate only them.
Wait a minute: for the forward migration, we don't need to run the full, non-skipped pipeline, which can take several hours even on staging. We just need a migration script that calls `_get_and_store_range` for every user.
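That migration can be a short loop; the key `"analysis/composite_trip"` and the way users are enumerated are assumptions here, and `get_and_store_range` is passed in to stand for the real `_get_and_store_range`:

```python
# Hypothetical migration sketch: refresh the cached pipeline range for every
# user without re-running the (multi-hour) pipeline itself.
def migrate_pipeline_ranges(user_ids, get_and_store_range):
    count = 0
    for user_id in user_ids:
        get_and_store_range(user_id, "analysis/composite_trip")
        count += 1
    return count
```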
We just need to run it after `CREATE_COMPOSITE_OBJECTS` is done.

That was easy! The pipeline seems to have other broken Confirmed objects, so will update the reset scripts and fix those separately. But at least with this, we can get the pipeline ranges fixed and move on.
For the broken confirmed objects, we could simply reset the pipeline, but that will reset everything, since we have no/broken confirmed places. And running the trip and section segmentation currently takes a Very Long Time.
We also currently have several users "stuck" at stage 2. When we upgrade, those pipelines will be killed. Instead of re-running stage 1, it would be good if we could just reset stage 2
Let's modify `./bin/debug/delete_composite_objects_and_state.py` to be more general.
Migration script commits are:
After the scalability improvements were deployed, the load on the database has dropped significantly
And so has utilization of the compute cluster
Couple of caveats:
But the pipelines (e.g. being stuck in draft trips) are not an issue any more, so I declare this closed.
After a few days:
I had a busy day of travel on Saturday (3 days ago).
My last trip showing up as processed was 3:34-3:44 PM. All trips from that point onward (approx. 30) are draft trips.
Processing should happen once every hour and we should never see a buildup of unprocessed travel this long. This could be a sync issue on the phone, or it could be a processing issue on the server.