I thought that the slowness might be because there is too much monitoring data (e.g. API calls, pipeline_time, etc.). But no, the bulk of the data is real, and it is in the background/location, background/filtered_location and background/motion_activity categories:
stats/server_api_time -> 1,776,462
statemachine/transition -> 885,091
background/location -> 11,900,375
background/motion_activity -> 15,228,251
background/filtered_location_ignore -> 7,657
background/filtered_location -> 10,078,204
background/battery -> 1,892,910
config/sensor_config -> 905
config/sync_config -> 298
config/consent -> 1,427
stats/server_api_error -> 29,673
stats/client_time -> 2,513,973
stats/client_nav_event -> 325,137
stats/pipeline_time -> 1,125,218
stats/client_error -> 11,780
stats/pipeline_error -> 1,453
manual/incident -> 90
manual/mode_confirm -> 118
manual/purpose_confirm -> 111
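For the record, here is roughly how one could regenerate this per-key breakdown with a single aggregation instead of one count per key. This is only a sketch, assuming pymongo and that the edb helpers from e-mission-server are importable as emission.core.get_database; allowDiskUse is there because the collection is large.

import emission.core.get_database as edb

# group the raw timeseries by metadata.key and count the entries in each group
pipeline = [
    {"$group": {"_id": "$metadata.key", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in edb.get_timeseries_db().aggregate(pipeline, allowDiskUse=True):
    print("%s -> %d" % (row["_id"], row["count"]))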
@jf87 this is the previous bug I had filed for scalability. As you can see, I didn't get very far in my checking...
I also wrote up the goals to motivate an undergrad to work on them: https://github.com/e-mission/e-mission-docs/blob/master/docs/future_work/scalability.md
He actually wrote a trace generator that (unfortunately) needs some cleanup before it can be merged: https://github.com/e-mission/e-mission-server/pull/622
The last time I looked at this, I thought that the issue was with the database. Here's a quick check. We have ~ 46 million entries in the database. Counting them takes ~ 1.4 ms, which seems reasonable.
In [243]: edb.get_timeseries_db().find().count()
Out[243]: 46416708
In [244]: %timeit edb.get_timeseries_db().find().count()
1000 loops, best of 3: 1.39 ms per loop
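(Side note: an unfiltered count() is answered from the collection's metadata rather than by scanning documents, which is presumably why it comes back in ~ 1.4 ms even for ~ 46 million entries. A filtered count has to consult an index or the documents themselves; something like the illustrative timing below, not measured here, would show the difference.)

# illustrative only: the filter forces MongoDB to consult the index/documents
# instead of just reading the metadata counter
%timeit edb.get_timeseries_db().find({"metadata.key": "background/battery"}).count()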
But although the user_id field is indexed,
TimeSeries = _get_current_db().Stage_timeseries
TimeSeries.create_index([("user_id", pymongo.HASHED)])
calling distinct on it takes roughly 30,000 times as long (~ 43 seconds).
In [241]: len(edb.get_timeseries_db().distinct("user_id"))
Out[241]: 1291
In [242]: %timeit edb.get_timeseries_db().distinct("user_id")
1 loop, best of 3: 43 s per loop
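My guess at why the index doesn't help here (not verified against this dataset): a HASHED index stores hashes rather than the actual user_id values, so as far as I understand, distinct cannot be covered by it and the server falls back to scanning the documents. One way to poke at this is a sketch like the following, which lists the indexes and compares against a $group aggregation that also needs the raw values.

import pprint
import emission.core.get_database as edb

ts = edb.get_timeseries_db()
pprint.pprint(ts.index_information())   # should show the hashed user_id index

# $group also needs the raw user_id values, so it probably has to scan as well;
# timing it against distinct() at least shows whether the slowness is specific
# to the distinct code path
unique_users = list(ts.aggregate([{"$group": {"_id": "$user_id"}}], allowDiskUse=True))
print(len(unique_users))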
The reason I looked at the database in the first place was logs like the ones below. These are the logs for a single call to read the metrics and show them in the dashboard.
2019-03-01 00:51:08,775:DEBUG:140220397770496:START POST /result/metrics/timestamp
2019-03-01 00:51:08,775:DEBUG:140220397770496:metric_list = ['duration', 'median_speed', 'count', 'distance']
2019-03-01 00:51:08,775:DEBUG:140220397770496:['duration -> <function get_duration at 0x7f87a1802950>', 'median_speed -> <function get_median_speed at 0x7f87a18029d8>', 'count -> <function get_count at 0x7f87a1802840>', 'distance -> <function get_distance at 0x7f87a18028c8>']
2019-03-01 00:51:08,775:DEBUG:140220397770496:for user None, returning timeseries <emission.storage.timeseries.aggregate_timeseries.AggregateTimeSeries object at 0x7f8776b957f0>
2019-03-01 00:51:08,775:DEBUG:140220397770496:curr_query = {'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}, sort_key = None
2019-03-01 00:51:08,775:DEBUG:140220397770496:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/inferred_section']
2019-03-01 00:51:53,449:DEBUG:140220397770496:finished querying values for [], count = 0
2019-03-01 00:51:53,451:DEBUG:140220397770496:finished querying values for ['analysis/inferred_section'], count = 0
2019-03-01 00:52:36,638:DEBUG:140220397770496:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-03-01 00:53:16,953:DEBUG:140220397770496:Found 0 results
2019-03-01 00:53:16,954:DEBUG:140220397770496:Returning entry with length 0 result
2019-03-01 00:53:16,954:INFO:140220397770496:Found no entries for user None, time_query <emission.storage.timeseries.timequery.TimeQuery object at 0x7f8776b953c8>
2019-03-01 00:53:16,954:DEBUG:140220397770496:END POST /result/metrics/timestamp 128.1795883178711
The query that we run against mongodb includes the keys metadata.key and data.start_ts, both of which should be in the index. Further, metadata.key is HASHED and we look it up directly, while data.start_ts is DESCENDING but we do a range query over it.
def _create_analysis_result_indices(tscoll):
tscoll.create_index([("metadata.key", pymongo.HASHED)])
# trips and sections
tscoll.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
And yet, the query takes 00:51:08 -> 00:51:53 = ~ 45 seconds. Then there are two more gaps of roughly 40 seconds each (00:51:53 -> 00:52:36 and 00:52:36 -> 00:53:16, presumably further calls to the database), so the overall response time is 00:51:08 -> 00:53:16 = 2 mins 8 secs.
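To double-check which plan MongoDB actually picks for this query, something like the following should work. This is a sketch, assuming pymongo's Cursor.explain(); on MongoDB 3.x the winning plan appears under queryPlanner, with IXSCAN meaning the index is used and COLLSCAN meaning a full collection scan.

import pprint
import emission.core.get_database as edb

query = {'$or': [{'metadata.key': 'analysis/inferred_section'}],
         'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}
plan = edb.get_analysis_timeseries_db().find(query).explain()
# IXSCAN in the winning plan means the index is used; COLLSCAN means a full scan
pprint.pprint(plan["queryPlanner"]["winningPlan"])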
Hmm, but when I ran the queries directly against the database, they were much faster.
In [245]: %timeit edb.get_analysis_timeseries_db().find({'$or': [{'metadata.key': 'analy
...: sis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188
...: 800}})
100 loops, best of 3: 11.3 ms per loop
In [246]: %timeit edb.get_timeseries_db().find({'$or': [{'metadata.key': 'analysis/infer
...: red_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}})
1000 loops, best of 3: 1.19 ms per loop
That is pretty bizarre, because between those two log lines, I literally do nothing but query the database.
logging.debug("orig_ts_db_keys = %s, analysis_ts_db_keys = %s" %
(orig_ts_db_keys, analysis_ts_db_keys))
orig_ts_db_result = self._get_entries_for_timeseries(self.timeseries_db,
orig_ts_db_keys,
time_query,
geo_query,
extra_query_list,
sort_key)
analysis_ts_db_result = self._get_entries_for_timeseries(self.analysis_timeseries_db,
analysis_ts_db_keys,
time_query,
geo_query,
extra_query_list,
sort_key)
logging.debug("orig_ts_db_matches = %s, analysis_ts_db_matches = %s" %
(orig_ts_db_result.count(), analysis_ts_db_result.count()))
Even if I sort (which shouldn't apply here, since the sort_key is None), it doesn't get much worse. That is not surprising, since there are no entries returned.
In [247]: %timeit edb.get_analysis_timeseries_db().find({'$or': [{'metadata.key': 'analy
...: sis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188
...: 800}}).sort("metadata.write_ts")
100 loops, best of 3: 14.8 ms per loop
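One caveat about these %timeit numbers (a general pymongo point, not something I re-measured on this dataset): find() returns a lazy cursor and does not contact the server until the cursor is iterated or counted, so the timings above mostly measure cursor construction. The server code quoted earlier calls count() on the results, which does execute the query, so a more faithful comparison would be something along these lines.

query = {'$or': [{'metadata.key': 'analysis/inferred_section'}],
         'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}

# count() is what the server code above calls, and it does execute the query
%timeit edb.get_analysis_timeseries_db().find(query).count()
# materializing the cursor is another way to force execution
%timeit list(edb.get_analysis_timeseries_db().find(query))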
This raises the hope that there is a simple fix in my implementation of the timeseries interface that can address the really bad performance problems. But I don't have a staging server where I can easily copy all this data...
OK, so although this is bad practice in general, I added some logs directly to the timeseries, and I think I know what's going on. Here are the logs without the fix, for the record - note the giant jumps in time between the log statements.
2019-03-01 05:09:47,488:DEBUG:140666554283776:finished querying values for ['manual/incident'], count = 0
2019-03-01 05:10:06,167:DEBUG:140666554283776:finished querying values for [], count = 0
2019-03-01 05:10:24,545:DEBUG:140666554283776:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-03-01 05:10:41,938:DEBUG:140666554283776:Returning entry with length 0 result
OK, with a one-line fix, the response time drops to ~ 150ms. And it was such a stupid reason, too. @jf87, thank you for giving me enough motivation to take a closer look.
I will check in this fix now, although I think that the underlying code was written to handle a use case that does not exist, so this can be simplified even further later.
2019-03-01 05:27:42,647:DEBUG:140164536702720:START POST /result/metrics/timestamp
2019-03-01 05:27:42,805:DEBUG:140164536702720:END POST /result/metrics/timestamp ea59084e-11d4-4076-9252-3b9a29ce35e0 0.15816450119018555
I will commit the fix tomorrow because I want to carefully compose a commit message. If anybody wants to try out the fix, it is below. And it is also very close to the height of stupidity.
--- a/emission/storage/timeseries/builtin_timeseries.py
+++ b/emission/storage/timeseries/builtin_timeseries.py
@@ -20,7 +20,7 @@ ts_enum_map = {
esta.EntryType.ANALYSIS_TYPE: edb.get_analysis_timeseries_db()
}
-INVALID_QUERY = {'1': '2'}
+INVALID_QUERY = {'metadata.key': 'invalid'}
class BuiltinTimeSeries(esta.TimeSeries):
def __init__(self, user_id):
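For context on why such a tiny change matters (my reading of it, consistent with the timings above but not separately verified): {'1': '2'} filters on a field that has no index, so even though it matches nothing, MongoDB presumably has to do a full collection scan over ~ 46 million documents every time the sentinel query is counted; {'metadata.key': 'invalid'} can be answered from the index on metadata.key (shown in the index-creation snippet earlier, at least for the analysis collection) and comes back essentially immediately. A sketch of how to see the difference in the query plans:

import pprint
import emission.core.get_database as edb

ts = edb.get_timeseries_db()
# old sentinel: no index on a field named '1', so expect a COLLSCAN
pprint.pprint(ts.find({'1': '2'}).explain()["queryPlanner"]["winningPlan"])
# new sentinel: metadata.key is indexed, so expect an IXSCAN that matches nothing
pprint.pprint(ts.find({'metadata.key': 'invalid'}).explain()["queryPlanner"]["winningPlan"])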
PR for the one-liner is https://github.com/e-mission/e-mission-server/pull/638
However, the invalid query change was introduced in Aug 2018, while this issue was filed in Jul 2017 and edited in Jul 2018, so I don't think the PR fixes everything. There are probably some other issues with retrieving non-blank data.
Closing this for now as fixed by https://github.com/e-mission/e-mission-server/pull/638
Right now, the pipeline is so slow that we can only run it once a day. It would be nice to be able to run it more frequently. Let's take a quick look at what's taking the time, and see whether we can do a quick fix to speed it up.