e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

quick fixes for scalability? #261

Closed shankari closed 4 years ago

shankari commented 7 years ago

Right now, the pipeline is so slow that we can only run it once a day. It would be nice to be able to run it more frequently. Let's take a quick look at what's taking the time, and see whether we can do a quick fix to speed it up.

shankari commented 6 years ago

I thought that the slowness might be because there is too much monitoring data (e.g. API calls, pipeline_time, etc.). But no, the bulk of the data is real, and it is in the background/location, background/filtered_location and background/motion_activity categories.

stats/server_api_time -> 1,776,462
statemachine/transition -> 885,091
background/location -> 11,900,375
background/motion_activity -> 15,228,251
background/filtered_location_ignore -> 7657
background/filtered_location -> 10,078,204
background/battery -> 1,892,910
config/sensor_config -> 905
config/sync_config -> 298
config/consent -> 1427
stats/server_api_error -> 29,673
stats/client_time -> 2,513,973
stats/client_nav_event -> 325,137
stats/pipeline_time -> 1,125,218
stats/client_error -> 11,780
stats/pipeline_error -> 1,453
manual/incident -> 90
manual/mode_confirm -> 118
manual/purpose_confirm -> 111
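
For reference, a per-key breakdown like the one above can be generated in a single aggregation pass over the timeseries collection. A minimal sketch, assuming edb is the usual emission.core.get_database handle and get_timeseries_db() returns the underlying pymongo collection:

# Sketch: count entries per metadata.key in one aggregation pass.
import emission.core.get_database as edb

per_key_counts = edb.get_timeseries_db().aggregate([
    {"$group": {"_id": "$metadata.key", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
], allowDiskUse=True)

for row in per_key_counts:
    print("%s -> %d" % (row["_id"], row["count"]))
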
shankari commented 5 years ago

@jf87 this is the previous bug I had filed for scalability. As you can see, I didn't get very far in my checking...

I also wrote up the goals to motivate an undergrad to work on them: https://github.com/e-mission/e-mission-docs/blob/master/docs/future_work/scalability.md

and he actually wrote a trace generator, which (unfortunately) needs some cleanup before it can be merged: https://github.com/e-mission/e-mission-server/pull/622

shankari commented 5 years ago

The last time I looked at this, I thought that the issue was with the database. Here's a quick check. We have ~46 million entries in the database. Counting them takes ~1.4 ms, which seems reasonable.

In [243]: edb.get_timeseries_db().find().count()
Out[243]: 46416708

In [244]: %timeit edb.get_timeseries_db().find().count()
1000 loops, best of 3: 1.39 ms per loop
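
(A side note, not something from the original investigation: a count() with no filter is answered from the collection metadata, so it stays cheap no matter how large the collection gets; a filtered count actually has to consult an index or scan documents, and is a more honest proxy for query cost. A quick sketch using the same edb handle:)

# Full-collection count: served from collection metadata, cheap even at ~46M documents.
total = edb.get_timeseries_db().find().count()

# Filtered count: has to walk an index (or the collection), so it better
# reflects how real queries behave as the collection grows.
locations = edb.get_timeseries_db().find({"metadata.key": "background/location"}).count()

print(total, locations)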

But although the user_id field is indexed,

    TimeSeries = _get_current_db().Stage_timeseries
    TimeSeries.create_index([("user_id", pymongo.HASHED)])

Calling distinct on it takes roughly 30,000 times as long (43 seconds).

In [241]: len(edb.get_timeseries_db().distinct("user_id"))
Out[241]: 1291

In [242]: %timeit edb.get_timeseries_db().distinct("user_id")
1 loop, best of 3: 43 s per loop
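
One thing worth comparing against distinct() (a sketch only; whether it is actually faster here would need to be measured) is enumerating the user_ids with a $group stage in the aggregation pipeline instead:

# Sketch: enumerate distinct user_ids via $group instead of distinct().
distinct_user_ids = [row["_id"] for row in edb.get_timeseries_db().aggregate(
    [{"$group": {"_id": "$user_id"}}],
    allowDiskUse=True)]

print(len(distinct_user_ids))
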
shankari commented 5 years ago

I looked at the database in the first place because of logs like this. These are the logs for a single call that reads the metrics and shows them in the dashboard.

2019-03-01 00:51:08,775:DEBUG:140220397770496:START POST /result/metrics/timestamp
2019-03-01 00:51:08,775:DEBUG:140220397770496:metric_list = ['duration', 'median_speed', 'count', 'distance']
2019-03-01 00:51:08,775:DEBUG:140220397770496:['duration -> <function get_duration at 0x7f87a1802950>', 'median_speed -> <function get_median_speed at 0x7f87a18029d8>', 'count -> <function get_count at 0x7f87a1802840>', 'distance -> <function get_distance at 0x7f87a18028c8>']
2019-03-01 00:51:08,775:DEBUG:140220397770496:for user None, returning timeseries <emission.storage.timeseries.aggregate_timeseries.AggregateTimeSeries object at 0x7f8776b957f0>
2019-03-01 00:51:08,775:DEBUG:140220397770496:curr_query = {'$or': [{'metadata.key': 'analysis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}, sort_key = None
2019-03-01 00:51:08,775:DEBUG:140220397770496:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/inferred_section']
2019-03-01 00:51:53,449:DEBUG:140220397770496:finished querying values for [], count = 0
2019-03-01 00:51:53,451:DEBUG:140220397770496:finished querying values for ['analysis/inferred_section'], count = 0
2019-03-01 00:52:36,638:DEBUG:140220397770496:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-03-01 00:53:16,953:DEBUG:140220397770496:Found 0 results
2019-03-01 00:53:16,954:DEBUG:140220397770496:Returning entry with length 0 result
2019-03-01 00:53:16,954:INFO:140220397770496:Found no entries for user None, time_query <emission.storage.timeseries.timequery.TimeQuery object at 0x7f8776b953c8>
2019-03-01 00:53:16,954:DEBUG:140220397770496:END POST /result/metrics/timestamp  128.1795883178711

The query that we run against MongoDB includes the keys metadata.key and data.start_ts, both of which should be covered by indexes. Further, metadata.key is HASHED and we look it up with an exact match, while data.start_ts is DESCENDING but we do a range query over it.

def _create_analysis_result_indices(tscoll):
    tscoll.create_index([("metadata.key", pymongo.HASHED)])

    # trips and sections
    tscoll.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)

And yet the query takes 00:51:08 -> 00:51:53 = ~45 seconds. Then there is another ~40 second gap (maybe a second call to the database) from 00:52:36 -> 00:53:16, so the overall response time is 00:51:08 -> 00:53:16 = 2 mins 8 secs.
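
One quick way to rule the indexes in or out (a sketch of the kind of check that could be run against a live copy; this is not output from the server above) is to ask the planner what it does with this exact filter:

# Sketch: inspect the winning plan for the slow filter from the log above.
slow_query = {'$or': [{'metadata.key': 'analysis/inferred_section'}],
              'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}

plan = edb.get_analysis_timeseries_db().find(slow_query).explain()
print(plan['queryPlanner']['winningPlan'])

An IXSCAN stage in the winning plan means one of the indexes above is being used; a COLLSCAN would point at a missing or unusable index.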

shankari commented 5 years ago

Hm, but when I tried to run the queries directly against the database, they were much faster.

In [245]: %timeit edb.get_analysis_timeseries_db().find({'$or': [{'metadata.key': 'analy
     ...: sis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188
     ...: 800}})
100 loops, best of 3: 11.3 ms per loop

In [246]: %timeit edb.get_timeseries_db().find({'$or': [{'metadata.key': 'analysis/infer
     ...: red_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}})
1000 loops, best of 3: 1.19 ms per loop

That is pretty bizarre, because between those two log lines I literally do nothing but query the database.

        logging.debug("orig_ts_db_keys = %s, analysis_ts_db_keys = %s" %
            (orig_ts_db_keys, analysis_ts_db_keys))

        orig_ts_db_result = self._get_entries_for_timeseries(self.timeseries_db,
                                                             orig_ts_db_keys,
                                                             time_query,
                                                             geo_query,
                                                             extra_query_list,
                                                             sort_key)

        analysis_ts_db_result = self._get_entries_for_timeseries(self.analysis_timeseries_db,
                                                                 analysis_ts_db_keys,
                                                                 time_query,
                                                                 geo_query,
                                                                 extra_query_list,
                                                                 sort_key)
        logging.debug("orig_ts_db_matches = %s, analysis_ts_db_matches = %s" %
            (orig_ts_db_result.count(), analysis_ts_db_result.count()))
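
One caveat worth flagging when reading the %timeit numbers above (general pymongo behaviour, not specific to this code): find() only constructs a cursor, and the server is not contacted until the cursor is iterated, so timing a bare find() mostly measures client-side object construction. Forcing iteration gives a more representative number. A sketch:

# Sketch: force the query to actually execute while benchmarking it.
import timeit

query = {'$or': [{'metadata.key': 'analysis/inferred_section'}],
         'data.start_ts': {'$lte': 1551312000, '$gte': 1550188800}}

elapsed = timeit.timeit(
    lambda: list(edb.get_analysis_timeseries_db().find(query)),
    number=10)
print("%.3f s per query" % (elapsed / 10))
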
shankari commented 5 years ago

Even if I sort (which shouldn't apply, since the sort_key is None), it doesn't get much worse. That is not surprising, since no entries are returned.

In [247]: %timeit edb.get_analysis_timeseries_db().find({'$or': [{'metadata.key': 'analy
     ...: sis/inferred_section'}], 'data.start_ts': {'$lte': 1551312000, '$gte': 1550188
     ...: 800}}).sort("metadata.write_ts")
100 loops, best of 3: 14.8 ms per loop
shankari commented 5 years ago

This raises the hope that a simple change to my implementation of the timeseries interface can fix the really bad performance. But I don't have a staging server where I can easily copy all this data...

shankari commented 5 years ago

OK, so although this is bad practice in general, I added some logs directly to the timeseries code, and I think I know what's going on. For the record, here are the logs without the fix - note the giant jumps in time between consecutive log statements.

2019-03-01 05:09:47,488:DEBUG:140666554283776:finished querying values for ['manual/incident'], count = 0
2019-03-01 05:10:06,167:DEBUG:140666554283776:finished querying values for [], count = 0
2019-03-01 05:10:24,545:DEBUG:140666554283776:orig_ts_db_matches = 0, analysis_ts_db_matches = 0
2019-03-01 05:10:41,938:DEBUG:140666554283776:Returning entry with length 0 result
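
The "logs added directly to the timeseries" were ad-hoc; a minimal sketch of that kind of instrumentation (a hypothetical helper, not the actual code that was added) looks something like this:

# Sketch: time a suspect call and log how long it took.
import logging
import time

def timed(label, fn, *args, **kwargs):
    start = time.time()
    result = fn(*args, **kwargs)
    logging.debug("%s took %.3f s", label, time.time() - start)
    return result

# e.g. wrapping one of the calls from the earlier snippet:
# orig_ts_db_result = timed("orig_ts_db query", self._get_entries_for_timeseries,
#                           self.timeseries_db, orig_ts_db_keys, time_query,
#                           geo_query, extra_query_list, sort_key)
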
shankari commented 5 years ago

OK, with a one-line fix, the response time drops to ~150 ms. And it was such a stupid reason too. @jf87 thank you for giving me enough motivation to take a closer look.

I will check in this fix now, although I think that the underlying cause was code added to handle a use case that does not exist, and this can be simplified even further later.

2019-03-01 05:27:42,647:DEBUG:140164536702720:START POST /result/metrics/timestamp
2019-03-01 05:27:42,805:DEBUG:140164536702720:END POST /result/metrics/timestamp ea59084e-11d4-4076-9252-3b9a29ce35e0 0.15816450119018555
shankari commented 5 years ago

I will commit the fix tomorrow because I want to carefully compose a commit message. If anybody wants to try out the fix, it is below. And it is also very close to the height of stupidity.

--- a/emission/storage/timeseries/builtin_timeseries.py
+++ b/emission/storage/timeseries/builtin_timeseries.py
@@ -20,7 +20,7 @@ ts_enum_map = {
     esta.EntryType.ANALYSIS_TYPE: edb.get_analysis_timeseries_db()
 }

-INVALID_QUERY = {'1': '2'}
+INVALID_QUERY = {'metadata.key': 'invalid'}

 class BuiltinTimeSeries(esta.TimeSeries):
     def __init__(self, user_id):
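
For the record, why this one-line change matters so much (spelling out what the diff implies, based on the index definitions above): the old sentinel {'1': '2'} filters on a field named '1' that has no index, so a query built around it can fall back to a full collection scan over tens of millions of documents, whereas {'metadata.key': 'invalid'} is an equality match against the hashed metadata.key index and matches nothing almost instantly. A quick way to see the difference on a live copy, as a sketch with the same edb handle:

# Sketch: compare the two sentinel filters; both match nothing, but only one
# of them can use an index to discover that quickly.
import timeit

for sentinel in ({'1': '2'}, {'metadata.key': 'invalid'}):
    elapsed = timeit.timeit(
        lambda: list(edb.get_timeseries_db().find(sentinel)),
        number=1)
    print(sentinel, "->", "%.3f s" % elapsed)
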
shankari commented 5 years ago

PR for the one-liner is https://github.com/e-mission/e-mission-server/pull/638

However, the invalid query change was introduced in Aug 2018, while this issue was filed in Jul 2017 and edited in Jul 2018, so I don't think the PR fixes everything. There are probably other issues with retrieving non-blank data.

shankari commented 4 years ago

Closing this for now as fixed by https://github.com/e-mission/e-mission-server/pull/638