Split the server for greater scalability

e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.

https://e-mission.readthedocs.io/en/latest

BSD 3-Clause "New" or "Revised" License

15 stars 32 forks source link

Split the server for greater scalability #292

Closed shankari closed 5 years ago

shankari commented 6 years ago

The server scalability had deteriorated to the point where we were unable to run the pipeline even once per day. While part of this is probably just the way we are using mongodb, part of it is also that the server resources are running out.

So I turned off the pipeline around a month ago (last run was on 2017-10-24 21:41:18).

Now, I want to re-provision with a better, split architecture, and reserved instances for lower costs.

shankari commented 6 years ago

I can specify a filter for the resources, but it is not clear how complicated the regex can be. Let's try op - that seems to match all the compute resources:

aws-op-*,
security groups: open-,
volumes em-open
VPC aws-op-vpc

shankari commented 6 years ago

Loaded public data using

$ ./e-mission-py.bash bin/debug/load_multi_timeline_for_range.py -d 1 -v 10000  -p interscity-user /code/transfer/odc-usp-2017/user 2>&1 | tee /code/transfer/load_odc-usp.log
$ ./e-mission-py.bash bin/debug/load_multi_timeline_for_range.py -d 1 -v 10000  -p test-phone-user /code/transfer/public_data/dump 2>&1 | tee /code/transfer/load_test_phones.log

shankari commented 6 years ago

While running the analysis pipeline, encountered

Traceback (most recent call last):
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 118, in save_cleaned_segments_for_timeline
    filtered_trip = get_filtered_trip(ts, trip)
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 184, in get_filtered_trip
    (filtered_section_entry, point_list) = get_filtered_section(filtered_trip_entry, section)
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 294, in get_filtered_section
    overridden_mode = get_overriden_mode(section.data, filtered_section_data, with_speeds_df)
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 496, in get_overriden_mode
    (raw_section_data.trip_id, cleaned_section_data.trip_id))

shankari commented 6 years ago

Next error

Cleaning and resampling failed for user 04bc05b3-4c2d-4138-bfbb-7eb45dce709a
Traceback (most recent call last):
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 81, in clean_and_resample
    last_raw_place = save_cleaned_segments_for_ts(user_id, time_query.startTs, time_query.endTs)
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 107, in save_cleaned_segments_for_ts
    return save_cleaned_segments_for_timeline(user_id, tl)
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 133, in save_cleaned_segments_for_timeline
    (last_cleaned_place, filtered_tl) = create_and_link_timeline(tl, user_id, trip_map)
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 909, in create_and_link_timeline
    link_trip_start(ts, curr_cleaned_trip, curr_cleaned_start_place, raw_start_place)
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 963, in link_trip_start
    ts, cleaned_trip_data, cleaned_start_place_data)
  File "/Users/shankari/e-mission/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 1101, in _fix_squished_place_mismatch
    assert False
AssertionError

On further investigation, the mismatch is an extra 0 at the end. That means that we have an extra reconstructed location when we re-query. Investigating further...

[0.0, 0.7874599418825768, 1.6700112310347082, 1.6700083407891528, 2.8850795868410497, 3.612041604838898, 3.490672881889288, 0.13185994542031884, 2.2685465126699547, 3.7544348542123562, 6.241591837653119, 6.513255262776884, 4.303879054109581, 1.0062625290710587, 11.85287083926253, 3.796945102876728, 4.4811855331967525, 4.9256719058025284, 5.33856586893377, 18.01958238254823, 15.337483277942944, 9.817315391362875, 0.11879683970093337, 1.6133514703039964, 5.981594572128506, 5.297948876043496, 2.0231189531744405, 0.06495441900389091, 0.010212308476347307, 0.008375715841469174, 0.1576887649267426, 0.8043412604987412, 0.0]
[0.0, 0.7874599418825768, 1.6700112310347082, 1.6700083407891528, 2.8850795868410497, 3.612041604838898, 3.490672881889288, 0.13185994542031884, 2.2685465126699547, 3.7544348542123562, 6.241591837653119, 6.513255262776884, 4.303879054109581, 1.0062625290710587, 11.85287083926253, 3.796945102876728, 4.4811855331967525, 4.9256719058025284, 5.33856586893377, 18.01958238254823, 15.337483277942944, 9.817315391362875, 0.11879683970093337, 1.6133514703039964, 5.981594572128506, 5.297948876043496, 2.0231189531744405, 0.06495441900389091, 0.010212308476347307, 0.008375715841469174, 0.1576887649267426, 0.8043412604987412]

shankari commented 6 years ago

Ok, so this is because the last two points are in fact the same point, although the object IDs are different.

2017-12-16 10:47:37,954:DEBUG:140737149985728:fix_squished_place: after recomputing for validation, with_speeds_df.tail =                          _id            ts                          fmt_time  \
28  5a356a46f6858fcefd76d29e  1.452972e+09  2016-01-16T11:16:28.072312-08:00
29  5a356a46f6858fcefd76d29f  1.452972e+09  2016-01-16T11:16:58.072312-08:00
30  5a356a46f6858fcefd76d2a0  1.452972e+09  2016-01-16T11:17:28.072312-08:00
31  5a356a46f6858fcefd76d2a1  1.452972e+09         2016-01-16T11:17:38-08:00
32  5a356a46f6858fcefd76d2a3  1.452972e+09         2016-01-16T11:17:38-08:00

     latitude   longitude  distance     speed
28  37.391324 -122.060069  0.306369  0.010212
29  37.391326 -122.060070  0.251271  0.008376
30  37.391293 -122.060036  4.730663  0.157689
31  37.391226 -122.060066  7.985249  0.804341
32  37.391226 -122.060066  0.000000  0.000000

Let's see when the second object id was created.

2017-12-16 10:47:34,809:DEBUG:140737149985728:About to store 24 points for section 5a356a44f6858fcefd76d270
2017-12-16 10:47:34,814:DEBUG:140737149985728:Returning multi_result.inserted_ids = [ObjectId('5a356a46f6858fcefd76d2a3'), ObjectId('5a356a46f6858fcefd76d2a4'), ObjectId('5a356a46f6858fcefd76d2a5'), ObjectId('5a356a46f6858fcefd76d2a6'), ObjectId('5a356a46f6858fcefd76d2a7'), ObjectId('5a356a46f6858fcefd76d2a8'), ObjectId('5a356a46f6858fcefd76d2a9'), ObjectId('5a356a46f6858fcefd76d2aa'), ObjectId('5a356a46f6858fcefd76d2ab'), ObjectId('5a356a46f6858fcefd76d2ac')]... of length 24

It was created for section 5a356a44f6858fcefd76d270 and appears to be the first point for that section. The currently retrieved head is

2017-12-16 10:47:37,948:DEBUG:140737149985728:fix_squished_place: after recomputing for validation, with_speeds_df.head =                         _id            ts                          fmt_time  \
0  5a356a49f6858fcefd76d2f0  1.452971e+09  2016-01-16T11:02:28.072312-08:00
1  5a356a46f6858fcefd76d283  1.452971e+09  2016-01-16T11:02:58.072312-08:00
2  5a356a46f6858fcefd76d284  1.452971e+09  2016-01-16T11:03:28.072312-08:00
3  5a356a46f6858fcefd76d285  1.452971e+09  2016-01-16T11:03:58.072312-08:00
4  5a356a46f6858fcefd76d286  1.452971e+09  2016-01-16T11:04:28.072312-08:00

    latitude   longitude   distance     speed
0  37.391099 -122.086379   0.000000  0.000000
1  37.390900 -122.086471  23.623798  0.787460
2  37.391260 -122.086131  50.100337  1.670011
3  37.391621 -122.085790  50.100250  1.670008
4  37.390948 -122.086282  86.552388  2.885080

which is was created for section 5a356a44f6858fcefd76d26f not 5a356a44f6858fcefd76d270

2017-12-16 10:47:34,803:DEBUG:140737149985728:About to store 31 points for section 5a356a44f6858fcefd76d26f
2017-12-16 10:47:34,809:DEBUG:140737149985728:Returning multi_result.inserted_ids = [ObjectId('5a356a46f6858fcefd76d283'), ObjectId('5a356a46f6858fcefd76d284'), ObjectId('5a356a46f6858fcefd76d285'), ObjectId('5a356a46f6858fcefd76d286'), ObjectId('5a356a46f6858fcefd76d287'), ObjectId('5a356a46f6858fcefd76d288'), ObjectId('5a356a46f6858fcefd76d289'), ObjectId('5a356a46f6858fcefd76d28a'), ObjectId('5a356a46f6858fcefd76d28b'), ObjectId('5a356a46f6858fcefd76d28c')]... of length 31

Let us look at the start time and end time for these two sections. 5a356a44f6858fcefd76d26f is

Inserting entry Entry({'_id': ObjectId('5a356a44f6858fcefd76d26f')
 'user_id': UUID('04bc05b3-4c2d-4138-bfbb-7eb45dce709a')
 'metadata': Metadata({'key': 'segmentation/raw_section'
 'data': Section('start_ts': 1452971053.281
 'start_fmt_time': '2016-01-16T11:04:13.281000-08:00'
 'end_ts': 1452971858
 'end_fmt_time': '2016-01-16T11:17:38-08:00'
 'start_loc': {"coordinates": [-122.0856179 37.3918034] "type": "Point"}
 'end_loc': {"coordinates": [-122.06006589 37.39122569] "type": "Point"}
 'duration': 804.7190001010895
 'source': 'SmoothedHighConfidenceMotion'
 'sensed_mode': 0})}) into timeseries

5a356a44f6858fcefd76d270 is

2017-12-16 10:47:32,190:DEBUG:140737149985728:Inserting entry Entry({'_id': ObjectId('5a356a44f6858fcefd76d270')
 'user_id': UUID('04bc05b3-4c2d-4138-bfbb-7eb45dce709a')
 'metadata': Metadata({'key': 'segmentation/raw_section'
 'data': Section({'start_ts': 1452971858
 'start_fmt_time': '2016-01-16T11:17:38-08:00'
 'end_ts': 1452972533
 'end_fmt_time': '2016-01-16T11:28:53-08:00'
 'start_loc': {"coordinates": [-122.06006589 37.39122569] "type": "Point"}
 'end_loc': {"coordinates": [-122.058218 37.39069702] "type": "Point"}
 'duration': 675
 'source': 'SmoothedHighConfidenceMotion'
 'sensed_mode': 2 })}) into timeseries

So basically the last point of the first section = first point of the section section, so the query brings up the additional point. That is not such a bad thing, and I don't think that it corresponds to an error in the diary generation. Let's verify from the UI.

Trip view	Section view

shankari commented 6 years ago

so to fix it, we need to strip out the last row (and potentially the last few rows!) if their speed and distance are zero. Given that we are now coming out of beta, maybe we should also ignore these asserts instead of failing during them. Others running the system will not find it easy to view the errors.

We can't use the last few rows because there might only be < few points, and then the first point will also be deleted. We will check for strictly the last row.

We can still log the errors, so on our own server(s), we can investigate errors. And the tests will still fail on regression because the ground truth won't match. Or we could have an extra validation flag that is on by default, and turned off on production builds. Let's go with the validation flag for now.

shankari commented 6 years ago

There is one more final issue left. We get these warnings while running the multi-processed pipeline.

/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "
/Users/shankari/OSS/anaconda/envs/emission/lib/python3.6/site-packages/pymongo/topology.py:145: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#pymongo-fork-safe>
  "MongoClient opened before fork. Create MongoClient "

The error message is pretty self-explanatory. Unfortunately, simply passing connect=False is not sufficient, because we do access the database in the scheduler before the fork.

    sel_uuids = esdu.get_all_uuids()
    sel_jobs = [(u, get_count(u)) for u in sel_uuids]

https://stackoverflow.com/questions/46441918/mongoclient-opened-before-fork-create-mongoclient https://stackoverflow.com/questions/44133435/mongoengine-and-dealing-with-userwarning-mongoclient-opened-before-fork-creat

shankari commented 6 years ago

Looking at the multiprocess documentation, it looks like using fork, the default on unix, https://docs.python.org/3/library/multiprocessing.html is the problem.

All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.

Using spawn would seem to obviate this at the expense of performance.

The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process objects run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.

Let's see if this fixes it. Much easier that a more complicated restructuring.

shankari commented 6 years ago

That seems to have fixed it. We connect to the database 4 times - one for the master and 3 for the three forks. Yay! The performance hit is not too bad because we spawn only three-four subprocesses once at the beginning of every run.

storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
debug not configured, falling back to sample, default configuration
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost

shankari commented 6 years ago

Ran into an error similar https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352209647 to again. This is clearly with the fix (https://github.com/shankari/e-mission-server/commit/dd7a18d24df6f98367d85ecaf23644c4f9f7a2a4) because I see the additional logging introduced in it, and apparently we did drop the last entry. And the previous entries were non-zero.

Need to see if it is reproducible.

2017-12-17 15:20:43,359:DEBUG:140047176312576:Found zero-speed, distance entries may nee
d to drop last entry                          _id            ts
 fmt_time  \
28  5a368b0cae2e7353fd79279b  1.452972e+09  2016-01-16T11:16:28.072312-08:00
29  5a368b0cae2e7353fd79279c  1.452972e+09  2016-01-16T11:16:58.072312-08:00
30  5a368b0cae2e7353fd79279d  1.452972e+09  2016-01-16T11:17:28.072312-08:00
31  5a368b0cae2e7353fd7927a0  1.452972e+09         2016-01-16T11:17:38-08:00
32  5a368b0cae2e7353fd79279e  1.452972e+09         2016-01-16T11:17:38-08:00

     latitude   longitude  distance     speed
28  37.391324 -122.060069  0.306369  0.010212
29  37.391326 -122.060070  0.251271  0.008376
30  37.391293 -122.060036  4.730663  0.157689
31  37.391226 -122.060066  7.985249  0.804341
32  37.391226 -122.060066  0.000000  0.000000

2017-12-17 15:20:43,359:DEBUG:140047176312576:delta_ts = 0.0, dropping last entry
2017-12-17 15:20:43,362:ERROR:140047176312576:[0.0, 0.603780040015428, 1.6700112310347082, 1.6700083407891528, 2.8850795868410497, 3.612041604838898, 3.490672881889288, 0.13185994542031884, 2.2685465126699547, 3.7544348542123562] != [0.0, 0.603780040015428, 1.6700112310347082, 1.6700083407891528, 2.8850795868410497, 3.612041604838898, 3.490672881889288, 0.13185994542031884, 2.2685465126699547, 3.7544348542123562]
2017-12-17 15:20:43,363:ERROR:140047176312576:Cleaning and resampling failed for user 86842c35-da28-32ed-a90e-2da6663c5c73
Traceback (most recent call last):
  File "/code/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 88, in clean_and_resample
    last_raw_place = save_cleaned_segments_for_ts(user_id, time_query.startTs, time_query.endTs)
  File "/code/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 114, in save_cleaned_segments_for_ts
    return save_cleaned_segments_for_timeline(user_id, tl)
  File "/code/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 140, in save_cleaned_segments_for_timeline
    (last_cleaned_place, filtered_tl) = create_and_link_timeline(tl, user_id, trip_map)
  File "/code/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 916, in create_and_link_timeline
    link_trip_start(ts, curr_cleaned_trip, curr_cleaned_start_place, raw_start_place)
  File "/code/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 970, in link_trip_start
    ts, cleaned_trip_data, cleaned_start_place_data)
  File "/code/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 1127, in _fix_squished_place_mismatch
    assert False
AssertionError

shankari commented 6 years ago

Changed the labels for the test phones to be more meaningful. Based on https://github.com/e-mission/e-mission-server/commit/310d045e8fc4eeb7524f4967ef297903d25a0b6d

$ ./e-mission-py.bash bin/historical/migrations/reset_test_phone_emails.py Connecting to database URL mongodb://open-data-rw:((22-remember-WRITTEN-lone-78))@[2600:1f18:45b0:7f01:e024:7991:e056:101e]/admin?authMechanism=SCRAM-SHA-1 {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True} {'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True}

shankari commented 6 years ago

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352340874 it is reproducible. This is very weird because this data is for Jan 16

          fmt_time  \
0  5a3e7b4df6858f4dd712518c  1.452971e+09  2016-01-16T11:02:28.072312-08:00
1  5a3e7b4bf6858f4dd712511f  1.452971e+09  2016-01-16T11:02:58.072312-08:00
2  5a3e7b4bf6858f4dd7125120  1.452971e+09  2016-01-16T11:03:28.072312-08:00
3  5a3e7b4bf6858f4dd7125121  1.452971e+09  2016-01-16T11:03:58.072312-08:00
4  5a3e7b4bf6858f4dd7125122  1.452971e+09  2016-01-16T11:04:28.072312-08:00

And the previous fix was also for Jan 16, and it is a part of the unit test cases and is passing. Hm. Maybe the previous error was for a different user (04bc05b3-4c2d-4138-bfbb-7eb45dce709a) https://github.com/e-mission/e-mission-server/issues/530#issuecomment-352202072

shankari commented 6 years ago

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-353733626 it is reproducible on alternate runs. This is so freaky.

$ ./e-mission-py.bash bin/purge_analysis_database.py
$ ./e-mission-py.bash bin/debug/intake_single_user.py -e test-assertion-not-fixed  (PASSES)
$ ./e-mission-py.bash bin/purge_analysis_database.py
$ ./e-mission-py.bash bin/debug/intake_single_user.py -e test-assertion-not-fixed   (FAILS)

No idea why that is happening.

shankari commented 6 years ago

ok, so this happens because sometimes the point from the other section appears to have an earlier write_ts than the point from our section (see point with index = 31, unsure why). So when we drop the point, we drop the point from our section and not the other section.

2017-12-23 22:39:50,998:DEBUG:140737149985728:Found zero-speed, distance entries may need to drop last entry                          _id                   section            ts  \
28  5a3f4bb5f6858f52c88c8c32  5a3f4bb5f6858f52c88c8c16  1.452972e+09
29  5a3f4bb5f6858f52c88c8c33  5a3f4bb5f6858f52c88c8c16  1.452972e+09
30  5a3f4bb5f6858f52c88c8c34  5a3f4bb5f6858f52c88c8c16  1.452972e+09
31  5a3f4bb5f6858f52c88c8c37  5a3f4bb5f6858f52c88c8c36  1.452972e+09
32  5a3f4bb5f6858f52c88c8c35  5a3f4bb5f6858f52c88c8c16  1.452972e+09

We can fix this by dropping the point that is from a different section. But then the last point has a wrong calculated speed, so we need to recompute it too.

So we have two choices:

drop the entry before we recompute the speeds
relax the asserts, since this is for validation anyway

Going to try to drop the entry before we recompute the speeds since that feels more principled than removing the last entry. If that doesn't work, going to option (2)

shankari commented 6 years ago

No errors so far on the public server, so will start migrating the main e-mission server!!

shankari commented 6 years ago

Goal is to follow the instructions exactly from https://github.com/e-mission/e-mission-server/wiki/Multi-layer-stack

Provision the servers using the cloudformation template. Successful! Added the new IP addresses to my local /etc/hosts
Starting to configure the database server, but can't make any outgoing connections
- Turns out that the reason is that it was associated with a different (unnamed) routing table instead of default-route.
- even after that was fixed, it looked like it didn't work because outbound pings were only enabled for the local subnet over ipv4 and for the internet over ipv6, but this VPC didn't support ipv6.

After those fixes, was able to ping google.com. Need to update the cloudformation template...

shankari commented 6 years ago

ipv6 support in cloudformation appears to be extremely limited. https://forums.aws.amazon.com/click.jspa?searchID=8492229&messageID=751582

There is a somewhat unintuitive method to turn on ipv6 for a VPC. https://github.com/trailofbits/algo/commit/f3d0abf149191dbec2f2f3095c90c92fd53da0e7

But specifying an ipv6 CIDR for the two subnets seems pretty complicated, and there is no option to allow AWS to assign it by default, similar to the UI. The complex method here works for a single subnet but not for multiple ones, since it is not clear how to rewrite the ipv6 subnet from the VPC in two different ways. https://forums.aws.amazon.com/thread.jspa?messageID=773600&#773600

For the manually created VPC, we see

VPC CIDR: 2600:1f18:45b0:7f00::/56
public subnet CIDR: 2600:1f18:45b0:7f00::/64
private subnet CIDR: 2600:1f18:45b0:7f01::/64

Note that going from the VPC -> public subnet is simple because we essentially only have to remove the prefix. But going from VPC -> private subnet is more complex because we have to replace the last non-empty token with one that is off by one.

So I am going to focus on getting the ipv4 working out of the box, and then documenting (hopefully small) manual fixes for ipv6 later.

shankari commented 6 years ago

Setup default server group with:

SSL for the webserver
encrypted log directories on the webapp and analysis
encrypted data, journal and log on the database

No access control set up (yet!) for the database. We will set that up if we choose to use the notebook server. For now, ipython on either analysis or webapp will work just fine.

shankari commented 6 years ago

Next, I need to copy all the data over from the primary server. Note that we are changing storage versions (https://docs.mongodb.com/manual/tutorial/change-standalone-wiredtiger/).

Before I copy the data over, let me delete the open data which is already in the open server. But before any deletions, I want to take a backup.

So the sequence will be:

attach a new EBS volume
mongodump
delete copied data using bin/debug/purge_multi_timeline_for_range.py
check that regular data still exists
mongodump again
attach volume to new server
mongorestore
re-run analysis pipeline
DONE!!!

shankari commented 6 years ago

Updates along with notes on the progress of the plan outlined in https://github.com/e-mission/e-mission-server/issues/530#issuecomment-354682128

Attach a new EBS volume

Current database size is

$ df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/xvda1        7.8G  5.6G  1.9G  76% /
/dev/xvdd         7.8G  412M  7.0G   6% /home/e-mission
/dev/mapper/xvdb   37G   14G   22G  39% /mnt
/dev/mapper/xvdc   37G   19G   17G  54% /mnt/logs
/dev/mapper/xvdf  3.0T  143G  2.7T   5% /mnt/e-mission-primary-db

So let's create a ~ 300GB volume to leave some wiggle room.

The original webapp is in us-east-1c, while we are now creating stuff in us-east-1d. So we're going to have to use snapshots to migrate to us-east-1d.

Created new 300GB volume, named it migration-volume and attached it to the webapp as /dev/sdg

Logged in to the instance and mounted the volume

$ sudo mkdir -p /transfer
$ sudo mkfs.ext4 /dev/xvdg
$ sudo chown ubuntu:ubuntu /transfer
$ sudo mount /dev/xvdg /transfer

mongodump

Before dumping, get some stats on the number of entries of various types

In [878]: edb.get_uuid_db().find().count()
Out[878]: 516

In [879]: edb.get_tim
edb.get_timeseries_db        edb.get_timeseries_error_db

In [879]: edb.get_timeseries_db().find().count()
Out[879]: 53718353

In [880]: edb.get_analysis_timeseries_db().find().count()
Out[880]: 16573823

In [883]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
Out[883]: 487

In [884]: len(edb.get_usercache_db().find().distinct('user_id'))
Out[884]: 502

Tried running distinct on the timeseries, but took too long.

Actually dumping now.

$  mongodump --out /transfer/
2018-01-01T23:30:09.731+0000 all dbs
2018-01-01T23:30:09.731+0000 DATABASE: Stage_database    to     /transfer/Stage_database
2018-01-02T00:51:07.047+0000 DATABASE: admin     to     /transfer/admin

Took ~ 1.5 hours.

Kill the server

Now kill the server so we don't get any new data during the migration.

$ ps -aef | grep python
ubuntu    1130     1  0  2016 ?        01:26:55 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/supervisord
ubuntu   21856 21855  0  2016 pts/3    00:00:15 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/ipython
root     28163 28162  0  2017 ?        17:48:57 /home/ubuntu/anaconda/bin/python emission/net/api/cfc_webapp.py
ubuntu   32664 32663  0  2016 pts/4    00:05:22 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/ipython

$ kill -9 1130
$ sudo kill -9 28163
$ ps -aef | grep python
ubuntu    5577  2466  0 00:54 pts/0    00:00:00 grep --color=auto python
ubuntu   21856 21855  0  2016 pts/3    00:00:15 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/ipython
ubuntu   32664 32663  0  2016 pts/4    00:05:22 /home/ubuntu/anaconda/bin/python /home/ubuntu/anaconda/bin/ipython

Server is down. Got a "503 Service Unavailable" error.

Delete unwanted data

Delete all the entries that we have already copied to the open data server (interscity* + test phones)

(emission) $ ./e-mission-py.bash bin/debug/purge_multi_timeline_for_range.py /tmp/odc-usp-2017/user_

Now the uuid entries should be 516 - 29 = 487. Which they are!

In [891]: edb.get_uuid_db().find().count()
Out[891]: 487

However, the number of entries in the pipeline state is unchanged.

In [892]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
Out[892]: 487

Let's update the script to delete those too (https://github.com/e-mission/e-mission-server/pull/557).

$ ./e-mission-py.bash bin/debug/purge_multi_timeline_for_range.py --pipeline-purge /tmp/odc-usp-2017/user_
INFO:root:==================================================
INFO:root:Deleting data from file /tmp/odc-usp-2017/user_dcdb5f74-071a-4e5b-a954-e613c5b46e5d.gz
...
INFO:root:For uuid dcdb5f74-071a-4e5b-a954-e613c5b46e5d, deleting entries from the pipeline_state_db
INFO:root:result = {u'ok': 1, u'n': 7}

After this, pipeline states are also lowered by 29 entries, as expected.

In [901]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
Out[901]: 458

Delete all data from test phones.

$ ./e-mission-py.bash bin/debug/purge_multi_timeline_for_range.py --pipeline-purge /tmp/public_data/dump_

Now, all counts go down by a further ~ 13 entries.

In [904]: edb.get_uuid_db().find().count()
Out[904]: 475

In [905]: edb.get_timeseries_db().find().count()
Out[905]: 36644553

In [906]: edb.get_analysis_timeseries_db().find().count()
Out[906]: 15782219

In [908]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
Out[908]: 446

Delete all output config entries in the usercache. We don't copy these any more ( https://github.com/e-mission/e-mission-server/commit/c17f1a0fe9188462bde3fdd4866f78f99c8d998f) so to make things be consistent, we want to delete output config entries from the usercache.

But we also don't want to delete incoming config entries (like the consent) from the usercache. I think they will be pushed again, but better to avoid running into that issue in the first place. When we did run the consent stuff, we used to save the config using putDocument.

def save_config(user_id, uc, time_query, module_name):
    config_fn = get_configurator("emission.analysis.configs.%s" % module_name)
    logging.debug("for %s, config_fn = %s" % (module_name, config_fn))
    (config, last_processed_ts) = config_fn(user_id, time_query)
    logging.debug("for %s, config is %s" % (user_id, config))
    if config is not None:
        uc.putDocument("config/%s" % module_name, config)
        return last_processed_ts
    else:
        return 0

That stores the value as a document. When we get the incoming entries, they are of type rw-document. So we should be able to delete all documents from the three config types. Let's do some queries to ensure that this is OK.

We should have at least some rw-document entries since we had users sign up after I turned off the pipeline. And we should have at least some document entries since we were running the pipeline before.

Hm. There are 740 rw-documents and 161 documents. Neither of which matches the number of users.

In [895]: edb.get_usercache_db().find({"metadata.key": "config/consent", "metadata.type": "rw-document"}).count()
Out[895]: 740

In [495]: edb.get_usercache_db().find({"metadata.key": "config/consent", "metadata.type": "document"}).count()
Out[495]: 161

The timeseries has way more rw-documents. Basically, it looks like rw-documents are generated over and over.

In [896]: edb.get_timeseries_db().find({"metadata.key": "config/consent", "metadata.type": "rw-document"}).count()
Out[896]: 1564

In [897]: edb.get_timeseries_db().find({"metadata.key": "config/consent", "metadata.type": "document"}).count()
Out[897]: 0

And the counts of the three config objects are different. Wut!!

In [898]: edb.get_usercache_db().find({"metadata.key": "config/sensor_config", "metadata.type": "rw-document"}).count()
Out[898]: 500

In [899]: edb.get_usercache_db().find({"metadata.key": "config/sync_config", "metadata.type": "rw-document"}).count()
Out[899]: 160

No time to investigate this further, so for now, let us just delete all document entries for all the three types of config objects - config/sensor_config, config/sync_config, config/consent.

In [903]: edb.get_usercache_db().find({"metadata.type": "document", "metadata.key": {"$in": ['config/consent', 'config/sensor_config', 'config/sync_config']}}).count()
Out[903]: 239

In [910]: edb.get_usercache_db().remove({"metadata.type": "document", "metadata.key": {"$in": ['config/consent', 'config/sensor_config', 'config/sync_config']}})
Out[910]: {u'n': 239, u'ok': 1}

Remove unused collections. I originally planned to remove these from the dump before restoring instead of deleting these from the database. However, it turns out that there is a file system.indexes.bson that seems to have entries for all the collections in the database. So it is not clear that it is safe to do so, and we should really delete the collections before re-dumping.

Stage_alternative_trips
Stage_clients
Stage_common_place (empty)
Stage_common_trips (empty)
Stage_pending_signups
Stage_place (empty)
Stage_routeCluster
Stage_routeDistanceMatrix
Stage_section_new (empty)
Stage_stop (empty)
Stage_trip_new (empty)
Stage_user_moves_access
Stage_utility_models
Stage_Worktime

In [911]: edb.get_alternatives_db().remove()
Out[911]: {u'n': 101146, u'ok': 1}

In [912]: edb.get_client_db().remove()
Out[912]: {u'n': 3, u'ok': 1}

In [913]: edb.get_common_place_db().remove()
Out[913]: {u'n': 0, u'ok': 1}

In [914]: edb.get_common_trip_db().remove()
Out[914]: {u'n': 0, u'ok': 1}

In [915]: edb.get_pending_signup_db().remove()
Out[915]: {u'n': 25, u'ok': 1}

In [916]: edb._get_current_db().Stage_place.remove()
Out[916]: {u'n': 0, u'ok': 1}

In [917]: edb.get_routeCluster_db().remove()
Out[917]: {u'n': 90, u'ok': 1}

In [918]: edb._get_current_db().Stage_routeDistanceMatrix.remove()
Out[918]: {u'n': 7, u'ok': 1}

In [919]: edb._get_current_db().Stage_section_new.remove()
Out[919]: {u'n': 0, u'ok': 1}

In [920]: edb._get_current_db().Stage_stop.remove()
Out[920]: {u'n': 0, u'ok': 1}

In [921]: edb._get_current_db().Stage_trip_new.remove()
Out[921]: {u'n': 0, u'ok': 1}

In [922]: edb._get_current_db().Stage_user_moves_access.remove()
Out[922]: {u'n': 118, u'ok': 1}

In [923]: edb._get_current_db().Stage_utility_models.remove()
Out[923]: {u'n': 36, u'ok': 1}

In [924]: edb._get_current_db().Stage_Worktime.remove()
Out[924]: {u'n': 2662, u'ok': 1}

Stats are a bit tricky. In general, we want to delete stats from the stats-specific db once we have moved them to the timeseries. I know that I wrote a script for the migration (f4aa21ced3108a80d73ee1bb1588aa3532cd67c2), and I ran it locally to generate data for the TRB paper, but it is unclear that I ran it on the real server. Let's poke around and see if I did.

The script inserts the new stats entries, so I can't just see whether the _id matches. I can try to correlate against ts entries, though.

Didn't seem to have run it for client entries.

In [929]: entry_1 = edb.get_client_stats_db_backup().find_one(); entry_1
Out[929]:
{u'_id': ObjectId('543c8f4688f663587d21cd4e'),
 u'client_app_version': u'2.0.1',
 u'client_os_version': u'4.4.2',
 u'reading': u'4.0',
 u'reported_ts': 1413254982.201671,
 u'stat': u'battery_level',
 u'ts': u'1413254944995',
 u'user': UUID('f8fee20c-0f32-359d-ba75-bce97a7ac83b')}

In [934]: edb.get_timeseries_db().find_one({"metadata.key": "background/battery", "metadata.write_ts": entry_1["reported_ts"]})

In [935]: edb.get_timeseries_db().find_one({"metadata.key": "background/battery", "data.ts": int(entry_1["ts"])})

In [936]: edb.get_timeseries_db().find_one({"metadata.key": "background/battery", "data.ts": entry_1["ts"]})

Checking counts.

In [951]: edb.get_client_stats_db_backup().find({"user":  UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}).count()
Out[951]: 2556

In [952]: edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).count()
Out[952]: 13974

The oldest entry in the stats timeseries was

In [956]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
Out[956]:
[{u'_id': ObjectId('581b4c5588f6630386d0ea58'),
  u'data': {u'client_app_version': u'1.2.0',
   u'client_os_version': u'4.4.2',
   u'name': u'sync_launched',
   u'reading': None,
   u'ts': 1478184016437.0},
  u'metadata': {u'key': u'stats/client_nav_event',
   u'platform': u'server',
   u'time_zone': u'America/Los_Angeles',
   u'write_ts': 1478184016437.0},
  u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

Unfortunately, the timestamp is wrong because it is stored in ms instead of sec, but after fixing, it is ~ Oct 30, which is consistent with the switch (https://github.com/e-mission/e-mission-server/commit/7487c82578e8933f4da8f9d3fa3522c102906c81)

In [957]: arrow.get(1478184016437.0)
...
ValueError: year is out of range

In [958]: arrow.get(1478184016.437)
Out[958]: <Arrow [2016-11-03T14:40:16.437000+00:00]>

Sorting the other way gives us a valid entry, but one which is still fairly old because of the incorrect ts conversion.

In [959]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
Out[959]:
[{u'_id': ObjectId('581b8771cb17471ac0b16595'),
  u'data': {u'client_app_version': u'1.4.0',
   u'client_os_version': u'4.4.2',
   u'name': u'sync_launched',
   u'reading': -1,
   u'ts': 1478187644.294},
  u'metadata': {u'key': u'stats/client_nav_event',
   u'platform': u'android',
   u'read_ts': 1478206545.480631,
   u'time_zone': u'America/Los_Angeles',
   u'type': u'message',
   u'write_fmt_time': u'2016-11-03T08:40:44.544000-07:00',
   u'write_local_dt': {u'day': 3,
    u'hour': 8,
    u'minute': 40,
    u'month': 11,
    u'second': 44,
    u'timezone': u'America/Los_Angeles',
    u'weekday': 3,
    u'year': 2016},
   u'write_ts': 1478187644.544},
  u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [960]: arrow.get(1478187644.294)
Out[960]: <Arrow [2016-11-03T15:40:44.294000+00:00]>

But the entries in the stats DB end around 2016-11-02 so it is fairly clear that the conversion never happened.

In [969]: list(edb.get_client_stats_db_backup().find({"user": UUID("96af3842-d5fb-4f13-aea0-726efaeba6ea")}).sort("ts", -1).limit(3))

In [970]: arrow.get(1478058014.172)
Out[970]: <Arrow [2016-11-02T03:40:14.172000+00:00]>

In [971]: arrow.get(1478058014.172)
Out[971]: <Arrow [2016-11-02T03:40:14.172000+00:00]>

In [972]: arrow.get(1478054410.652)
Out[972]: <Arrow [2016-11-02T02:40:10.652000+00:00]>

So:

we need to convert the older stats
the oldest converted stats are wrong

In [975]: edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}, {"data.ts": {"$gt": now.timestamp}}).count()
Out[975]: 13974

In [984]: invalid_entries = 0

In [985]: for e in edb.get_client_stats_db_backup().find():
        if type(e['ts']) == str or type(e['ts']) == unicode:
                invalid_entries = invalid_entries + 1
   .....:

In [986]: invalid_entries
Out[986]: 321986

While checking stat counts, discovered that all my data was deleted since I was part of the open data collection, and purging deletes all data, not just data in the time range specified!! Aborting, restoring and then re-deleting invalid collections.

shankari commented 6 years ago

Let's fix the stats entries before restoring. That allows us to test out the stats conversion in a dry run before making the actual changes.

shankari commented 6 years ago

Fixing stats entries:

fix existing timeseries entries with ms instead of seconds

In [66]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}).count()
Out[66]: 3

In [68]: for e in edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}):
     edb.get_timeseries_db().update({"_id": e["_id"]},
      {"$set": {"data.ts": float(e["data"]["ts"])/1000,
               "metadata.write_ts": float(e["metadata"]["write_ts"])/1000}})

In [69]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]},
"data.ts": {"$gt": now.timestamp}}).count()
Out[69]: 0

Before:

[{u'_id': ObjectId('59c787af88f6636e03100808'),
  u'data': {u'client_app_version': u'2.1',
   u'client_os_version': u'5.0.1',
   u'name': u'sync_pull_list_size',
   u'reading': 0.0,
   u'ts': 1506248617116.0},
  u'metadata': {u'key': u'stats/client_time',
   u'platform': u'server',
   u'time_zone': u'America/Los_Angeles',
   u'write_ts': 1506248617116.0},
  u'user_id': UUID('850d2923-cc02-4e6c-939e-c762dd3ba189')},
 {u'_id': ObjectId('59c787af88f6636e0310080a'),
  u'data': {u'client_app_version': u'2.1',
   u'client_os_version': u'5.0.1',
   u'name': u'sync_push_list_size',
   u'reading': 0.0,
   u'ts': 1506248617116.0},
  u'metadata': {u'key': u'stats/client_time',
   u'platform': u'server',
   u'time_zone': u'America/Los_Angeles',
   u'write_ts': 1506248617116.0},
  u'user_id': UUID('850d2923-cc02-4e6c-939e-c762dd3ba189')},
 {u'_id': ObjectId('59c787af88f6636e0310080c'),
  u'data': {u'client_app_version': u'2.1',
   u'client_os_version': u'5.0.1',
   u'name': u'sync_launched',
   u'reading': None,
   u'ts': 1506248617116.0},
  u'metadata': {u'key': u'stats/client_nav_event',
   u'platform': u'server',
   u'time_zone': u'America/Los_Angeles',
   u'write_ts': 1506248617116.0},
  u'user_id': UUID('850d2923-cc02-4e6c-939e-c762dd3ba189')}]

After:

[{'_id': ObjectId('59c787af88f6636e03100808'),
  'data': {'client_app_version': '2.1',
   'client_os_version': '5.0.1',
   'name': 'sync_pull_list_size',
   'reading': 0.0,
   'ts': 1506248617.116},
  'metadata': {'key': 'stats/client_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1506248617.116},
  'user_id': UUID('850d2923-cc02-4e6c-939e-c762dd3ba189')},
 {'_id': ObjectId('59c787af88f6636e0310080a'),
  'data': {'client_app_version': '2.1',
   'client_os_version': '5.0.1',
   'name': 'sync_push_list_size',
   'reading': 0.0,
   'ts': 1506248617.116},
  'metadata': {'key': 'stats/client_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1506248617.116},
  'user_id': UUID('850d2923-cc02-4e6c-939e-c762dd3ba189')},
 {'_id': ObjectId('59c787af88f6636e0310080c'),
  'data': {'client_app_version': '2.1',
   'client_os_version': '5.0.1',
   'name': 'sync_launched',
   'reading': None,
   'ts': 1506248617.116},
  'metadata': {'key': 'stats/client_nav_event',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1506248617.116},
  'user_id': UUID('850d2923-cc02-4e6c-939e-c762dd3ba189')}]

Timestamp check:

In [72]: arrow.get(1506248617116.0)
ValueError: year 49701 is out of range

In [73]: arrow.get(1506248617.116)
Out[73]: <Arrow [2017-09-24T10:23:37.116000+00:00]>

Applying this fix to the real entries....

In [515]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}).count()
Out[515]: 1717191

In [516]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/server_api_error", "stats/server_api_time"]}, "data.ts": {"$gt": now.timestamp}}).count()               Out[516]: 0

In [517]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/pipeline_error", "stats/pipeline_time"]}, "data.ts": {"$gt": now.timestamp}}).count()
Out[517]: 0

In [518]: for e in edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}):
   .....:     edb.get_timeseries_db().update({"_id": e["_id"]},
   .....:       {"$set": {"data.ts": float(e["data"]["ts"])/1000,
   .....:               "metadata.write_ts": float(e["metadata"]["write_ts"])/1000}})
   .....:

In [519]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}).count()
Out[519]: 0

Fix client conversion issues. Validate fix on local data. Commit fix (b47617f32feca9d01286c994aa06b60a90cd95bd)

verification on client data

In [80]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time"
    ...: , "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.time
    ...: stamp}}).limit(3))
Out[80]: []

In [81]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time"
    ...: , "stats/client_nav_event", "stats/client_error"]}}).limit(3))
Out[81]:
[{'_id': ObjectId('5a4bacbbf6858f499c29ac94'),
  'data': {'client_app_version': '2.1',
   'client_os_version': '8.3',
   'name': 'result_display_failed',
   'reading': None,
   'ts': 1435258760.849617},
  'metadata': {'key': 'stats/client_error',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1435259103.039456},
  'user_id': UUID('b0d937d0-70ef-305e-9563-440369012b39')},
 {'_id': ObjectId('5a4bacbbf6858f499c29ac95'),
  'data': {'client_app_version': '2.1',
   'client_os_version': '8.3',
   'name': 'result_display_failed',
   'reading': None,
   'ts': 1435258798.327425},
  'metadata': {'key': 'stats/client_error',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1435259103.039456},
  'user_id': UUID('b0d937d0-70ef-305e-9563-440369012b39')},
 {'_id': ObjectId('5a4baaa1f6858f499c26508f'),
  'data': {'client_app_version': '2.0.1',
   'client_os_version': '4.4.4',
   'name': 'sync_pull_list_size',
   'reading': 7.0,
   'ts': 1413466406.571},
  'metadata': {'key': 'stats/client_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1413466453.32879},
  'user_id': UUID('8fea63f7-1f4a-382f-b710-118568652329')}]

In [83]: [(arrow.get(e["data"]["ts"]), arrow.get(e["metadata"]["write_ts"])) for e in sa
    ...: mple_client_entries]
Out[83]:
[(<Arrow [2015-06-25T18:59:20.849617+00:00]>,
  <Arrow [2015-06-25T19:05:03.039456+00:00]>),
 (<Arrow [2015-06-25T18:59:58.327425+00:00]>,
  <Arrow [2015-06-25T19:05:03.039456+00:00]>),
 (<Arrow [2014-10-16T13:33:26.571000+00:00]>,
  <Arrow [2014-10-16T13:34:13.328790+00:00]>)]

verification on server data

In [84]: sample_server_entries = list(edb.get_timeseries_db().find({"metadata.key": {"$i
    ...: n": ["stats/server_api_time", "stats/server_api_error"]}}).limit(3))

In [85]: sample_server_entries
Out[85]:
[{'_id': ObjectId('5a4bb12df6858f499c303de5'),
  'data': {'name': 'POST_/tripManager/getUnclassifiedSections',
   'reading': 0.1927809715270996,
   'ts': 1417725020.540355},
  'metadata': {'key': 'stats/server_api_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1417725020.540355},
  'user_id': UUID('3a307244-ecf1-3e6e-a9a7-3aaf101b40fa')},
 {'_id': ObjectId('5a4bb12df6858f499c303de6'),
  'data': {'name': 'POST_/tripManager/getUnclassifiedSections',
   'reading': 0.20390009880065918,
   'ts': 1417725072.799086},
  'metadata': {'key': 'stats/server_api_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1417725072.799086},
  'user_id': UUID('3a307244-ecf1-3e6e-a9a7-3aaf101b40fa')},
 {'_id': ObjectId('5a4bb12df6858f499c303de7'),
  'data': {'name': 'POST_/tripManager/getUnclassifiedSections',
   'reading': 0.21010494232177734,
   'ts': 1417725104.509851},
  'metadata': {'key': 'stats/server_api_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1417725104.509851},
  'user_id': UUID('3a307244-ecf1-3e6e-a9a7-3aaf101b40fa')}]

In [86]: [(arrow.get(e["data"]["ts"]), arrow.get(e["metadata"]["write_ts"])) for e in sample_server_entries]
Out[86]:
[(<Arrow [2014-12-04T20:30:20.540355+00:00]>,
  <Arrow [2014-12-04T20:30:20.540355+00:00]>),
 (<Arrow [2014-12-04T20:31:12.799086+00:00]>,
  <Arrow [2014-12-04T20:31:12.799086+00:00]>),
 (<Arrow [2014-12-04T20:31:44.509851+00:00]>,
  <Arrow [2014-12-04T20:31:44.509851+00:00]>)]

Ran on server, seemed to work

In [999]: entry_1 = edb.get_client_stats_db_backup().find_one()

In [1000]: edb.get_timeseries_db().find_one({"metadata.key": "background/battery", "data.ts": float(entry_1["ts"])/1000})
Out[1000]:
{u'_id': ObjectId('5a4bbc7188f663417cbd343b'),
 u'data': {u'battery_level_pct': 4.0, u'ts': 1413254944.995},
 u'metadata': {u'key': u'background/battery',
  u'platform': u'server',
  u'time_zone': u'America/Los_Angeles',
  u'write_ts': 1413254982.201671},
 u'user_id': UUID('f8fee20c-0f32-359d-ba75-bce97a7ac83b')}

With all the fixes, does the sorting work?

In [1002]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
Out[1002]:
[{u'_id': ObjectId('59a0c57dcb17471ac08bc0c6'),
  u'data': {u'client_app_version': u'2.3.0',
   u'client_os_version': u'7.0',
   u'name': u'sync_duration',
   u'reading': 6.459,
   u'ts': 1503704747.497},
  u'metadata': {u'key': u'stats/client_time',
   u'platform': u'android',
   u'read_ts': 0,
   u'time_zone': u'America/Los_Angeles',
   u'type': u'message',
   u'write_fmt_time': u'2017-08-25T16:45:47.499000-07:00',
   u'write_ts': 1503704747.499},
  u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [1003]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
Out[1003]:
[{u'_id': ObjectId('5a4bc0f688f663417cc52d32'),
  u'data': {u'client_app_version': u'1.0.0',
   u'client_os_version': u'4.4.2',
   u'name': u'sync_duration',
   u'reading': 5638.0,
   u'ts': 1474414.807961},
  u'metadata': {u'key': u'stats/client_time',
   u'platform': u'server',
   u'time_zone': u'America/Los_Angeles',
   u'write_ts': 1474422045.947868},
  u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [1004]: arrow.get(1474422045.947868)
Out[1004]: <Arrow [2016-09-21T01:40:45.947868+00:00]>

Let's cross-check with the client stats

In [1005]: list(edb.get_client_stats_db_backup().find({"user": UUID("96af3842-d5fb-4f13-aea0-726efaeba6ea")}).sort("ts", -1).limit(1))
Out[1005]:
[{u'_id': ObjectId('5819602288f6630624d4b02a'),
  u'client_app_version': u'1.2.0',
  u'client_os_version': u'4.4.2',
  u'reading': u'55.0',
  u'reported_ts': 1478058018.196038,
  u'stat': u'battery_level',
  u'ts': 1478058014.172,
  u'user': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [1007]: arrow.get(1478058014.172)
Out[1007]: <Arrow [2016-11-02T03:40:14.172000+00:00]>

In [1009]: list(edb.get_client_stats_db_backup().find({"user": UUID("96af3842-d5fb-4f13-aea0-726efaeba6ea")}).sort("ts", 1).limit(1))
Out[1009]:
[{u'_id': ObjectId('57e1e51d88f663307e959ca5'),
  u'client_app_version': u'1.0.0',
  u'client_os_version': u'4.4.2',
  u'reading': u'5638',
  u'reported_ts': 1474422045.947868,
  u'stat': u'sync_duration',
  u'ts': 1474414807.961,
  u'user': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [1014]: arrow.get(1474422045.947868)
Out[1014]: <Arrow [2016-09-21T01:40:45.947868+00:00]>

So far, so good. The write_ts works just fine. However, the ts is broken because in this case, the ts is already a second even in the client_stats_db. It looks like the ts is only incorrect iff it is stored as a string.

This code seems to work.

In [91]: for e in edb.get_client_stats_db_backup().find().limit(100):
     e_ts = e['ts']
     if type(e_ts) == str:
         print("string, after converting ms -> sec %s" % arrow.get(float(e_ts)/1000))
     else:
         print("float, no real need to convert %s" % arrow.get(float(e_ts)))

string, after converting ms -> sec 2014-10-14T02:49:04.995000+00:00
string, after converting ms -> sec 2014-10-14T02:49:06.390000+00:00
string, after converting ms -> sec 2014-10-14T02:49:06.390000+00:00
string, after converting ms -> sec 2014-10-14T02:49:06.390000+00:00
....
float, no real need to convert 2016-09-22T11:40:11.586000+00:00
float, no real need to convert 2016-09-22T12:40:13.430000+00:00
float, no real need to convert 2016-09-22T13:40:12.354000+00:00
float, no real need to convert 2016-09-22T14:40:13.595000+00:00
float, no real need to convert 2016-09-22T15:40:12.583000+00:00
float, no real need to convert 2016-09-22T16:40:14.230000+00:00
float, no real need to convert 2016-09-22T17:40:13.844000+00:00

Let's also store the metadata.fmt_time and metadata.local_date fields to be consistent with new data.

shankari commented 6 years ago

Re-ran locally - let's re-confirm. Works!

In [98]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726
    ...: efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_ev
    ...: ent", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
Out[98]:
[{'_id': ObjectId('5a4bef34f6858f4aeaa1cfde'),
  'data': {'client_app_version': '1.2.0',
   'client_os_version': '4.4.2',
   'name': 'sync_launched',
   'reading': None,
   'ts': 1478058014.172},
  'metadata': {'key': 'stats/client_nav_event',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2016-11-01T20:40:18.196038-07:00',
   'write_local_dt': {'day': 1,
    'hour': 20,
    'minute': 40,
    'month': 11,
    'second': 18,
    'timezone': 'America/Los_Angeles',
    'weekday': 1,
    'year': 2016},
   'write_ts': 1478058018.196038},
  'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [99]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726
    ...: efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_ev
    ...: ent", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
Out[99]:
[{'_id': ObjectId('5a4becb8f6858f4aea9fda54'),
  'data': {'client_app_version': '1.0.0',
   'client_os_version': '4.4.2',
   'name': 'sync_duration',
   'reading': 5638.0,
   'ts': 1474414807.961},
  'metadata': {'key': 'stats/client_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2016-09-20T18:40:45.947868-07:00',
   'write_local_dt': {'day': 20,
    'hour': 18,
    'minute': 40,
    'month': 9,
    'second': 45,
    'timezone': 'America/Los_Angeles',
    'weekday': 1,
    'year': 2016},
   'write_ts': 1474422045.947868},
  'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [100]: arrow.get(1474414807.961)
Out[100]: <Arrow [2016-09-20T23:40:07.961000+00:00]>

In [101]: arrow.get(1478058014.172)
Out[101]: <Arrow [2016-11-02T03:40:14.172000+00:00]>

Note that the oldest entry is from 2016-09-20. I suspect that is because I re-created the UUIDs at that time and didn't change the uuids in the stats databases, only the timeseries. Let's confirm.

The oldest entries overall are from 2015.

In [107]: first_10_entries = list(edb.get_timeseries_db().find({"metadata.key": {"$in":
     ...: ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort(
     ...: "data.ts", 1).limit(10))

In [109]: [e["metadata"]["write_fmt_time"] for e in first_10_entries]
Out[109]:
['2015-06-03T09:57:27.061417-07:00',
 '2015-06-03T09:57:27.061417-07:00',
 '2015-06-09T09:42:59.953313-07:00',
 '2015-06-09T09:42:59.953313-07:00',
 '2015-06-11T07:19:21.271020-07:00',
 '2015-06-11T07:19:21.271020-07:00',
 '2015-06-09T09:42:59.953313-07:00',
 '2015-06-11T07:19:21.271020-07:00',
 '2015-06-03T09:57:27.061417-07:00',
 '2015-06-03T09:57:27.061417-07:00']

The corresponding UUID is

In [114]: [e["user_id"] for e in first_10_entries]
Out[114]:
[UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'),
...

which doesn't exist in the uuid_db

In [1019]: edb.get_uuid_db().find_one({"uuid": UUID("077bbe67-7693-3a09-9b3c-c8ba935e46ee")})

In [1020]:

In contrast, if we try the last 10 entries

In [115]: [e["user_id"] for e in last_10_entries]
Out[115]:
[UUID('3d37573b-2e74-496d-a09d-a0a3f05c2467'),
 UUID('6560950c-4ddb-41fc-8801-b9197d30f54d'),
 UUID('365fc61a-033c-49ec-8f47-6fc5fbc36151'),
 UUID('4cd90e57-a116-448f-8cba-6a48f68fb554'),
 UUID('8f1a6737-e8ee-4769-a1a4-da1948d886b6'),
 UUID('8f1a6737-e8ee-4769-a1a4-da1948d886b6'),
 UUID('8f1a6737-e8ee-4769-a1a4-da1948d886b6'),
 UUID('8f1a6737-e8ee-4769-a1a4-da1948d886b6'),
 UUID('1ad0a2d9-10c3-47d4-9b5e-36cded2d563d'),
 UUID('1d2c74be-3480-4137-b708-9789f1489d1b')]

They are very much there.

In [1020]: edb.get_uuid_db().find_one({"uuid": UUID("3d37573b-2e74-496d-a09d-a0a3f05c2467")})
Out[1020]:
{u'_id': ObjectId('543c8a2239e59673fd9fb9dd'),
 u'update_ts': datetime.datetime(2016, 7, 27, 16, 32, 52, 494000),
 u'user_email': u'<redacted>',
 u'uuid': UUID('3d37573b-2e74-496d-a09d-a0a3f05c2467')}

shankari commented 6 years ago

ok so the first n entries will not exist in the uuid db and cannot be retrieved without a bunch of additional work. It is not clear that we need that mapping anyway since we will primarily use this data to evaluate performance. And we can defer the additional work to if/when we actually need it. This has already taken up a bunch of time!

shankari commented 6 years ago

ok so let's restart mongodb with the new data directory.

$ grep -i dbpath /etc/mongod.conf
# dbpath=/mnt/e-mission-primary-db/mongodb
dbpath=/mnt/e-mission-primary-db/mongodb.restore

$ sudo service mongod restart

As expected, there is no data.

In [1023]: edb.get_uuid_db().find().count()
Out[1023]: 0

In [1024]: edb.get_timeseries_db().find().count()
Out[1024]: 0

In [1025]: edb.get_analysis_timeseries_db().find().count()
Out[1025]: 0

Now start restoring. Took much longer than the dump.

ubuntu@ip-10-203-173-119:~$ mongorestore /transfer/original-jan-1
connected to: 127.0.0.1
2018-01-03T00:20:32.239+0000 /transfer/original-jan-1/Backup_database/Stage_Sections.bson
2018-01-03T05:00:46.118+0000    Creating index: { key: { _id: 1 }, name: "_id_", ns: "Stage_database.Stage_section_new" }

Now, we dump client and server stats values from the stats database for the test phones. First we need to get the bin values for the uuids.

> db.Stage_uuids.find({"user_email": "ucb.sdb.android.1@gmail.com"})
{ "_id" : ObjectId("565872797d65cb39ee9e89aa"), "update_ts" : ISODate("2017-08-30T00:36:29.359Z"), "uuid" : BinData(3,"5HFxHr0UPb6Atpx9kuzClg=="), "user_email" : "ucb.sdb.android.1@gmail.com" }
> db.Stage_uuids.find({"user_email": "ucb.sdb.android.2@gmail.com"})
{ "_id" : ObjectId("565872827d65cb39ee9e89ac"), "update_ts" : ISODate("2017-08-04T20:31:27.378Z"), "uuid" : BinData(3,"/XtMLiyLO/qU8NHj7L1ftw=="), "user_email" : "ucb.sdb.android.2@gmail.com" }
> db.Stage_uuids.find({"user_email": "ucb.sdb.android.3@gmail.com"})
{ "_id" : ObjectId("565872877d65cb39ee9e89ae"), "update_ts" : ISODate("2017-08-28T21:50:47.581Z"), "uuid" : BinData(3,"hoQsNdooMu2pDi2mZjxccw=="), "user_email" : "ucb.sdb.android.3@gmail.com" }
> db.Stage_uuids.find({"user_email": "ucb.sdb.androi.4@gmail.com"})
{ "_id" : ObjectId("56fad3dd1268798022665a00"), "update_ts" : ISODate("2017-08-30T00:15:17.402Z"), "uuid" : BinData(3,"O8D5H3ZgNKKwBVw5lZijaQ=="), "user_email" : "ucb.sdb.androi.4@gmail.com" }
> db.Stage_uuids.find({"user_email": "ucb.sdb.iphone.1@gmail.com"})
{ "_id" : ObjectId("5530d20e077c9e50eb0aa61b"), "update_ts" : ISODate("2017-08-25T22:31:33.765Z"), "uuid" : BinData(3,"B54PGsRAPXyw594WD3SONQ=="), "user_email" : "ucb.sdb.iphone.1@gmail.com" }
> db.Stage_uuids.find({"user_email": "ucb.sdb.iphone.2@gmail.com"})
{ "_id" : ObjectId("56555db67d65cb39ee9e1299"), "update_ts" : ISODate("2017-10-01T04:21:49.675Z"), "uuid" : BinData(3,"x2oEh35aOxekSUe+Zms29g=="), "user_email" : "ucb.sdb.iphone.2@gmail.com" }
> db.Stage_uuids.find({"user_email": "cub.sdb.iphone.3@gmail.com"})
{ "_id" : ObjectId("565942347d65cb39ee9eab29"), "update_ts" : ISODate("2017-10-01T04:26:21.266Z"), "uuid" : BinData(3,"xSi80qiLPoK+Yu9PI5aWeg=="), "user_email" : "cub.sdb.iphone.3@gmail.com" }
> db.Stage_uuids.find({"user_email": "ucb.sdb.iphone.4@gmail.com"})
{ "_id" : ObjectId("56faec5a126879802266adc5"), "update_ts" : ISODate("2017-10-01T04:24:58.701Z"), "uuid" : BinData(3,"lecHJ6BOPjO3/jSrGRlPiw=="), "user_email" : "ucb.sdb.iphone.4@gmail.com" }
> db.Stage_uuids.find({"user_email": "nexus7itu01@gmail.com"})
{ "_id" : ObjectId("57e4e958cb17471ac05d7bef"), "update_ts" : ISODate("2016-09-23T08:35:36.545Z"), "uuid" : BinData(3,"cJaAaNulQGyOJgm1SNoOSw=="), "user_email" : "nexus7itu01@gmail.com" }
> db.Stage_uuids.find({"user_email": "nexus7itu02@gmail.com"})
{ "_id" : ObjectId("57e4e888cb17471ac05d7bed"), "update_ts" : ISODate("2016-09-23T08:32:08.502Z"), "uuid" : BinData(3,"ZWFDH9TBTg+UiasRkDQftw=="), "user_email" : "nexus7itu02@gmail.com" }
> db.Stage_uuids.find({"user_email": "motoeitu01@gmail.com"})
{ "_id" : ObjectId("57e11e84cb17471ac05a01e0"), "update_ts" : ISODate("2016-09-20T11:33:24.528Z"), "uuid" : BinData(3,"ks9YQK9ZQAyrcrqz3N94GA=="), "user_email" : "motoeitu01@gmail.com" }
> db.Stage_uuids.find({"user_email": "motoeitu02@gmail.com"})
{ "_id" : ObjectId("57e14471cb17471ac05a0690"), "update_ts" : ISODate("2016-09-20T14:15:13.443Z"), "uuid" : BinData(3,"k+ihzDIfT6mMPEaShmjkXQ=="), "user_email" : "motoeitu02@gmail.com" }

Next, we mongodump based on the bin values.

client:

$ mongodump -d Stage_database -c Stage_client_stats  -o /transfer/public_phone_stats -q '{"user": {"$in": [{"$binary": "5HFxHr0UPb6Atpx9kuzClg==", "$type":"03"}, {"$binary": "/XtMLiyLO/qU8NHj7L1ftw==", "$type":"03"}, {"$binary": "hoQsNdooMu2pDi2mZjxccw==", "$type":"03"}, {"$binary": "O8D5H3ZgNKKwBVw5lZijaQ==", "$type":"03"}, {"$binary": "B54PGsRAPXyw594WD3SONQ==", "$type":"03"}, {"$binary": "x2oEh35aOxekSUe+Zms29g==", "$type":"03"}, {"$binary": "xSi80qiLPoK+Yu9PI5aWeg==", "$type":"03"}, {"$binary": "lecHJ6BOPjO3/jSrGRlPiw==", "$type":"03"}, {"$binary": "cJaAaNulQGyOJgm1SNoOSw==", "$type":"03"}, {"$binary": "ZWFDH9TBTg+UiasRkDQftw==", "$type":"03"}, {"$binary": "ks9YQK9ZQAyrcrqz3N94GA==", "$type":"03"}, {"$binary": "k+ihzDIfT6mMPEaShmjkXQ==", "$type":"03"}]}}'
connected to: 127.0.0.1
2018-01-03T06:06:52.363+0000 DATABASE: Stage_database    to   /transfer/public_phone_stats/Stage_database
2018-01-03T06:06:52.363+0000    Stage_database.Stage_client_stats to /transfer/public_phone_stats/Stage_database/Stage_client_stats.bson
2018-01-03T06:06:52.782+0000             72599 documents
2018-01-03T06:06:52.782+0000    Metadata for Stage_database.Stage_client_stats to /transfer/public_phone_stats/Stage_database/Stage_client_stats.metadata.json

server:

$ mongodump -d Stage_database -c Stage_server_stats  -o /transfer/public_phone_stats -q '{"user": {"$in": [{"$binary": "5HFxHr0UPb6Atpx9kuzClg==", "$type":"03"}, {"$binary": "/XtMLiyLO/qU8NHj7L1ftw==", "$type":"03"}, {"$binary": "hoQsNdooMu2pDi2mZjxccw==", "$type":"03"}, {"$binary": "O8D5H3ZgNKKwBVw5lZijaQ==", "$type":"03"}, {"$binary": "B54PGsRAPXyw594WD3SONQ==", "$type":"03"}, {"$binary": "x2oEh35aOxekSUe+Zms29g==", "$type":"03"}, {"$binary": "xSi80qiLPoK+Yu9PI5aWeg==", "$type":"03"}, {"$binary": "lecHJ6BOPjO3/jSrGRlPiw==", "$type":"03"}, {"$binary": "cJaAaNulQGyOJgm1SNoOSw==", "$type":"03"}, {"$binary": "ZWFDH9TBTg+UiasRkDQftw==", "$type":"03"}, {"$binary": "ks9YQK9ZQAyrcrqz3N94GA==", "$type":"03"}, {"$binary": "k+ihzDIfT6mMPEaShmjkXQ==", "$type":"03"}]}}'
connected to: 127.0.0.1
2018-01-03T06:08:11.594+0000 DATABASE: Stage_database    to   /transfer/public_phone_stats/Stage_database
2018-01-03T06:08:11.595+0000    Stage_database.Stage_server_stats to /transfer/public_phone_stats/Stage_database/Stage_server_stats.bson
2018-01-03T06:08:11.889+0000             52593 documents
2018-01-03T06:08:11.889+0000    Metadata for Stage_database.Stage_server_stats to /transfer/public_phone_stats/Stage_database/Stage_server_stats.metadata.json

Pull the data locally and load it to confirm that we have the relevant data. Confirmed.

In [3]: len(edb.get_client_stats_db_backup().find().distinct("user"))
Out[3]: 12

In [4]: len(edb.get_server_stats_db_backup().find().distinct("user"))
Out[4]: 12

shankari commented 6 years ago

Ok so now back to cleaning up data (based on https://github.com/e-mission/e-mission-server/issues/530#issuecomment-354711521).

This time, we port the stats first, so that they will be deleted for the public phones, and we can delete the stats collections when we are done.

$ ./e-mission-py.bash bin/historical/migrations/stats_from_db_to_ts.py

Convert all ms to secs before validation.

for e in edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}):
         edb.get_timeseries_db().update({"_id": e["_id"]},
                 {"$set": {"data.ts": float(e["data"]["ts"])/1000,
                   "metadata.write_ts": float(e["metadata"]["write_ts"])/1000}})

validate

messed up entries have been fixed

In [1039]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$gt": now.timestamp}}).count()
Out[1039]: 0

Correlation between stats db and timeseries db exists and timestamp is valid

 In [1040]: entry_1 = edb.get_client_stats_db_backup().find_one()

 In [1041]: edb.get_timeseries_db().find_one({"metadata.key": "background/battery", "data.ts": float(entry_1["ts"])/1000})
 Out[1041]:
 {u'_id': ObjectId('5a4c7b5c88f663668630d290'),
 u'data': {u'battery_level_pct': 4.0, u'ts': 1413254944.995},
 u'metadata': {u'key': u'background/battery',
 u'platform': u'server',
 u'time_zone': u'America/Los_Angeles',
 u'write_fmt_time': u'2014-10-13T19:49:42.201671-07:00',
 u'write_ts': 1413254982.201671},
 u'user_id': UUID('f8fee20c-0f32-359d-ba75-bce97a7ac83b')}

In [1044]: arrow.get(1413254944.995)
Out[1044]: <Arrow [2014-10-14T02:49:04.995000+00:00]>

Sorting in descending order works and timestamps are valid

In [1042]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
Out[1042]:
[{u'_id': ObjectId('59a0c57dcb17471ac08bc0c6'),
  u'data': {u'client_app_version': u'2.3.0',
   u'client_os_version': u'7.0',
   u'name': u'sync_duration',
   u'reading': 6.459,
   u'ts': 1503704747.497},
  u'metadata': {u'key': u'stats/client_time',
   u'platform': u'android',
   u'read_ts': 0,
   u'time_zone': u'America/Los_Angeles',
   u'type': u'message',
   u'write_fmt_time': u'2017-08-25T16:45:47.499000-07:00',
   u'write_ts': 1503704747.499},
  u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [1045]: arrow.get(1503704747.497)
Out[1045]: <Arrow [2017-08-25T23:45:47.497000+00:00]>

In [1047]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", -1).limit(1))
Out[1047]:
[{u'_id': ObjectId('59efe782cb17471ac0cecb5a'),
  u'data': {u'client_app_version': u'2.4.0',
   u'client_os_version': u'10.3.3',
   u'name': u'sync_launched',
   u'reading': -1,
   u'ts': 1508894590.151824},
  u'metadata': {u'key': u'stats/client_nav_event',
   u'platform': u'ios',
   u'plugin': u'none',
   u'read_ts': 0,
   u'time_zone': u'America/Los_Angeles',
   u'type': u'message',
   u'write_fmt_time': u'2017-10-24T18:23:10.152289-07:00',
   u'write_local_dt': {u'day': 24,
    u'hour': 18,
    u'minute': 23,
    u'month': 10,
    u'second': 10,
    u'timezone': u'America/Los_Angeles',
    u'weekday': 1,
    u'year': 2017},
   u'write_ts': 1508894590.152289},
  u'user_id': UUID('7161343e-551e-4213-be75-3b82e1ce2448')}]

In [1048]: arrow.get(1508894590.151824)
Out[1048]: <Arrow [2017-10-25T01:23:10.151824+00:00]>

Sorting in ascending order works but one of the timestamps is weird.

In [1043]: list(edb.get_timeseries_db().find({"user_id": UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'), "metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
Out[1043]:
[{u'_id': ObjectId('5a4c829288f663668638cb87'),
  u'data': {u'client_app_version': u'1.0.0',
   u'client_os_version': u'4.4.2',
   u'name': u'sync_duration',
   u'reading': 5638.0,
   u'ts': 1474414807.961},
  u'metadata': {u'key': u'stats/client_time',
   u'platform': u'server',
   u'time_zone': u'America/Los_Angeles',
   u'write_fmt_time': u'2016-09-20T18:40:45.947868-07:00',
   u'write_local_dt': {u'day': 20,
    u'hour': 18,
    u'minute': 40,
    u'month': 9,
    u'second': 45,
    u'timezone': u'America/Los_Angeles',
    u'weekday': 1,
    u'year': 2016},
   u'write_ts': 1474422045.947868},
  u'user_id': UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea')}]

In [1046]: arrow.get(1474414807.961)
Out[1046]: <Arrow [2016-09-20T23:40:07.961000+00:00]>

In [1049]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}}).sort("data.ts", 1).limit(1))
Out[1049]:
[{u'_id': ObjectId('5a4c7e6b88f663668633e5dd'),
  u'data': {u'client_app_version': u'2.1',
   u'client_os_version': u'4.4.2',
   u'name': u'confirmlist_auth_not_done',
   u'reading': None,
   u'ts': 315965026.452},
  u'metadata': {u'key': u'stats/client_nav_event',
   u'platform': u'server',
   u'time_zone': u'America/Los_Angeles',
   u'write_fmt_time': u'2015-06-03T09:57:27.061417-07:00',
   u'write_local_dt': {u'day': 3,
    u'hour': 9,
    u'minute': 57,
    u'month': 6,
    u'second': 27,
    u'timezone': u'America/Los_Angeles',
    u'weekday': 2,
    u'year': 2015},
   u'write_ts': 1433350647.061417},
  u'user_id': UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}]

In [1050]: arrow.get(315965026.452)
Out[1050]: <Arrow [1980-01-06T00:03:46.452000+00:00]>

what is going on with this?

This is not just an invalid conversion, though, because trying to convert it back to seconds does not work.

```
In [1051]: arrow.get(315965026452)
ValueError: year is out of range
```

There are 13 such entries and they are all from user UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')

In [1067]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$lt": ts_2000.timestamp}}).count()
Out[1067]: 13

In [1070]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
Out[1070]: [UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')]

There are 131 entries for this user in the client stats DB

In [1071]: edb.get_client_stats_db_backup().find({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}).count()
Out[1071]: 151

Let's try to match based on the reported_ts. Bingo! The entry does indeed have an invalid ts.

In [1073]: edb.get_client_stats_db_backup().find_one({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'), "reported_ts":  1433350647.061417})
Out[1073]:
{u'_id': ObjectId('556f31f788f6636f49a1b05a'),
 u'client_app_version': u'2.1',
 u'client_os_version': u'4.4.2',
 u'reading': u'0.0',
 u'reported_ts': 1433350647.061417,
 u'stat': u'battery_level',
 u'ts': u'315965055221',
 u'user': UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')}

In [1075]: arrow.get(315965055221)
ValueError: year is out of range

There are a bunch of other entries with the same user and reported_ts, but fewer than the entries reported in the timeseries.

In [1074]: edb.get_client_stats_db_backup().find({"user": UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'), "reported_ts":  1433350647.061417}).count()
Out[1074]: 37

I bet the others are battery_level, similar to the above.

In [1076]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).count()
Out[1076]: 762

In [1077]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
Out[1077]:
[UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee'),
 UUID('99b29da2-989d-4e54-a211-284bde1d362d'),
 UUID('693a42f1-6f00-497e-8b40-a8339fd5af8d'),
 UUID('3d37573b-2e74-496d-a09d-a0a3f05c2467'),
 UUID('ee6dadba-53b6-421f-94d1-27bc96e023cf'),
 UUID('0109c47b-e640-411e-8d19-e481c52d7130'),
 UUID('6560950c-4ddb-41fc-8801-b9197d30f54d'),
 UUID('b82804b8-4e49-43a0-99d1-9d1da20ec1d3'),
 UUID('9d906275-8072-42d4-8dd2-3670e63e0f6e'),
 UUID('6af5afdf-b1d9-4ea7-9f10-2bddb8a0ecb3'),
 UUID('6a415e67-9025-4f29-b520-f0c5a43c8bb6'),
 UUID('a61349c6-0cc9-4902-9f13-d4236a630ad5'),
 UUID('cfbf03dc-6e3e-40bd-90de-d19d14613e47'),
 UUID('be47f46a-ce3a-4ad8-b81f-d3daa7955e95'),
 UUID('5109e62d-2152-481b-8d26-2cb8d8cc1f23'),
 UUID('f14272fe-1433-4430-b1ec-3f37dfdde5bf'),
 UUID('abf4ed3a-a018-4f40-90c5-39b592b8569b'),
 UUID('de23cac9-1996-4af5-8554-4f6d017b3459'),
 UUID('96af3842-d5fb-4f13-aea0-726efaeba6ea'),
 UUID('08b31565-f990-4d15-a4a7-89b3ba6b1340'),
 UUID('5a6a2711-c574-42f0-9940-ea1fd0cc2f09'),
 UUID('29277dd4-dc78-40c0-806c-f88a4f902436'),
 UUID('6ed1b36d-08a9-403d-b247-e426228c0492'),
 UUID('d2b923b9-68b9-4e88-9b8c-29416694efb1'),
 UUID('e82b1c5a-7c07-46b7-afd7-b53ac9db1f42'),
 UUID('dcdb5f74-071a-4e5b-a954-e613c5b46e5d'),
 UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0'),
 UUID('cd6482fe-56a2-4bf8-b8a8-d74f6e3c22c8'),
 UUID('3ca88f7c-fb1a-467e-9e29-99909d92c904')]

In [1078]: edb.get_timeseries_db().find({"data.ts": {"$lt": ts_2000.timestamp}}).distinct("metadata.key")
Out[1078]:
[u'stats/client_time',
 u'background/battery',
 u'stats/client_nav_event',
 u'statemachine/transition',
 u'background/location',
 u'background/filtered_location']

In [1080]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}).count()
Out[1080]: 744

In [1082]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("metadata.platform")
Out[1082]: [u'android']

Actually, no. There are a lot more, including a lot of non-stat entries, and they are all on android. Going to verify some of this manually and then move on. So they are an interesting mix of some kind of weird timestamp that is broken for both ts and write_ts and entries where the location apparently had ts = 0.

>>> list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["statemachine/transition", "background/location", "background/filtered_location"]}, "data.ts": {"$lt": ts_2000.timestamp}}, {"user_id":1, "data.ts": 1, "data.fmt_time": 1, "metadata.write_ts": 1, "metadata.write_fmt_time": 1}).limit(10))

[{u'_id': ObjectId('56c495afeaedff78c762a711'),
  u'data': {u'fmt_time': u'1970-03-12T00:27:13+08:00', u'ts': 6020833},
  u'metadata': {u'write_fmt_time': u'1970-03-12T00:27:13+08:00',
   u'write_ts': 6020833},
  u'user_id': UUID('99b29da2-989d-4e54-a211-284bde1d362d')},
 {u'_id': ObjectId('56c495afeaedff78c762a710'),
  u'data': {u'fmt_time': u'1970-03-11T14:25:57+08:00', u'ts': 5984757},
  u'metadata': {u'write_fmt_time': u'1970-03-11T14:25:57+08:00',
   u'write_ts': 5984757},
  u'user_id': UUID('99b29da2-989d-4e54-a211-284bde1d362d')},
 {u'_id': ObjectId('575a6423383999ecb7a5e183'),
  u'data': {u'fmt_time': u'1969-12-31T16:00:00-08:00', u'ts': 0},
  u'metadata': {u'write_fmt_time': u'2016-06-09T23:43:11.391000-07:00',
   u'write_ts': 1465540991.391},
  u'user_id': UUID('693a42f1-6f00-497e-8b40-a8339fd5af8d')},
 {u'_id': ObjectId('5760c4dc383999ecb7a98155'),
  u'data': {u'fmt_time': u'1969-12-31T16:00:00-08:00', u'ts': 0},
  u'metadata': {u'write_fmt_time': u'2016-06-14T16:23:03.521000-07:00',
   u'write_ts': 1465946583.521},

At some point, I should go through and throw out all this data. But it is a small amount of data and can wait.

And to complete the exploration, all the broken stats are from the same user.

In [1089]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["stats/client_time", "stats/client_nav_event", "stats/client_error", "background/battery"]}, "data.ts": {"$lt": ts_2000.timestamp}}).distinct("user_id")
Out[1089]: [UUID('077bbe67-7693-3a09-9b3c-c8ba935e46ee')]

Next, remove all test phone data, including pipeline states. Do NOT remove USP open data because there is data outside the open range too, at least for one user (me!)

State before

In [1053]: edb.get_uuid_db().find().count()
Out[1053]: 516

In [1054]: edb.get_timeseries_db().find().count()
Out[1054]: 54818879

In [1055]: edb.get_analysis_timeseries_db().find().count()
Out[1055]: 16573823

In [1056]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
Out[1056]: 487

Running the script!

$ ./e-mission-py.bash bin/debug/purge_multi_timeline_for_range.py --pipeline-purge /tmp/public_data/dump_
INFO:root:Loading file or prefix /tmp/public_data/dump_
INFO:root:Found 12 matching files for prefix /tmp/public_data/dump_
INFO:root:files are ['/tmp/public_data/dump_fd7b4c2e-2c8b-3bfa-94f0-d1e3ecbd5fb7.gz', '/tmp/public_data/dump_6561431f-d4c1-4e0f-9489-ab1190341fb7.gz', '/tmp/public_data/dump_079e0f1a-c440-3d7c-b0e7-de160f748e35.gz', '/tmp/public_data/dump_92cf5840-af59-400c-ab72-bab3dcdf7818.gz', '/tmp/public_data/dump_3bc0f91f-7660-34a2-b005-5c399598a369.gz'] ... ['/tmp/public_data/dump_95e70727-a04e-3e33-b7fe-34ab19194f8b.gz', '/tmp/public_data/dump_c528bcd2-a88b-3e82-be62-ef4f2396967a.gz', '/tmp/public_data/dump_70968068-dba5-406c-8e26-09b548da0e4b.gz', '/tmp/public_data/dump_93e8a1cc-321f-4fa9-8c3c-46928668e45d.gz']
INFO:root:==================================================
INFO:root:Deleting data from file /tmp/public_data/dump_fd7b4c2e-2c8b-3bfa-94f0-d1e3ecbd5fb7.gz
...
INFO:root:For uuid = e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the timeseries
INFO:root:result = {u'ok': 1, u'n': 3923919}
INFO:root:For uuid = e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the analysis_timeseries
INFO:root:result = {u'ok': 1, u'n': 39361}
INFO:root:For uuid e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the user_db
INFO:root:result = {u'ok': 1, u'n': 1}
INFO:root:For uuid e471711e-bd14-3dbe-80b6-9c7d92ecc296, deleting entries from the pipeline_state_db
INFO:root:result = {u'ok': 1, u'n': 12}

State after

In [1057]: edb.get_uuid_db().find().count()
Out[1057]: 504

In [1058]: edb.get_timeseries_db().find().count()
Out[1058]: 39797029

In [1059]: edb.get_analysis_timeseries_db().find().count()
Out[1059]: 16190498

In [1060]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
Out[1060]: 475

Next, remove all config documents from the usercache

In [1061]: edb.get_usercache_db().find({"metadata.type": "document", "metadata.key": {"$in": ['config/consent', 'config/sensor_config', 'config/sync_config']}}).count()
Out[1061]: 239

In [1062]: edb.get_usercache_db().remove({"metadata.type": "document", "metadata.key": {"$in": ['config/consent', 'config/sensor_config', 'config/sync_config']}})
Out[1062]: {u'n': 239, u'ok': 1}

Next, re-remove unused collections. This time, since we have migrated all stats, we can remove those databases as well.

In [1090]: edb.get_alternatives_db().remove()
Out[1090]: {u'n': 101146, u'ok': 1}

In [1091]: edb.get_client_db().remove()
Out[1091]: {u'n': 3, u'ok': 1}

In [1092]: edb.get_common_place_db().remove()
Out[1092]: {u'n': 0, u'ok': 1}

In [1093]: edb.get_common_trip_db().remove()
Out[1093]: {u'n': 0, u'ok': 1}

In [1094]: edb.get_pending_signup_db().remove()
Out[1094]: {u'n': 25, u'ok': 1}

In [1095]: edb._get_current_db().Stage_place.remove()
Out[1095]: {u'n': 0, u'ok': 1}

In [1096]: edb.get_routeCluster_db().remove()
Out[1096]: {u'n': 90, u'ok': 1}

In [1097]: edb._get_current_db().Stage_routeDistanceMatrix.remove()
Out[1097]: {u'n': 7, u'ok': 1}

In [1098]: edb._get_current_db().Stage_section_new.remove()
Out[1098]: {u'n': 0, u'ok': 1}

In [1099]: edb._get_current_db().Stage_stop.remove()
Out[1099]: {u'n': 0, u'ok': 1}

In [1100]: edb._get_current_db().Stage_trip_new.remove()
Out[1100]: {u'n': 0, u'ok': 1}

In [1101]: edb._get_current_db().Stage_user_moves_access.remove()
Out[1101]: {u'n': 118, u'ok': 1}

In [1102]: edb._get_current_db().Stage_utility_models.remove()
Out[1102]: {u'n': 36, u'ok': 1}

In [1103]: edb._get_current_db().Stage_Worktime.remove()
Out[1103]: {u'n': 2662, u'ok': 1}

In [1104]: edb.get_client_stats_db_backup().remove()
Out[1104]: {u'n': 650961, u'ok': 1}

In [1105]: edb.get_server_stats_db_backup().remove()
Out[1105]: {u'n': 449523, u'ok': 1}

Ok! I think we are done! There's plenty of room on the transfer disk, so let's just create a new dump and keep the old dump as backup.

/dev/xvdg         296G   53G  228G  19% /transfer

$ mongodump --out /transfer/cleanedup-jan-3
2018-01-03T18:44:29.016+0000    Test_database.Test_Set to /transfer/cleanedup-jan-3/Test_database/Test_Set.bson
2018-01-03T18:44:29.018+0000             1 documents
2018-01-03T18:44:29.019+0000    Metadata for Test_database.Test_Set to /transfer/cleanedup-jan-3/Test_database/Test_Set.metadata.json
2018-01-03T18:44:29.019+0000 DATABASE: admin     to     /transfer/cleanedup-jan-3/admin

Dump is done!

Remaining steps:

attach volume to new server
mongorestore
re-run analysis pipeline
DONE!!!

shankari commented 6 years ago

Re-attach the volume to the new stack

Unmount from current stack
```
$ sudo umount /transfer
```
Snapshot
Create new volume in the correct region
Attach new volume to the database from the new stack

Mount the new volume

$ sudo mkdir -p /transfer
$ sudo chown ec2-user:ec2-user /transfer/
$ sudo mount /dev/xvdi /transfer/
$ ls /transfer
cleanedup-jan-3  lost+found  odc-usp-2017  original-jan-1  public_phone_stats

Restore

Validate

On old server

In [521]: edb.get_uuid_db().find().count()
Out[521]: 504

In [522]: edb.get_timeseries_db().find().count()
Out[522]: 39797029

In [523]: edb.get_analysis_timeseries_db().find().count()
Out[523]: 16190498

In [524]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
Out[524]: 475

In [525]: edb.get_usercache_db().find().count()
Out[525]: 10011104

On new server

In [2]: edb.get_uuid_db().find().count()
Out[2]: 504

In [3]: edb.get_timeseries_db().find().count()
Out[3]: 39797135

In [4]: edb.get_analysis_timeseries_db().find().count()
Out[4]: 16190498

In [5]: len(edb.get_pipeline_state_db().find().distinct('user_id'))
Out[5]: 475

In [6]: edb.get_usercache_db().find().count()
Out[6]: 10011104

Check our favourite users

In [7]: edb.get_uuid_db().find_one({"user_email": "shankari@berkeley.edu"})
Out[7]:
{'_id': ObjectId('54a6bdfd39e59673fd9fba5b'),
 'update_ts': datetime.datetime(2017, 8, 20, 2, 29, 50, 275000),
 'user_email': '<shankari_email>',
 'uuid': UUID('<shankari_uuid>')}

In [10]: edb.get_uuid_db().find({"user_id": UUID('<shankari_uuid>')}).count()
Out[10]: 0

In [11]: edb.get_timeseries_db().find({"user_id": UUID('<shankari_uuid>')}).count()
Out[11]: 735156

In [12]: edb.get_analysis_timeseries_db().find({"user_id": UUID('<shankari_uuid>')}).count()
Out[12]: 166571

In [23]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[23]:
[{'_id': ObjectId('5614ee7d88f663584fa03131'),
  'data': {'_id': ObjectId('5614ee7d88f663584fa03131'),
   'exit_fmt_time': '2015-08-21T18:06:16.905000-07:00',
   'exit_ts': 1440205576.905,
   'location': {'coordinates': [-122.4426899, 37.7280596], 'type': 'Point'},
   'starting_trip': ObjectId('5614ee7d88f663584fa03132'),
   'user_id': UUID('0763de67-f61e-3f5d-90e7-518e69793954')},
  'metadata': {'key': 'segmentation/raw_place',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2016-04-25T06:32:01.332099-07:00',
   'write_ts': 1461591121.332099},
  'user_id': UUID('<shankari_uuid>')}]

In [24]: list(edb.get_analysis_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[24]:
[{'_id': ObjectId('57e962fa88f66347503059e7'),
  'data': {'exit_fmt_time': '2015-07-13T15:25:56.852000-07:00',
   'exit_ts': 1436826356.852,
   'location': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
   'source': 'DwellSegmentationTimeFilter',
   'starting_trip': ObjectId('57e962fa88f66347503059e8')},
  'metadata': {'key': 'segmentation/raw_place',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2016-09-26T11:03:38.105793-07:00',
   'write_ts': 1474913018.105793},
  'user_id': UUID('<shankari_uuid>')}]

In [25]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
Out[25]:
[{'_id': ObjectId('581becd188f6630386d15ac5'),
  'data': {'battery_level_pct': 98.0, 'ts': 1478225037729.0},
  'metadata': {'key': 'background/battery',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1478225037729.0},
  'user_id': UUID('<shankari_uuid>')}]

In [26]: list(edb.get_analysis_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
Out[26]:
[{'_id': ObjectId('59eed4b188f66334694bfcb2'),
  'data': {'altitude': 1.0,
   'distance': 2.8965257248272045,
   'fmt_time': '2017-10-23T17:05:41-07:00',
   'heading': 31.494464698609224,
   'idx': 27,
   'latitude': 37.3909994,
   'loc': {'coordinates': [-122.0864596, 37.3909994], 'type': 'Point'},
   'longitude': -122.0864596,
   'mode': 0,
   'section': ObjectId('59eed4b188f66334694bfc96'),
   'speed': 0.12344985580418566,
   'ts': 1508803541.0},
  'metadata': {'key': 'analysis/recreated_location',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2017-10-23T22:50:41.916285-07:00',
   'write_ts': 1508824241.916285},
  'user_id': UUID('<shankari_uuid>')}]

Tom

```
In [16]: tom_entry = edb.get_uuid_db().find_one({"user_email": "tamtom2000@gmail.com"})

In [17]: tom_entry
Out[17]:
{'_id': ObjectId('543c8a2239e59673fd9fb9dc'),
 'update_ts': datetime.datetime(2017, 5, 6, 21, 29, 3, 780000),
 'user_email': '<tom_email>',
 'uuid': UUID('<tom_uuid>')}

In [18]: edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).count()
Out[18]: 592479

In [19]: edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).count()
Out[19]: 139926

In [27]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[27]:
[{'_id': ObjectId('564ed10488f66311474836bd'),
  'data': {'_id': ObjectId('564ed10488f66311474836bd'),
   'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
   'exit_ts': 1437465390.414,
   'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
   'starting_trip': ObjectId('564ed10488f66311474836be'),
   'user_id': UUID('b0d937d0-70ef-305e-9563-440369012b39')},
  'metadata': {'key': 'segmentation/raw_place',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2016-04-25T06:32:01.391767-07:00',
   'write_ts': 1461591121.391767},
  'user_id': UUID('<tom_uuid>')}]

In [41]: list(edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[41]:
[{'_id': ObjectId('57e9906b88f66347503240e7'),
  'data': {'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
   'exit_ts': 1437465390.414,
   'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},
   'source': 'DwellSegmentationTimeFilter',
   'starting_trip': ObjectId('57e9906b88f66347503240e8')},
  'metadata': {'key': 'segmentation/raw_place',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2016-09-26T14:17:31.475597-07:00',
   'write_ts': 1474924651.475597},
  'user_id': UUID('<tom_uuid>')}]

In [29]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
Out[29]:
[{'_id': ObjectId('581b4c5588f6630386d0ea56'),
  'data': {'battery_level_pct': 85.0, 'ts': 1478184016437.0},
  'metadata': {'key': 'background/battery',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_ts': 1478184016437.0},
  'user_id': UUID('<tom_uuid>')}]

In [30]: list(edb.get_analysis_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
Out[30]:
[{'_id': ObjectId('59a23ae188f6632233d07f2f'),
  'data': {'altitude': 0.0,
   'distance': 1.0747850121465141,
   'fmt_time': '2017-08-25T16:21:37.192000-07:00',
   'heading': 134.91688221066724,
   'idx': 81,
   'latitude': 37.3910415,
   'loc': {'coordinates': [-122.0864408, 37.3910415], 'type': 'Point'},
   'longitude': -122.0864408,
   'mode': 4,
   'section': ObjectId('59a23ae188f6632233d07edd'),
   'speed': 0.10598412720427686,
   'ts': 1503703297.192},
  'metadata': {'key': 'analysis/recreated_location',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2017-08-26T20:22:09.988126-07:00',
   'write_ts': 1503804129.988126},
  'user_id': UUID('<tom_uuid>')}]
```

I can see a couple of things that we should clean up. One which is easy and the other which should be done later.

First, we need to adjust the timestamps on the stats objects again. We fixed all the client entries but not the battery entries.

In [33]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]},"data.ts": {"$gt": now.timestamp}}).count()
Out[33]: 625387

for e in edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]}, "data.ts": {"$gt": now.timestamp}}):
         edb.get_timeseries_db().update({"_id": e["_id"]},
                 {"$set": {"data.ts": float(e["data"]["ts"])/1000,
                   "metadata.write_ts": float(e["metadata"]["write_ts"])/1000}})

In [35]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["background/battery"]},"data.ts": {"$gt": now.timestamp}}).count()
Out[35]: 0

Now the most recent entries in the timeseries should be fixed.

In [36]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", -1).limit(1))
Out[36]:
[{'_id': ObjectId('5a4aaf3088f6636e03141d7a'),
  'data': {'name': 'POST_/usercache/get',
   'reading': 0.45238590240478516,
   'ts': 1514843952.94657},
  'metadata': {'key': 'stats/server_api_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
'write_fmt_time': '2018-01-01T13:59:12.947125-08:00',
'write_ts': 1514843952.947125},
'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

In [37]: list(edb.get_timeseries_db().find({"user_id": tom_entry["uuid"]}).sort("data.ts", -1).limit(1))
Out[37]:
[{'_id': ObjectId('5a499a4688f6636e031416a9'),
  'data': {'name': 'POST_/datastreams/find_entries/timestamp',
   'reading': 0.06373810768127441,
   'ts': 1514773062.554465},
  'metadata': {'key': 'stats/server_api_time',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2017-12-31T18:17:42.554857-08:00',
   'write_ts': 1514773062.554857},
'user_id': UUID('<tom_uuid>')}]

Second, we need to move all the segmentation/raw entries from the timeseries to the analysis_timeseries. They are not doing anything bad there - we assume trips and sections are only in the analysis database, and read them only from there.
```
"segmentation/raw_trip": self.analysis_timeseries_db,
```

However, it seems like a bad idea to have weird data sitting around. Is it actually duplicated in the analysis database? If so, can we just delete them from the timeseries_db?

Are they duplicated? For Tom, yes. for me, pretty close.

For me: Times are different, although ~ 1 month from each other. Locations are different.

[{'_id': ObjectId('5614ee7d88f663584fa03131'),
'data': {'_id': ObjectId('5614ee7d88f663584fa03131'),
 'exit_fmt_time': '2015-08-21T18:06:16.905000-07:00',
 'location': {'coordinates': [-122.4426899, 37.7280596], 'type': 'Point'},

[{'_id': ObjectId('57e962fa88f66347503059e7'),
'data': {'exit_fmt_time': '2015-07-13T15:25:56.852000-07:00',
 'location': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},

For Tom: only _ids are different. Eveything else is identical.

[{'_id': ObjectId('564ed10488f66311474836bd'),
'data': {'_id': ObjectId('564ed10488f66311474836bd'),
 'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
 'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},

[{'_id': ObjectId('57e9906b88f66347503240e7'),
'data': {'exit_fmt_time': '2015-07-21T00:56:30.414000-07:00',
 'location': {'coordinates': [-122.0862835, 37.3909556], 'type': 'Point'},

How many of them are there?

  In [50]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
  Out[50]: 29023

Is there really overlap?

First timeseries entry

  In [66]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", 1), ("data.exit_ts", 1), ("data.start_ts", 1), ("data.end_ts", 1)]).limit(1))
  Out[66]:
  [{'_id': ObjectId('5674839788f66340b2fb12b9'),
    'data': {'_id': ObjectId('5674839788f66340b2fb12b9'),
     'duration': 269.63700008392334,
     'end_fmt_time': '2015-07-13T15:30:26.489000-07:00',
     'end_loc': {'coordinates': [-122.0824345, 37.3790636], 'type': 'Point'},
     'end_stop': ObjectId('5674839788f66340b2fb12bb'),
     'end_ts': 1436826626.489,
     'sensed_mode': 0,
     'source': 'SmoothedHighConfidenceMotion',
     'start_fmt_time': '2015-07-13T15:25:56.852000-07:00',
     'start_loc': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
     'start_ts': 1436826356.852,
     'trip_id': ObjectId('5674838188f66340b2fb0c9c'),
     'user_id': UUID('0763de67-f61e-3f5d-90e7-518e69793954')},
    'metadata': {'key': 'segmentation/raw_section',
     'platform': 'server',
     'time_zone': 'America/Los_Angeles',
     'write_fmt_time': '2016-04-25T06:34:22.352027-07:00',
     'write_ts': 1461591262.352027},
    'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

Last timeseries entry

  In [65]: list(edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", -1), ("data.exit_ts", -1), ("data.start_ts", -1), ("data.end_ts", -1)]).limit(1))
  Out[65]:
  [{'_id': ObjectId('571dfacb88f66333657dafb5'),
    'data': {'_id': ObjectId('571dfacb88f66333657dafb5'),
     'ending_trip': ObjectId('571dfacb88f66333657dafb4'),
     'enter_fmt_time': '2016-04-25T01:54:55.062190-07:00',
     'enter_ts': 1461574495.06219,
     'location': {'coordinates': [-122.2528321669644, 37.86827700681786],
      'type': 'Point'},
     'source': 'DwellSegmentationDistFilter',
     'user_id': UUID('788f46af-9e6d-300b-93e1-981ba9b3390b')},
    'metadata': {'key': 'segmentation/raw_place',
     'platform': 'server',
     'time_zone': 'America/Los_Angeles',
     'write_fmt_time': '2016-04-25T06:32:12.557781-07:00',
     'write_ts': 1461591132.557781},
    'user_id': UUID('43f9361e-1cb6-4026-99ba-458be357d245')}]

First analysis entry

  In [63]: list(edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", 1), ("data.exit_ts", 1), ("data.start_ts", 1), ("data.end_ts", 1)]).limit(1))
  Out[63]:
  [{'_id': ObjectId('57e9648a88f6634750306dc5'),
    'data': {'duration': 269.63700008392334,
     'end_fmt_time': '2015-07-13T15:30:26.489000-07:00',
     'end_loc': {'coordinates': [-122.0824345, 37.3790636], 'type': 'Point'},
     'end_stop': ObjectId('57e9648a88f6634750306dc7'),
     'end_ts': 1436826626.489,
     'sensed_mode': 0,
     'source': 'SmoothedHighConfidenceMotion',
     'start_fmt_time': '2015-07-13T15:25:56.852000-07:00',
     'start_loc': {'coordinates': [-122.0879696, 37.3885529], 'type': 'Point'},
     'start_ts': 1436826356.852,
     'trip_id': ObjectId('57e962fa88f66347503059e8')},
    'metadata': {'key': 'segmentation/raw_section',
     'platform': 'server',
     'time_zone': 'America/Los_Angeles',
     'write_fmt_time': '2016-09-26T11:10:18.444634-07:00',
     'write_ts': 1474913418.444634},
    'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

Last analysis entry

  In [64]: list(edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).sort([("data.enter_ts", -1), ("data.exit_ts", -1), ("data.start_ts", -1), ("data.end_ts", -1)]).limit(1))
  Out[64]:
  [{'_id': ObjectId('59db353088f6636450bb268e'),
    'data': {'duration': -1490206.003000021,
     'ending_trip': ObjectId('59db353088f6636450bb268d'),
     'enter_fmt_time': '2017-10-25T22:36:49.223000-07:00',
     'enter_ts': 1508996209.223,
     'exit_fmt_time': '2017-10-08T16:40:03.220000-07:00',
     'exit_ts': 1507506003.22,
     'location': {'coordinates': [-122.2579113, 37.873973], 'type': 'Point'},
     'source': 'DwellSegmentationTimeFilter',
     'starting_trip': ObjectId('59db353088f6636450bb268f')},
    'metadata': {'key': 'segmentation/raw_place',
     'platform': 'server',
     'time_zone': 'America/Los_Angeles',
     'write_fmt_time': '2017-10-09T01:37:04.036013-07:00',
     'write_ts': 1507538224.036013},
    'user_id': UUID('06f82876-4090-482f-a7be-91345df47bb2')}]

So it looks like we ran the pipeline in april, back when we were still storing entries to the timeseries. Then we split and re-ran the pipeline but did not delete the old entries. So the timeseries entries are from 2015-07-13T15:25:56.852000-07:00 to 2016-04-25T01:54:55.062190-07:00, while the analysis timeseries entries are from 2015-07-13T15:25:56.852000-07:00 -> 2017-10-08T16:40:03.220000-07:00. So there is a clear overlap and we can delete the entries from the timeseries.

Delete now or delete later?

Let's just delete now, while we still have backups sitting around.

In [67]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
Out[67]: 29023

In [68]:  edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/cleaned_place", "analysis/cleaned_trip", "analysis/cleaned_section", "analysis/cleaned_stop", "analysis/cleaned_untracked"]}}).count()
Out[68]: 0

In [69]:  edb.get_analysis_timeseries_db().find({"metadata.key": {"$in": ["analysis/cleaned_place", "analysis/cleaned_trip", "analysis/cleaned_section", "analysis/cleaned_stop", "analysis/cleaned_untracked"]}}).count()
Out[69]: 505455

It looks like the first run was pre-cleaned trips, so we only have to delete raw-*

In [76]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).raw_result
Out[76]: {'n': 29023, 'ok': 1.0}

In [77]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_place", "segmentation/raw_place", "segmentation/raw_section", "segmentation/raw_stop", "segmentation/raw_untracked"]}}).count()
Out[77]: 0

Ok, so now the oldest entries in the timeseries should be different.

In [78]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[78]:
[{'_id': ObjectId('564f7d6388f66343e476e832'),
  'data': {'deleted_points': [],
   'filtering_algo': 'SmoothZigzag',
   'outlier_algo': 'BoxplotOutlier',
   'section': ObjectId('564f7d3888f66343e476e518')},
  'metadata': {'key': 'analysis/smoothing',
   'platform': 'server',
   'time_zone': 'America/Los_Angeles',
   'write_fmt_time': '2015-11-20T12:06:59.574516-08:00',
   'write_local_dt': datetime.datetime(2015, 11, 20, 20, 6, 59, 574000),
   'write_ts': 1448050019.574516},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

Oops. It is now another generated result. Let's query and delete these as well.

In [79]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/smoothing"]}}).count()
Out[79]: 9033

In [82]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["analysis/smoothing"]}}).raw_result
Out[82]: {'n': 9033, 'ok': 1.0}

In [83]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["analysis/smoothing"]}}).count()
Out[83]: 0

Ok, so now the oldest entries in the timeseries should be different. Argh, we missed segmentation/raw_trip in the first query.

In [85]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).count()
Out[85]: 7815

In [86]: edb.get_timeseries_db().delete_many({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).raw_result
Out[86]: {'n': 7815, 'ok': 1.0}

In [87]: edb.get_timeseries_db().find({"metadata.key": {"$in": ["segmentation/raw_trip"]}}).count()
Out[87]: 0

Ok, so now do we have correct oldest entries in the timeseries. Yes, finally. Although it doesn't really have a ts.

In [88]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"]}).sort("data.ts", 1).limit(1))
Out[88]:
[{'_id': ObjectId('579fb85988f66357dde496ae'),
  'data': {'approval_date': '2016-07-14',
   'category': 'emSensorDataCollectionProtocol',
   'protocol_id': '2014-04-6267'},
  'metadata': {'key': 'config/consent',
   'platform': 'android',
   'read_ts': 0,
   'time_zone': 'Pacific/Honolulu',
   'type': 'rw-document',
   'write_fmt_time': '2016-08-01T06:02:09.845000-10:00',
   'write_ts': 1470067329.845},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

In [89]: list(edb.get_timeseries_db().find({"user_id": shankari_entry["uuid"], "data.ts": {"$exists": True}}).sort("data.ts", 1).limit(1))
Out[89]:
[{'_id': ObjectId('59822c3fcb17471ac0667b86'),
  'data': {'accuracy': 1086.116,
   'altitude': 0,
   'elapsedRealtimeNanos': 7479231065135,
   'filter': 'time',
   'fmt_time': '1969-12-31T16:00:00-08:00',
   'heading': 0,
   'latitude': -23.56214940547943,
   'loc': {'coordinates': [-46.72179579734802, -23.56214940547943],
    'type': 'Point'},
   'longitude': -46.72179579734802,
   'sensed_speed': 0,
   'ts': 0},
  'metadata': {'key': 'background/location',
   'platform': 'android',
   'read_ts': 0,
   'time_zone': 'America/Los_Angeles',
   'type': 'sensor-data',
   'write_fmt_time': '2017-08-02T12:01:48.909000-07:00',
   'write_ts': 1501700508.909},
  'user_id': UUID('ea59084e-11d4-4076-9252-3b9a29ce35e0')}]

shankari commented 6 years ago

ok, so now that we believe that that database is fine, we can run the pipeline again for the first time in forever.

unmount the transfer drive now that its job is done
```
$ sudo umount /transfer
$
```
detach volume
while that is running, turn on the pipeline after multiple months. After this runs successfully at least once, we can put it into a cronjob.

Note that we now have entries in the timeseries that are client stats only and have no uuid entry or any real data. These are zombie entries from before the uuid change. Are we ignoring these correctly?

A quick check shows that we just read from the UUID database.

def get_all_uuids():
    all_uuids = [e["uuid"] for e in edb.get_uuid_db().find()]
    return all_uuids

There are other methods that still use distinct, notably aggregate_timeseries.get_distinct_users and builtin_timeseries.get_uuid_list, but they don't seem to be used anywhere in the code. Let's remove them as part of cleanup so people are not tempted to use them. Alternatively, we can move zombie entries into a separate timeseries where they won't pollute anything.

Ok, so let's run this script!

$ ./e-mission-ipy.bash bin/intake_multiprocess.py 4 > /log/intake.stdinout.log 2>&1
$ date
Thu Jan  4 08:17:24 UTC 2018

It's taking a long time to even just get started. I wonder if we are using distinct somewhere...

Looking at the launcher logs, it is still iterating through the users and querying for the number of entries in the usercache. I don't even think we use that functionality and can probably get rid of it in the next release.

Ah so how processes have been launched. Hopefully this first run will be done by tomorrow morning.

Now check back on the volume - it has been detached
Delete both volumes and the snapshot. At this point, there is no unencrypted copy of the data. Will delete other server with encrypted data once everything is done.

shankari commented 6 years ago

We should also remove the filter_accuracy step from the real server since it is only applicable for test phones/open data.

shankari commented 6 years ago

Habitica API is at port 3000, so I have to open it as an outgoing port
Performance is significantly better. If filter accuracy is removed, then my data for 3 months is processed in ~ 15 mins. We might be able to get back to running every hour.

2018-01-04T12:02:11.764406+00:00**********UUID <shankari_uuid>: moving to long term**********
2018-01-04T12:04:29.687814+00:00**********UUID <shankari_uuid>: filter accuracy if needed**********
2018-01-04T12:26:01.563703+00:00**********UUID <shankari_uuid>: segmenting into trips**********
2018-01-04T12:33:06.101569+00:00**********UUID <shankari_uuid>: segmenting into sections**********
2018-01-04T12:33:33.701075+00:00**********UUID <shankari_uuid>: smoothing sections**********
2018-01-04T12:33:54.020444+00:00**********UUID <shankari_uuid>: cleaning and resampling timeline**********
2018-01-04T12:40:12.439267+00:00**********UUID <shankari_uuid>: checking active mode trips to autocheck habits**********

I'm also seeing some errors with saving data, need to make a pass through the errors.


Got error None while saving entry AttrDict({'_id': ObjectId('59f11f15cb17471ac0cfc059'), 'metadata': {'write_ts': 1508533079.587702, 'plugin': 'none', 'time_zone': 'America/Montreal', 'platform': 'ios', 'key': 'statemachine/transition', 'read_ts': 0, 'type': 'message'}, 'user_id': UUID('e95bfd0b-1cfc-4ea3-af01-c41dc2fad0ed'), 'data': {'transition': None, 'ts': 1508533079.587526, 'currState': 'STATE_ONGOING_TRIP'}}) -> None

shankari commented 6 years ago

AMPLab is using too many resources, so I have to trim my consumption. Let's finally finish cleaning up the old servers

shankari commented 6 years ago

Copied dump_team_trajectories.py off the old server. There shouldn't be anything else. One caveat is that the original data is 150 GB

143G    /mnt/e-mission-primary-db/mongodb

But the new dataset is only 20GB.

/dev/mapper/xvdf  3.0T   20G  3.0T   1% /data
/dev/mapper/xvdg   25G  633M   25G   3% /journal
/dev/mapper/xvdh   10G  638M  9.4G   7% /log

So what is missing?

shankari commented 6 years ago

It doesn't appear to be the data. These are the only differences in collections between the old and new databases. There is nothing missing except for system.indexes, which should be an auto-generated collection.

shankari commented 6 years ago

Yup! system.indexes is now deprecated. https://docs.mongodb.com/manual/reference/system-collections/#%3Cdatabase%3E.system.indexes

Deprecated since version 3.0: Access this data using listIndexes.

And listIndexes does have the data.

> db.Stage_timeseries.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "user_id" : "hashed"
                },
                "name" : "user_id_hashed",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "metadata.key" : "hashed"
                },
                "name" : "metadata.key_hashed",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "metadata.write_ts" : -1
                },
                "name" : "metadata.write_ts_-1",
                "ns" : "Stage_database.Stage_timeseries"
        },
        {
                "v" : 2,
                "key" : {
                        "data.ts" : -1
                },
                "name" : "data.ts_-1",
                "ns" : "Stage_database.Stage_timeseries",
                "sparse" : true
        },
...

shankari commented 6 years ago

Shutting down the old server now. RIP! You were a faithful friend and will be missed.

shankari commented 6 years ago

In the past 4 days, the compute has increased by $50. The storage has increased by $200. We need to turn off some storage. Wah! Wah! What if I lose something important?! I guess you just have to deal with it...

shankari commented 6 years ago

Deleted all related storage.

shankari commented 6 years ago

Even with all the deletions, we spent ~ $50/day. This is a problem, because we will then end up spending an additional $1040 for the rest of the month, and we have already spent ~ $1500. This also means that we won't be under $1000 for next month.

Since our reserved instances already cost $507, the m3/m4 legacy servers cost ~ $639, we have to keep our storage budget to under $500 to stay at my preferred 50% of my available budget.

The storage is mostly going towards the provisioned iOPS storage. I don't think I actually need 3GB.

Current storage is

Filesystem        Size  Used Avail Use% Mounted on
devtmpfs           30G   84K   30G   1% /dev
tmpfs              30G     0   30G   0% /dev/shm
/dev/xvda1        7.8G  2.3G  5.5G  30% /
/dev/mapper/xvdf  3.0T   21G  3.0T   1% /data
/dev/mapper/xvdg   25G  633M   25G   3% /journal
/dev/mapper/xvdh   10G  638M  9.4G   7% /log

We should be able to drop to:

/data: 200G

If we still need to reduce after that, we can change to:

/journal: 10G
/log: 5G

Let's see how easy it is to resize EBS volumes

shankari commented 6 years ago

Let's see how easy it is to resize EBS volumes

Not too hard, you just have to copy the data around. https://matt.berther.io/2015/02/03/how-to-resize-aws-ec2-ebs-volumes/ Let's turn off everything and start copying the data tomorrow morning

shankari commented 6 years ago

This was a bit trickier than one would expect because xfs does not support resize2fs and in fact, does not support shrinking the file system at all. So we had to follow the instructions to use the temp instance, as in the link above.

Note that our data is also on an encrypted filesystem, so our steps were:

unmount + turn encryption off for disk
detach
create new volume
create new instance
attach both volumes
mount old_data
crypt setup new volume
mount new data
copy data over using xfsdump and xfsrestore. Note that xfsdump does not support a / at the end of the mounted filename
chmod new_data to the same ids as old_data

At this point, the only diff between them is

--- /tmp/old_data_list  2018-07-11 10:07:36.038063450 +0000
+++ /tmp/new_data_list  2018-07-11 10:07:24.266143934 +0000
@@ -140,7 +140,7 @@
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-94-2297001609533616747.wt
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-96-2297001609533616747.wt
 -rw-r--r-- 1  498  497 4.0K Jul 11 09:23 index-98-2297001609533616747.wt
-lrwxrwxrwx 1 root root    8 Jan  1  2018 journal -> /journal
+lrwxrwxrwx 1 498 497    8 Jul 11 09:53 journal -> /journal
 -rw-r--r-- 1  498  497  48K Jul 11 09:23 _mdb_catalog.wt
 -rw-r--r-- 1  498  497    0 Jul 11 09:23 mongod.lock
 -rw-r--r-- 1  498  497  36K Jul 11 09:23 sizeStorer.wt

which makes sense

So now it's time to reverse the steps and attach new_data back to the server

shankari commented 6 years ago

Reversed steps, restarted server. No errors so far. Deleting old disk and migration instance.

shankari commented 6 years ago

Done. Closing this issue for now.

shankari commented 6 years ago

Burn rate is now $33/day 1419.11 - 1385.76 = 33.35

Should go down after we turn off air quality server but 33 * 15 ~ $500 so we are on track for $2000 for the month, not $1500 as originally planned

shankari commented 6 years ago

Burn rate is now roughly $13/day (1497 - 1419 = 78 over 6 days = $13/day) So this month should be $1497 + $156 = $1653 Next month should be 524 (reserved instances) + 13 * 30 = 524 + 390 = 914 (< $1000)