e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Remove 1:1 mapping between raw and filtered in the cleaning code #211

Open shankari opened 8 years ago

shankari commented 8 years ago

The 1:1 mapping between raw and filtered is really biting us: we can't insert a new UNKNOWN section to represent unknown interpolation, for example. We can skip data while cleaning, but we cannot add any. That is a limitation that needs to be addressed.

See https://github.com/e-mission/e-mission-server/issues/378#issuecomment-245962837 for an example

shankari commented 8 years ago

It really looks like this is the main issue left. It covers both https://github.com/e-mission/e-mission-server/issues/288#issuecomment-246049222 and https://github.com/e-mission/e-mission-server/issues/378#issuecomment-246036383 (assuming it was a real trip and not a turned off phone).

Basically, we know that there was a trip. We can extrapolate it to the end of the previous trip, but "if it is too far" (for some definition of far), it is not really the same section. We don't want to end up with walking trips of 40 or 55 km.

Clearly 40 or 55 is too large, but what is reasonable? We can't really use the % as an argument because we frequently use this for really short trips, like some of the iPhone ones that prompted the extrapolation in the first place.

It seems reasonable that if the speed of the extrapolated section is not consistent with the speed of the existing section, it should be split out. Doesn't have to depend on actual modes, just consistency. Basically, if it is not possible to extrapolate that section because the domain is different, let's make a new section...

shankari commented 8 years ago

Need to restructure a bunch of code to do that though. And we need to define consistency.

shankari commented 8 years ago

One way of defining consistency is to use the same outlier detection strategy that we used for the zigzags (basically, the 75th percentile plus a MAJOR or MINOR multiple of the IQR).
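
A minimal sketch of that kind of threshold, assuming nothing about the actual server code: the function names and the multiplier values below are hypothetical (the thread names MAJOR and MINOR but does not give their values), and the quartiles come from Python's standard library rather than whatever the zigzag code uses.

```python
from statistics import quantiles

# Hypothetical multipliers; the real MAJOR/MINOR values are not stated here.
MINOR = 1.5
MAJOR = 3.0

def outlier_threshold(speeds, multiplier=MAJOR):
    """Return the 75th percentile plus `multiplier` times the IQR."""
    q1, _, q3 = quantiles(speeds, n=4, method="inclusive")
    return q3 + multiplier * (q3 - q1)

def is_consistent(extrapolated_speed, section_speeds, multiplier=MAJOR):
    """True if the extrapolated speed is not an outlier for this section."""
    return extrapolated_speed <= outlier_threshold(section_speeds, multiplier)
```

Note that this only compares speeds against the section's own distribution, so it does not depend on the sensed mode.
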

If only the code restructuring were so easy!

shankari commented 8 years ago

Code restructuring is really hard. We're set up for dropping/merging but not for expanding, and the dropping/merging part has been pretty hard too. We may consider just setting the entire section as UNKNOWN if the initial speed is an outlier.

Let's try that and see how it works
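
Roughly, the fallback could look like the sketch below. Everything here is an assumption for illustration: `resolve_section_mode`, its parameters, and the string mode labels are hypothetical, and the threshold is taken as given however it is computed.

```python
UNKNOWN = "UNKNOWN"

def resolve_section_mode(sensed_mode, initial_speed, speed_threshold):
    """Keep the sensed mode unless the initial (extrapolated) speed is an outlier.

    Instead of splitting out a new section, the whole section is relabelled
    UNKNOWN, which keeps the 1:1 mapping between raw and cleaned sections.
    """
    if initial_speed > speed_threshold:
        return UNKNOWN
    return sensed_mode
```

The tradeoff is that a mostly valid section loses its mode because of one bad extrapolated segment, which is why splitting the section is still the better long-term fix.
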

shankari commented 8 years ago

Quick check on existing instances, as observed by the previous check.

2015-07-20T12:27:40.288000-07:00 -> 2015-07-20T12:28:08.270000-07:00 
Trying to extrapolate 9053.42681795 > 2 * original 22.5800646151, resetting add_dist = 45.1601292302
-------
2015-07-22T17:54:21.295000-07:00 -> 2015-07-22T17:54:21.295000-07:00
Trying to extrapolate 152.078210444 > 2 * original 0, resetting add_dist = 0
-------
2015-08-10T13:33:10.166000-07:00 -> 2015-08-10T13:38:19.079000-07:00
Trying to extrapolate 222.200772719 > 2 * original 61.2697758242, resetting add_dist = 122.539551648
-------
2015-08-15T08:01:36.117000-07:00 -> 2015-08-15T08:24:22.215000-07:00
Trying to extrapolate 21136.3439166 > 2 * original 263.981657593, resetting add_dist = 527.963315187
-------
2015-09-29T10:17:23-07:00 -> 2015-09-29T10:18:53-07:00
Trying to extrapolate 7587.72883461 > 2 * original 0.14422679578, resetting add_dist = 0.288453591559
-------
2015-11-06T14:57:15-08:00 -> 2015-11-06T15:05:38-08:00
Trying to extrapolate 55970.668353 > 2 * original 21.463919706, resetting add_dist = 42.927839412
-------
2015-11-06T18:39:34-08:00 -> 2015-11-06T20:54:31-08:00
Trying to extrapolate 55974.4351837 > 2 * original 10.3389658458, resetting add_dist = 20.6779316916

-------
2016-08-26T10:47:23-07:00 -> 2016-08-26T10:47:23-07:00
Trying to extrapolate 186.035903544 > 2 * original 0, resetting add_dist = 0

-------
2016-08-26T12:22:30-07:00 -> 2016-08-26T12:28:08-07:00
Trying to extrapolate 223.156442709 > 2 * original 21.908850429, resetting add_dist = 43.8177008579

-------
2016-08-26T18:01:04-07:00 -> 2016-08-26T18:10:03-07:00
Trying to extrapolate 35674.1330809 > 2 * original 479.985531524, resetting add_dist = 959.971063048
-------
2016-09-03T08:12:48.494000-07:00 -> 2016-09-03T08:16:04-07:00
Trying to extrapolate 2428.80580027 > 2 * original 1179.72856372, resetting add_dist = 2359.45712743
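
The log lines above imply a simple cap: if the extrapolated distance is more than twice the original section distance, `add_dist` is reset to twice the original. A sketch of that rule, with a hypothetical function name and the factor of 2 read off the log messages:

```python
def cap_extrapolation(extrapolated_dist, original_dist, max_ratio=2):
    """Limit the distance added by extrapolation to max_ratio * original."""
    limit = max_ratio * original_dist
    if extrapolated_dist > limit:
        return limit  # corresponds to "resetting add_dist" in the log
    return extrapolated_dist

# First log entry: 9053.4 m extrapolated against a 22.6 m section.
add_dist = cap_extrapolation(9053.42681795, 22.5800646151)
```

For a zero-length original section the cap is 0, which matches the entries where `add_dist` is reset to 0.
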

[Screenshots: Example 1, Example 2, Example 3; two simulator screenshots each, taken Sep 9 2016 between 6:46 pm and 6:49 pm]

shankari commented 8 years ago

Logs highlighting trips that are "weird". Useful for checking and adding unit tests later: too_much_extrapolation.log.zip

shankari commented 8 years ago

Here are the examples above after the fix for setting the whole section as UNKNOWN and for https://github.com/e-mission/e-mission-server/issues/378#issuecomment-246036383.

[Screenshots: Example 1, Example 2, Example 3; two simulator screenshots each, taken Sep 9 2016 between 8:41 pm and 8:50 pm]

shankari commented 8 years ago

To understand the differences on 6th Nov 2015 and 3 Sep 2016, let's check whether there was some untracked time around those times.

3 Sep 2016

Yup!

{'data': Untrackedtime({u'distance': 1901.1444322143084,
'end_place': ObjectId('57d37e86f6858f7be0293f02'),
u'start_loc': {u'type': u'Point', u'coordinates': [-122.0864147, 37.3908493]},
u'end_ts': 1472915568.494,
u'start_ts': 1472875421,
u'start_fmt_time': u'2016-09-02T21:03:41-07:00',
u'end_loc': {u'type': u'Point', u'coordinates': [-122.0873396, 37.3737677]},
u'source': u'DwellSegmentationTimeFilter',
'start_place': ObjectId('57d37e86f6858f7be0293f01'),
u'end_fmt_time': u'2016-09-03T08:12:48.494000-07:00',
u'duration': 40147.49399995804,
'_id': ObjectId('57d37c9cf6858f7be0293760'),
'key': 'analysis/cleaned_untracked'}

6th Nov 2015

Nope, no untracked time. A really long time at a place instead. Let's check if that is legit.

2016-09-09 20:31:21,864:DEBUG:Inserting entry Entry({'data': Cleanedplace(
{u'enter_fmt_time': u'2015-11-06T20:54:31-08:00',
'display_name': u'South Shoreline Boulevard, Mountain View',
'exit_fmt_time': '2015-11-08T11:22:29-08:00',
'ending_trip': ObjectId('57d379bbf6858f7be0280841'),
'starting_trip': ObjectId('57d379bff6858f7be0280af9'),
u'source': u'DwellSegmentationTimeFilter',
u'location': {u'type': u'Point', u'coordinates': [-122.0862597, 37.3909335]},
u'enter_ts': 1446872071,
'duration': 138478,
'raw_places': [ObjectId('57d37245f6858f616234ee51'),
ObjectId('57d37245f6858f616234ee51'),
ObjectId('57d37245f6858f616234ee53')],
'exit_ts': 1447010549}),
'_id': ObjectId('57d37cfbf6858f7be02939dc'),
'key': 'analysis/cleaned_place'})}) into timeseries
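
As a quick sanity check on the entry above, the formatted times, the timestamps and the duration can be cross-checked with the standard library. This is just arithmetic on the values copied from the log, not part of the pipeline:

```python
from datetime import datetime

# Values copied from the Cleanedplace entry above.
enter_fmt_time = "2015-11-06T20:54:31-08:00"
exit_fmt_time = "2015-11-08T11:22:29-08:00"
enter_ts = 1446872071
exit_ts = 1447010549
duration = 138478

enter_dt = datetime.fromisoformat(enter_fmt_time)
exit_dt = datetime.fromisoformat(exit_fmt_time)

# The formatted times and the epoch timestamps agree with each other.
ts_match = (enter_dt.timestamp() == enter_ts) and (exit_dt.timestamp() == exit_ts)
# The stored duration is exit minus enter, a bit over 38 hours.
duration_match = (exit_ts - enter_ts == duration)
hours_at_place = duration / 3600
```

So the place entry is internally consistent; whether the stay itself is real is a separate question.
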

Looks legit to me!

2016-09-09 19:19:43,775:DEBUG:------------------------------2015-11-06T20:54:01-08:00------------------------------
2016-09-09 19:19:43,782:DEBUG:Too few points to make a decision, continuing
2016-09-09 19:19:43,783:DEBUG:------------------------------2015-11-06T20:54:31-08:00------------------------------
2016-09-09 19:19:43,791:DEBUG:Too few points to make a decision, continuing
...
2016-09-09 19:19:43,837:DEBUG:------------------------------2015-11-06T20:57:01-08:00------------------------------
2016-09-09 19:19:43,849:DEBUG:last5MinsDistances.max() = 3.63353401878, last10PointsDistance.max() = 3.63353401878
2016-09-09 19:19:43,851:DEBUG:Appending last_trip_end_point AttrDict({u'loc': {u'type': u'Point', u'coordinates': [-122.0862597, 37.3909335]}, u'ts': 1446872071.0, u'fmt_time': u'2015-11-06T20:54:31-08:00', '_id': ObjectId('563d867d7d65cb39ee9a8d79')) with index 21220 
2016-09-09 19:19:43,851:INFO:Found trip end at 2015-11-06T20:54:31-08:00
2016-09-09 19:19:43,853:DEBUG:------------------------------2015-11-08T11:20:59-08:00------------------------------
2016-09-09 19:19:43,855:DEBUG:Setting new trip start point AttrDict({u'loc': {u'type': u'Point', u'coordinates': [-122.0862367, 37.3909977]}, u'ts': 1447010459.0, u'fmt_time': u'2015-11-08T11:20:59-08:00', '_id': ObjectId('563fa3127d65cb39ee9a92ef')}) with idx 21227

And the location is practically identical. I guess I stayed overnight for two nights in Berkeley?

In [2]: ecc.calDistance([-122.0862597, 37.3909335], [-122.0862367, 37.3909977])
Out[2]: 7.422267207749954
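
For reference, a standalone haversine computation gives about the same answer. This is a sketch that assumes `[longitude, latitude]` ordering (as in the GeoJSON points above) and a mean Earth radius of 6371000 m; the actual `ecc.calDistance` implementation may differ.

```python
import math

EARTH_RADIUS_M = 6371000  # assumed mean Earth radius in metres

def haversine_distance(point1, point2):
    """Great-circle distance in metres between two [lon, lat] points."""
    lon1, lat1 = map(math.radians, point1)
    lon2, lat2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Trip end point and next trip start point from the logs above.
d = haversine_distance([-122.0862597, 37.3909335], [-122.0862367, 37.3909977])
```

The two points are only a few metres apart, which is well within normal GPS noise.
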
shankari commented 8 years ago

Fixed as part of https://github.com/e-mission/e-mission-server/commit/112a4af4a63a609b65281721c9764be911083827, tests fixed as part of https://github.com/e-mission/e-mission-server/commit/94ec9951fd0a6e7820d42d2b83397070ee3073cf

shankari commented 8 years ago

This issue is now tracking the enhancement of splitting the section.