Fix time/distance calculations if many points at the beginning are missing

e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.

https://e-mission.readthedocs.io/en/latest

BSD 3-Clause "New" or "Revised" License

15 stars 34 forks source link

Fix time/distance calculations if many points at the beginning are missing #189

Closed shankari closed 5 years ago

shankari commented 8 years ago

Reported by @gtfierro.

Trip to Soda was 0.15 km, 3 mins Trip from Soda was 1km, 10 mins

Probably because we start distance calculation from the first point in the trip, not the place. If we want to fix this for the time too (and we should) need to extrapolate leaving time from speed for the rest of the trip.

shankari commented 8 years ago

I loaded data for Gabe for Jun 14-17 and Jun 21, and did not find a huge difference between the trip to Soda and from Soda.

Unfortunately, I did not record the date/time of these trips.

So the alternatives are:

this is a different issue, such as #345. That was the problem with the trip to the train station to pick to Anirudh - we skipped the trip and then joined home to the station. After the fix for #345, we now see the trip to the train station as well
the issue is from a day that I didn't record

I then looked at iOS, since it does not support a geofence radius and frequently generates points that are fairly far away.

While looking at the example that motivated the prior fixes to the section segmentation on iOS (https://github.com/e-mission/e-mission-server/pull/233), which is on 22 Feb 2016 from phone c528bcd2-a88b-3e82-be62-ef4f2396967a (iphone-3), I found several examples for short trips.

Kaiser to Cognition is 75m?	To bookstore 356m, from bookstore 737m

shankari commented 8 years ago

So the castro street one is because we only exited the geofence at Yosemite, three quarters of the way there.

>>> filtered_data_df[["fmt_time", "ts", "latitude", "longitude"]]

    fmt_time    ts  latitude    longitude
6   2016-02-22T10:06:20.206589-08:00    1456164380.207  37.387  -122.084
7   2016-02-22T10:06:21.940137-08:00    1456164381.940  37.387  -122.083
8   2016-02-22T10:07:00.000028-08:00    1456164420.000  37.386  -122.083

Points for Kaiser trip	Expected value (418m/3 mins) v/s (75m/40 secs)

shankari commented 8 years ago

For the trip to the bookstore,

First point was almost near the library	Expected value (885m/7 mins) versus (356m/7 mins)

Note that the trip back (with no geofence exit) was much closer to the distance (737m) but way off on the time (13m). I have found in the past that trip planners tend to overestimate my speed and I suspect that is what happened here. I suspect that the distance undercounting was because of not map matching, so we effectively "cut the corner".

shankari commented 8 years ago

I have now made fixes which I will be checking in shortly. However, even with the fixes, there are minor differences between there and back - e.g.

trip compare	first detail	second detail

It looks like the driving distance is consistent, but the walking distance on the trip back is almost twice as much as the trip there. This is because of a lot of zigzags at the start of the trip back, which may even be real - I think that we spent some time running on the track on the way back. That accounts for half (4 mins) of the 8 min difference. And the remaining 8 mins seems pretty legit - we travelled 4km in both, but one took 4 mins longer than the other. But, and here's the cool part, the speed profile show that max speed on the way over was $\approx$ 15 $m/s$ while the max speed over was $\approx$ 22 $m/s$. And note that if we got a zigzag while walking, then we must have started getting real points while walking which means that this is not extrapolation any more. So this difference seems legit - the trip back really was slower than the trip over.

2016-08-24 12:55:43,054:DEBUG:Found first section, need to extrapolate start point
2016-08-24 12:55:43,057:DEBUG:Start place is 57bdfbbcf6858fca5680bae4 and corresponding location is 576876b54a9bba50e3faac8e
2016-08-24 12:55:43,071:DEBUG:Adding distance 213.920895609 to original 4453.79714135 to extend section start from [-122.0853867, 37.3926637] to [-122.086605, 37.3910011]
2016-08-24 12:55:43,073:DEBUG:After subtracting time 38.3792739633 from original 541.609999895 to cover additional distance 213.920895609 at speed 5.57386509744, new_start_ts = 1466462970.28
2016-08-24 12:55:43,075:DEBUG:Setting section start values of Cleanedsection({u'sensed_mode': 0, u'end_ts': 1466463550.27, u'end_fmt_time': u'2016-06-20T15:59:10.270000-07:00', 'trip_id': ObjectId('57bdfbbef6858fca5680bb4c')}) to first point Location({'fmt_time': '2016-06-20T15:49:30.280726-07:00'})
--
2016-08-24 12:55:43,566:DEBUG:Found first section, need to extrapolate start point
2016-08-24 12:55:43,568:DEBUG:Start place is 57bdfbbcf6858fca5680bae8 and corresponding location is 576885484a9bba50e3faafe6
2016-08-24 12:55:43,586:DEBUG:Adding distance 282.357351097 to original 2202.34899663 to extend section start from [-122.1081566, 37.4142676] to [-122.1082317, 37.4168062]
2016-08-24 12:55:43,588:DEBUG:After subtracting time 55.6358303574 from original 441.741000175 to cover additional distance 282.357351097 at speed 5.07509907344, new_start_ts = 1466466614.05
2016-08-24 12:55:43,590:DEBUG:Setting section start values of Cleanedsection({u'sensed_mode': 2, u'end_ts': 1466467111.423, u'end_fmt_time': u'2016-06-20T16:58:31.423000-07:00', 'trip_id': ObjectId('57bdfbbff6858fca5680bb6f')}) to first point Location({'fmt_time': '2016-06-20T16:50:14.046170-07:00'})

Note further from the logs that the difference between the "adjustment" for the two trips was only around 30 secs. Which again means that the bulk of the difference is from actually measured points. This is legit!

shankari commented 8 years ago

FYI, if we load data for multiple days so that we actually start from a place instead of the beginning of a chain, then the trip to the optometrist is perfect!

Trip to optometrist	Trip to bookstore	Trip to school

Note that this still has some anomalies. In particular, the trip from home to the optometrist is 700m while a trip from home to the bookstore, which is further, is only 315m. I suspect that this is an iOS to android difference in the first recorded point for the section.

Hm. Looking at the logs, it looks like the last two trips (to the bookstore) have median speed 0. Since the distance doesn't depend on speed, this should not affect the distance calculated. But the distance calculations are the ones that seem to be off between the two trips.

2016-08-24 22:14:13,462:DEBUG:After subtracting time 0 from original 412.914735794 to cover additional distance 320.177270563 at speed 0.0, new_start_ts = 1456198902.09
2016-08-24 22:14:13,833:DEBUG:After subtracting time 0 from original 42.9999988079 to cover additional distance 50.2941394316 at speed 0.0, new_start_ts = 1456199930.0

shankari commented 8 years ago

After re-generating ground truth for all existing tests, all of them passed, EXCEPT testFeb22MultiSyncEndNotDetected, which failed because there were fewer speed entries than expected.

Investigating it further.

So the issue is with this section. Everything before this is fine.

    "feature_type": "section", 
    "sensed_mode": "MotionTypes.IN_VEHICLE", 
    "end_ts": 1456189018.000047, 
    "start_ts": 1456188577.7213736, 
    "start_fmt_time": "2016-02-22T16:49:37.721374-08:00", 
    "source": "SmoothedHighConfidenceMotion", 
    "end_fmt_time": "2016-02-22T16:56:58.000047-08:00",

    "feature_type": "section", 
    "sensed_mode": "MotionTypes.IN_VEHICLE", 
    "end_ts": 1456189018.000047, 
    "start_ts": 1456188567.1974702, 
    "start_fmt_time": "2016-02-22T16:49:27.197470-08:00", 
    "source": "SmoothedHighConfidenceMotion", 
    "end_fmt_time": "2016-02-22T16:56:58.000047-08:00",

Note that:

this is the trip back from karate. Should be a fairly straightforward single section trip.

Expected

2016-08-25 07:35:25,671:DEBUG:140735159468032:keys = (segmentation/raw_stop, segmentation/raw_section), len(places) = 0, len(trips) = 1
2016-08-25 07:35:25,673:DEBUG:140735159468032:Saving entries into cleaned section Cleanedsection({u'sensed_mode': 0, u'end_ts': 1456189018.000047, u'end_loc': {u'type': u'Point', u'coordinates': [-122.0861964953338, 37.39118427507336]}, u'source': u'SmoothedHighConfidenceMotion', u'end_fmt_time': u'2016-02-22T16:56:58.000047-08:00', u'end_local_dt': {u'hour': 16, u'month': 2, u'second': 58, u'weekday': 0, u'year': 2016, u'timezone': u'America/Los_Angeles', u'day': 22, u'minute': 56}, 'trip_id': ObjectId('57bf022df6858f153fc099b5')})
2016-08-25 07:35:25,726:DEBUG:140735159468032:Found first section, need to extrapolate start point
2016-08-25 07:35:25,728:DEBUG:140735159468032:Start place is 57bf022bf6858f153fc09971 and corresponding location is 56cbcb4aeaedff78c763657d
2016-08-25 07:35:25,755:DEBUG:140735159468032:Adding distance 50.9919317646 to original 4903.32301433 to extend section start from [-122.1075074003284, 37.41650822575895] to [-122.1080831532572, 37.41654267538098]
2016-08-25 07:35:25,757:DEBUG:140735159468032:After subtracting time 3.8025197396 from original 447.000056982 to cover additional distance 50.9919317646 at speed 13.4100373585, new_start_ts = 1456188567.2
2016-08-25 07:35:25,759:DEBUG:140735159468032:Resampling data of size 88
2016-08-25 07:35:25,782:DEBUG:140735159468032:After resampling, using 17 points

Actual

2016-08-25 07:27:58,256:DEBUG:140735159468032:Saving entries into cleaned section Cleanedsection({u'sensed_mode': 0, u'end_ts': 1456189018.000047, u'end_loc': {u'type': u'Point', u'coordinates': [-122.0861964953338, 37.39118427507336]}, u'source': u'SmoothedHighConfidenceMotion', u'end_fmt_time': u'2016-02-22T16:56:58.000047-08:00', u'end_local_dt': {u'hour': 16, u'month': 2, u'second': 58, u'weekday': 0, u'year': 2016, u'timezone': u'America/Los_Angeles', u'day': 22, u'minute': 56}, 'trip_id': ObjectId('57bf006ef6858f1228965877')})
2016-08-25 07:27:58,338:DEBUG:140735159468032:Adding distance 98.3841738709 to original 4851.28821369 to extend section start from [-122.1070484911297, 37.41621473341939] to [-122.1080831532572, 37.41654267538098]
2016-08-25 07:27:58,340:DEBUG:140735159468032:After subtracting time 7.27861735811 from original 433.000056028 to cover additional distance 98.3841738709 at speed 13.516876768, new_start_ts = 1456188577.72
2016-08-25 07:27:58,342:DEBUG:140735159468032:Resampling data of size 87
2016-08-25 07:27:58,369:DEBUG:140735159468032:After resampling, using 16 points
2016-08-25 07:27:58,414:INFO:140735159468032:removed 0 entries containing n/a

It seems pretty clear that we are one point off in the actual (87 instead of 88). Note also that the start point is slightly off ([-122.1070484911297, 37.41621473341939] instead of [-122.1075074003284, 37.41650822575895]). This results in a slightly different adjustment (7 secs instead of 3 secs) and a slightly different new start ts (1456188577.72 instead of 1456188567.2).

So it looks like we need to see why the points are slightly different. Remaining question: if the new start_ts is different, why didn't the start_ts check fail before this?

shankari commented 8 years ago

Remaining question: if the new start_ts is different, why didn't the start_ts check fail before this?

Because of the fuzz. Recall that when we worked on this last, we were unable to get it to work perfectly, which is why we introduced the fuzz factor for distance and time. But if we introduce a 60 sec fuzz factor for time, and resample 30 secs, our points may be off by up to 2 points.

We need to add a fuzz factor for points as well.

shankari commented 8 years ago

The PR to fix this caused this test to fail.

======================================================================
FAIL: testCountTimestampMetrics (__main__.TestMetrics)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "emission/tests/netTests/TestMetrics.py", line 58, in testCountTimestampMetrics    self.assertEqual(user_met_result[0].IN_VEHICLE, 3)
AssertionError: 4 != 3

----------------------------------------------------------------------
Ran 1 test in 39.801s

FAILED (failures=1)

This was a bit surprising because we hadn't actually changed the segmentation in any way. So we investigated it some more.

And the answer will shock you!

It was all about the timezones!

Note that for timestamp based queries, we roll over at midnight UTC = 4pm PST = 5pm PDT. This was Aug, so PDT. And lo and behold, we had a trip that started just before 5pm. I speculate that before we fixed it, the trip had started after 5pm and so was not counted for the metrics. After we fixed it, it does.

simulator screen shot aug 25 2016 1 14 29 pm

Looking at the logs, the discrepancy is from the 27th.

'user_metrics': [
ModeStatTimeSummary({'ON_FOOT': 4 'BICYCLING': 2 'nUsers': 1 'fmt_time': '2015-08-27T00:00:00+00:00' 'ts': 1440633600 'IN_VEHICLE': 4})
ModeStatTimeSummary({'ON_FOOT': 2 'BICYCLING': 2 'nUsers': 1 'fmt_time': '2015-08-28T00:00:00+00:00' 'ts': 1440720000 'IN_VEHICLE': 1})]

We can see that the raw trip started at 17:03 instead of 16:52.

2016-08-25 12:21:55,526:DEBUG:140735159468032:Considering trip 57bf4550f6858f6e59c6798f: 2015-08-27T08:18:27.867000-07:00 -> 2015-08-27T08:30:08.302000-07:00 
2016-08-25 12:21:55,526:DEBUG:140735159468032:Considering trip 57bf4550f6858f6e59c67991: 2015-08-27T08:34:22.943000-07:00 -> 2015-08-27T08:41:16.784000-07:00 
2016-08-25 12:21:55,527:DEBUG:140735159468032:Considering trip 57bf4550f6858f6e59c67993: 2015-08-27T08:51:58.768000-07:00 -> 2015-08-27T09:53:44.894000-07:00 
2016-08-25 12:21:55,527:DEBUG:140735159468032:Considering trip 57bf4550f6858f6e59c67995: 2015-08-27T10:05:52.989000-07:00 -> 2015-08-27T11:14:26.669000-07:00 
2016-08-25 12:21:55,527:DEBUG:140735159468032:Considering trip 57bf4550f6858f6e59c67997: 2015-08-27T11:25:33.687000-07:00 -> 2015-08-27T11:27:50.129000-07:00 
2016-08-25 12:21:55,528:DEBUG:140735159468032:Considering trip 57bf4550f6858f6e59c67999: 2015-08-27T15:59:27.376000-07:00 -> 2015-08-27T16:54:59.470000-07:00 
2016-08-25 12:21:55,529:DEBUG:140735159468032:Considering trip 57bf4550f6858f6e59c6799b: 2015-08-27T17:03:59.012000-07:00 -> 2015-08-27T17:55:34.898000-07:00 
2016-08-25 12:21:55,529:DEBUG:140735159468032:Considering trip 57bf4550f6858f6e59c6799d: 2015-08-27T19:21:59.971000-07:00 -> 2015-08-27T19:32:22.709000-07:00

And yes, we have a large gap (4km!) in the start, so we subtract 717 seconds = 11 minutes from the previous start. This brings us to 16:52.

2016-08-25 12:21:59,909:DEBUG:140735159468032:About to execute query {'data.trip_id': ObjectId('57bf4550f6858f6e59c6799b'), 'user_id': UUID('6c8c2989-60a6-4f8c-a9d8-2f5f640c014b'), 'metadata.key': 'segmentation/raw_stop'} with sort_key data.enter_ts
2016-08-25 12:21:59,932:DEBUG:140735159468032:About to execute query {'data.trip_id': ObjectId('57bf4550f6858f6e59c6799b'), 'user_id': UUID('6c8c2989-60a6-4f8c-a9d8-2f5f640c014b'), 'metadata.key': 'segmentation/raw_section'} with sort_key data.start_ts
2016-08-25 12:21:59,951:DEBUG:140735159468032:keys = (segmentation/raw_stop, segmentation/raw_section), len(places) = 3, len(trips) = 4
2016-08-25 12:22:00,006:DEBUG:140735159468032:Found first section, need to extrapolate start point
2016-08-25 12:22:00,009:DEBUG:140735159468032:Start place is 57bf4550f6858f6e59c6799a and corresponding location is 55dfad567d65cb39ee986c5d
2016-08-25 12:22:00,037:DEBUG:140735159468032:Adding distance 4320.00318792 to original 31535.2776074 to extend section start from [-122.3446802, 37.57984] to [-122.3871046, 37.5993148]
2016-08-25 12:22:00,040:DEBUG:140735159468032:After subtracting time 717.167317901 from original 2345.21700001 to cover additional distance 4320.00318792 at speed 6.02370336753, new_start_ts = 1440719521.84

The only caveat with this is that the previous trip only ended at 16:54, so we have clearly overshot the mark! We need to ensure that the start time for a trip is after the end time for the previous trip + some small delta. The delta should be around 2.5 minutes since we detect a trip end after 5 mins of no motion, and use the median of the "trip end" points for the real trip end. With those changes, this will change to 4:57, which means that it will still be counted!

shankari commented 8 years ago

Note that this still has some anomalies. In particular, the trip from home to the optometrist is 700m while a trip from home to the bookstore, which is further, is only 315m. I suspect that this is an iOS to android difference in the first recorded point for the section.

Hm. Looking at the logs, it looks like the last two trips (to the bookstore) have median speed 0. Since the distance doesn't depend on speed, this should not affect the distance calculated. But the distance calculations are the ones that seem to be off between the two trips.

Let's finish investigating this. These are the two bookstore trips.

2016-08-24 22:14:12,163:DEBUG:Considering trip 57be7ea2f6858fffc9ba268d: 2016-02-22T19:41:42.085297-08:00 -> 2016-02-22T19:48:35.000033-08:00 
2016-08-24 22:14:12,163:DEBUG:Considering trip 57be7ea2f6858fffc9ba268f: 2016-02-22T19:58:50.000111-08:00 -> 2016-02-22T20:12:15.000095-08:00

Considering the first trip to the bookstore. There is only one section.

2016-08-24 22:14:13,289:DEBUG:About to execute query {'data.trip_id': ObjectId('57be7ea2f6858fffc9ba268d'), 'user_id': UUID('079e0f1a-c440-3d7c-b0e7-de160f748e35'), 'metadata.key': 'segmentation/raw_stop'} with sort_key data.enter_ts
2016-08-24 22:14:13,318:DEBUG:About to execute query {'data.trip_id': ObjectId('57be7ea2f6858fffc9ba268d'), 'user_id': UUID('079e0f1a-c440-3d7c-b0e7-de160f748e35'), 'metadata.key': 'segmentation/raw_section'} with sort_key data.start_ts
2016-08-24 22:14:13,361:DEBUG:keys = (segmentation/raw_stop, segmentation/raw_section), len(places) = 0, len(trips) = 1
2016-08-24 22:14:13,361:DEBUG:places_or_stops = []

Almost half the distance is not tracked.

2016-08-24 22:14:13,450:DEBUG:Adding distance 320.177270563 to original 422.168587301 to extend section start from [-122.0826288376279, 37.39067786774222] to [-122.0861964953338, 37.39118427507336]

But almost every alternate point has zero speed. So the median speed is zero and we end up skipping the addition.

2016-08-24 22:14:13,458:INFO:speeds for this section are 0     0.000000
1     0.000000
2     5.915331
3     0.000000
4     0.790988
5     0.000000
6     1.584267
7     0.000000
8     1.547820
9     0.000000
10    1.094819
11    0.000000
12    0.657234
13    0.000000
14    1.145274
15    0.000000
16    1.737425
17    0.000000
18    0.596737
19    0.000000
Name: speed, dtype: float64, median is 0.0, skipping

I have no idea why every alternate trip is zero. There is a distance filter on iOS, so this shouldn't happen. Are we loading duplicate entries or sth?

shankari commented 8 years ago

Yup! Duplicate entries

[(1456198902.085297, [-122.0826288376279, 37.39067786774222]),
 (1456198902.085297, [-122.0826288376279, 37.39067786774222]),
 (1456198905.00003, [-122.0828158223276, 37.39063345830645]),
 (1456198905.00003, [-122.0828158223276, 37.39063345830645]),
 (1456198969.000034, [-122.0823213738595, 37.39040337506439]),
 (1456198969.000034, [-122.0823213738595, 37.39040337506439]),
 (1456199001.000033, [-122.0821464435403, 37.39083759955819]),
 (1456199001.000033, [-122.0821464435403, 37.39083759955819]),
 (1456199034.000033, [-122.0819117502515, 37.39125740717854]),
 (1456199034.000033, [-122.0819117502515, 37.39125740717854]),
 (1456199080.000033, [-122.0815337264185, 37.39159641325231]),
 (1456199080.000033, [-122.0815337264185, 37.39159641325231]),
 (1456199158.000033, [-122.0809587278609, 37.39153438716884]),
 (1456199158.000033, [-122.0809587278609, 37.39153438716884]),
 (1456199202.000034, [-122.0804026724045, 37.39163538910206]),
 (1456199202.000034, [-122.0804026724045, 37.39163538910206]),
 (1456199231.000033, [-122.0802168456112, 37.39206378817315]),
 (1456199231.000033, [-122.0802168456112, 37.39206378817315]),
 (1456199315.000033, [-122.0797878598069, 37.39235883116479]),
 (1456199315.000033, [-122.0797878598069, 37.39235883116479])]

shankari commented 8 years ago

There are 4 entries for the timestamp in the exported file.

bash-3.2$ grep 1456198902.085297 emission/tests/data/real_examples/iphone_2016-02-22
            "ts": 1456198902.085297, 
            "ts": 1456198902.085297, 
            "ts": 1456198902.085297, 
            "ts": 1456198902.085297,

Two location and two filtered_location. All of them have the same timestamp for the location. The two locations and the two filtered_locations have slightly different write tmes, as would be consistent with processing the location and writing the filtered value. However, the write_ts for the two location entries and for the two filtered_location entries is the same. Let's check the ids now.

            "loc": {
                "type": "Point", 
                "coordinates": [
                    -122.0826288376279, 
                    37.39067786774222
                ]
            }, 
            "fmt_time": "2016-02-22T19:41:42.085297-08:00", 
            "ts": 1456198902.085297, 

        "metadata": {
            "write_fmt_time": "2016-02-22T19:41:43.073609-08:00", 
            "write_ts": 1456198903.073609, 
            "platform": "ios", 
            "key": "background/location",

            "loc": {
                "type": "Point", 
                "coordinates": [
                    -122.0826288376279, 
                    37.39067786774222
                ]
            }, 
            "fmt_time": "2016-02-22T19:41:42.085297-08:00", 
            "ts": 1456198902.085297, 

        "metadata": {
            "write_fmt_time": "2016-02-22T19:41:43.092138-08:00", 
            "write_ts": 1456198903.092138, 
            "platform": "ios", 
            "key": "background/filtered_location",

            "loc": {
                "type": "Point", 
                "coordinates": [
                    -122.0826288376279, 
                    37.39067786774222
                ]
            }, 
            "fmt_time": "2016-02-22T19:41:42.085297-08:00", 
            "ts": 1456198902.085297, 

        "metadata": {
            "write_fmt_time": "2016-02-22T19:41:43.073609-08:00", 
            "write_ts": 1456198903.073609, 
            "platform": "ios", 
            "key": "background/location",

            "loc": {
                "type": "Point", 
                "coordinates": [
                    -122.0826288376279, 
                    37.39067786774222
                ]
            }, 
            "fmt_time": "2016-02-22T19:41:42.085297-08:00", 
            "altitude": 29.63663482666016, 
            "floor": 0, 
            "vaccuracy": 19.94040498526955, 
            "ts": 1456198902.085297, 
            "longitude": -122.0826288376279, 

        "metadata": {
            "write_fmt_time": "2016-02-22T19:41:43.092138-08:00", 
            "write_ts": 1456198903.092138, 
            "key": "background/filtered_location"
        }
    },

ids are different

bash-3.2$ grep --after-context=20 1456198902.085297 emission/tests/data/real_examples/iphone_2016-02-22 | egrep "ts|oid" 
            "ts": 1456198902.085297, 
            "$oid": "56cbd94beaedff78c7636754"
            "ts": 1456198902.085297, 
            "$oid": "56cbd94beaedff78c7636755"
            "ts": 1456198902.085297, 
            "$oid": "56cbf570eaedff78c7636a47"
            "ts": 1456198902.085297, 
            "$oid": "56cbf570eaedff78c7636a48"

I wonder if this happened during some kind of error handling when we moved things back to the intake pipeline and re-ran...

shankari commented 8 years ago

Note that we must have received these filtered points from the client. Because when we generate filtered points on the server, we check for duplicates before inserting.

            for idx, entry in filtered_from_unfiltered_df.iterrows():
                # First, we check to see if this is a duplicate of an existing entry.
                # If so, we will skip it since it is probably generated as a duplicate...
                if check_prior_duplicate(filtered_from_unfiltered_df, idx, entry):
                    logging.info("Found duplicate entry at index %s, id = %s, lat = %s, lng = %s, skipping" %
                                    (idx, entry._id, entry.latitude, entry.longitude))
                    continue
                # Next, we check to see if there is an existing "background/filtered_location" point that corresponds
                # to this point. If there is, then we don't want to re-insert. This ensures that this step is idempotent
                if check_existing_filtered_location(timeseries, entry):
                    logging.info("Found existing filtered location for entry at index = %s, id = %s, ts = %s, fmt_time = %s, skipping" % (idx, entry._id, entry.ts, entry.fmt_time))
                    continue
                # logging.debug("Inserting %s filtered entry %s into timeseries" % (idx, entry))
                entry_copy = convert_to_filtered(timeseries.get_entry_at_ts(
                                                    "background/location",
                                                    "metadata.write_ts",
                                                    entry.metadata_write_ts))
                timeseries.insert(entry_copy)

shankari commented 8 years ago

for now, I am not going to mess with the raw data because I don't want to take the risk of deleting something erroneously. Instead, I am going to drop duplicates before computing speed and distance by using http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Example in the associated ipython notebook

shankari commented 8 years ago

Dropping duplicates was surprisingly hard. We wanted to drop duplicates before computing the new start, and we wanted to include the metadata.write_ts in the comparison, but we first converted from entry_list -> location_list -> loc_df.

This also meant that we used some kind of hardcoded conversion from loc objects -> loc_df instead of using the standard conversion, which meant that the standard drop_duplicates call was not invoked.

Instead of patching on another fix on this mess, I just switched everything to dataframes. We can now ignore entries by dropping columns from the dataframe directly instead of iterating over the entries again, and we add a row to the dataframe instead of creating a new location object and adding it to the list.

With these fixes, the distance to the bookstore is much closer to what it needs to be, but the trip back is still short.

Trip to	Trip from

shankari commented 8 years ago

Given the large number of blue lines in the trip from the bookstore above, I suspect that the issue with the trip back is

    # TODO: DO we need to add the stop distances too? They should be small...
    trip_distance = sum([section.data.distance for section in section_map.values()])

shankari commented 8 years ago

After fixing the first case, one other test also failed - the one from Jun 20th. The end time for the trip changed from "2016-06-20T08:41:43.762000-07:00" to "2016-06-20T08:41:15.856000-07:00".

Looking at the segmentation, we found the trip end as soon as we had enough points.

2016-08-25 21:27:57,744:DEBUG:140735159468032:------------------------------2016-06-20T08:43:22.355000-07:00------------------------------
2016-08-25 21:27:57,748:DEBUG:140735159468032:last5MinsDistances = [   9.27080686    4.06121534  123.63208568    4.20376239    0.0626857
  147.47460643  147.47460643] with length 7
2016-08-25 21:27:57,752:DEBUG:140735159468032:last10PointsDistances = [   9.26045939    4.59432788    9.27080686    4.06121534  123.63208568
    4.20376239    0.0626857   147.47460643  147.47460643    0.        ] with length 10, shape (10,)
2016-08-25 21:27:57,755:DEBUG:140735159468032:len(last10PointsDistances) = 10, len(last5MinsDistances) = 7
2016-08-25 21:27:57,755:DEBUG:140735159468032:last5MinsTimes.max() = 259.43900013, time_threshold = 300
2016-08-25 21:27:57,755:DEBUG:140735159468032:prev_point.ts = 1466437335.54, curr_point.ts = 1466437402.36, time gap = 66.8140001297 (vs 300), distance_gap = 147.47460643 (vs 100
), speed_gap = 2.20724108935 (vs 0.333333333333) continuing trip
2016-08-25 21:27:57,756:DEBUG:140735159468032:Too few points to make a decision, continuing
2016-08-25 21:27:57,756:DEBUG:140735159468032:------------------------------2016-06-20T08:44:22.417000-07:00------------------------------
2016-08-25 21:27:57,761:DEBUG:140735159468032:last5MinsDistances = [ 72.83090881  70.53045336  68.30626558  71.52389999  87.42540614
  87.42540614  71.53191086] with length 7
2016-08-25 21:27:57,767:DEBUG:140735159468032:last10PointsDistances = [ 72.00492986  72.87083898  72.83090881  70.53045336  68.30626558
  71.52389999  87.42540614  87.42540614  71.53191086   0.        ] with length 10, shape (10,)
2016-08-25 21:27:57,770:DEBUG:140735159468032:len(last10PointsDistances) = 10, len(last5MinsDistances) = 7
2016-08-25 21:27:57,770:DEBUG:140735159468032:last5MinsTimes.max() = 292.127000093, time_threshold = 300
2016-08-25 21:27:57,770:DEBUG:140735159468032:prev_point.ts = 1466437402.36, curr_point.ts = 1466437462.42, time gap = 60.0620000362 (vs 300), distance_gap = 71.531910859 (vs 100
), speed_gap = 1.19096784682 (vs 0.333333333333) continuing trip
2016-08-25 21:27:57,770:DEBUG:140735159468032:last5MinsDistances.max() = 87.4254061422, last10PointsDistance.max() = 87.4254061422
2016-08-25 21:27:57,771:DEBUG:140735159468032:last5MinsPoints and last10PointsMedian found, last_trip_end_index = 27
2016-08-25 21:27:57,772:DEBUG:140735159468032:Appending last_trip_end_point AttrDict({u'local_dt_hour': 8, u'local_dt_timezone': u'America/Los_Angeles', u'local_dt_second': 15, u'local_dt_year': 2016, u'local_dt_month': 6, u'local_dt_weekday': 0, u'elapsedRealtimeNanos': 686575066000000, u'loc': {u'type': u'Point', u'coordinates': [-122.0826931, 37.3914184]}, 'user_id': UUID('2997a8bd-e1b9-47f8-b9a5-531fb6f6e466'), u'altitude': 0.0, u'ts': 1466437275.8559999, u'latitude': 37.391418399999999, u'sensed_speed': 0.0, u'accuracy': 39.0, u'fmt_time': u'2016-06-20T08:41:15.856000-07:00', u'local_dt_minute': 41, 'metadata_write_ts': 1466437282.2969999, u'longitude': -122.0826931, u'filter': u'time', '_id': ObjectId('576810cb4a9bba50e3fa1daf'), u'heading': 0, u'local_dt_day': 20}) with index 27 
2016-08-25 21:27:57,772:INFO:140735159468032:Found trip end at 2016-06-20T08:41:15.856000-07:00

And the point right after the end was the current end

2016-08-25 21:27:57,694:DEBUG:140735159468032:------------------------------2016-06-20T08:41:15.856000-07:00------------------------------
2016-08-25 21:27:57,706:DEBUG:140735159468032:------------------------------2016-06-20T08:41:43.762000-07:00------------------------------

And we deduped a bunch of points

2016-08-25 21:27:57,268:DEBUG:140735159468032:curr_query = {'$or': [{'metadata.key': 'background/filtered_location'}], 'metadata.write_ts': {'$lte': 1472185672.217289}, 'user_id'
: UUID('2997a8bd-e1b9-47f8-b9a5-531fb6f6e466')}, sort_key = metadata.write_ts
2016-08-25 21:27:57,269:DEBUG:140735159468032:orig_ts_db_keys = ['background/filtered_location'], analysis_ts_db_keys = []
2016-08-25 21:27:57,269:DEBUG:140735159468032:finished querying values for ['background/filtered_location']
2016-08-25 21:27:57,269:DEBUG:140735159468032:finished querying values for []
2016-08-25 21:27:57,319:DEBUG:140735159468032:Found 256 results
2016-08-25 21:27:57,321:DEBUG:140735159468032:After de-duping, converted 256 points to 222

I am surprised that we had duplicates as recent as Jun, and from android. If we can very that there really are duplicates, then this seems legit.

shankari commented 8 years ago

wrt https://github.com/e-mission/e-mission-server/issues/288#issuecomment-242617244, yup! adding stop distances fixes it. Note that the stop distances still don't count for the mode. We may need to fix the segmentation on iOS to account for this better.

summary	detail

shankari commented 8 years ago

Even after fixing the over-extrapolation, if the extrapolation is a lot but the subsequent section is walk/bike, we can end up with super-long walk or bike trips.

Why do we have a 36km long walk section from 5:26pm to 6:10pm in a trip that ends at 6:17pm?

These are the raw trips, so must be extrapolation.

2016-09-09 11:40:48,659:DEBUG:Considering trip 57d2ff66f6858fba8c1fe955: 2016-08-26T15:46:46-07:00 -> 2016-08-26T17:21:11-07:00 
2016-09-09 11:40:48,660:DEBUG:Considering trip 57d2ff66f6858fba8c1fe957: 2016-08-26T18:01:04-07:00 -> 2016-08-26T18:17:07-07:00 
2016-09-09 11:40:48,660:DEBUG:Considering trip 57d2ff66f6858fba8c1fe959: 2016-08-26T18:26:45-07:00 -> 2016-08-26T18:27:42-07:00 
2016-09-09 11:40:48,660:DEBUG:Considering trip 57d2ff66f6858fba8c1fe95b: 2016-08-26T18:41:03-07:00 -> 2016-08-26T18:54:56.790000-07:00

Is this a legit trip, or is it missed data?

2016-09-09 11:58:30,840:DEBUG:Raw place is 57d2ff66f6858fba8c1fe956 and corresponding location is 57c0e5730a3eb8b130077249
2016-09-09 11:58:30,854:DEBUG:Adding distance 35674.1330809 to original 479.985531524 to extend section start from [-122.0762475, 37.3944149] to [-122.386858, 37.5998404]
2016-09-09 11:58:30,854:DEBUG:Trying to extrapolate 35674.1330809 > 2 * original 479.985531524, resetting add_dist = 959.971063048
2016-09-09 11:58:30,856:DEBUG:After subtracting time 2053.39006462 from original 539.0 to cover additional distance 959.971063048 at speed 0.467505458211, new_start_ts = 1472257610.61

So we clearly did move from Millbrae to Mountain View at a fairly fast speed, so it is legit to create a trip to join them. We just can't make it a walk trip.

2016-09-09 11:28:54,116:DEBUG:In range 1472257271 -> 1472259664 found transitions                            fmt_time  curr_state  transition
0  2016-08-26T17:24:29.220000-07:00           2           2
1  2016-08-26T17:39:02.244000-07:00           1           1
2016-09-09 11:28:54,120:DEBUG:while determining new_start_place, transition_distance = 35674.1330809
2016-09-09 11:28:54,120:DEBUG:while determining new_start_place, time_delta = 2393, transition_speed = 14.9077029172
2016-09-09 11:28:54,120:DEBUG:transition_speed 14.9077029172 > 0.000555555555556, returning False

2016-09-09 11:26:43,271:DEBUG:------------------------------2016-08-26T17:24:29-07:00------------------------------
2016-09-09 11:26:43,289:INFO:Found trip end at 2016-08-26T17:21:11-07:00
2016-09-09 11:26:43,289:DEBUG:------------------------------2016-08-26T18:01:04-07:00------------------------------
2016-09-09 11:26:43,291:DEBUG:Setting new trip start point AttrDict({}) with idx 99246
2016-09-09 11:26:43,298:DEBUG:Too few points to make a decision, continuing

I declare that this is too finicky to fix. To work around this, I will simply say that if we are extrapolating a lot for a non-motorized section, and the resulting speed ends up > 20kmph for walk or > 40 kmph for bike, then we change the mode to UNKNOWN.

Detail	List

shankari commented 8 years ago

Remaining enhancement tracked in #388