e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

User reports lots of spurious trips on iOS #704

Open shankari opened 2 years ago

shankari commented 2 years ago

"Thank you. Trips are showing a straight line across town. "

shankari commented 2 years ago

Hm, the user does not appear to have any transitions for the past week

>>> start_ts = arrow.get("2022-02-01").timestamp
>>> end_ts = arrow.get("2022-02-09").timestamp
>>> transition_df = ts.get_data_df("statemachine/transition", time_query=estt.TimeQuery("data.ts", startTs=start_ts, endTs=end_ts))

Returns an empty dataframe.

shankari commented 2 years ago

Searching backwards, we find that the last transition was from 2021-12-08T17:08:52.017765-07:00

shankari commented 2 years ago
confirmed_trip_df = ts.get_data_df("analysis/confirmed_trip", time_query=estt.TimeQuery("data.start_ts", startTs=start_ts, endTs=end_ts))
confirmed_trip_df.tail()

shows us that the last trip is indeed from

2022-02-08T17:05:22.999877-07:00

Need to investigate why we stopped getting transitions, and how our algorithm works when they are not present. This is likely the root cause.

shankari commented 2 years ago

Focusing on trips from the 7th of Feb, we see a clear spike at around 7k

image

which persists while zooming in

image

The durations for those trips also seem to be all over the map.

image

shankari commented 2 years ago

Doing an initial pass at classifying good vs. bad:

potential_bad_trips = feb_7_confirmed_trip_df[np.logical_and(feb_7_confirmed_trip_df.distance > 6500, feb_7_confirmed_trip_df.distance < 7500)]
potential_good_trips = feb_7_confirmed_trip_df[np.logical_or(feb_7_confirmed_trip_df.distance < 6500, feb_7_confirmed_trip_df.distance > 7500)]

And plotting the trips, they are indeed in a straight line across town (maps redacted for privacy reasons). Interestingly, when trying to plot the non-resampled locations, it looks like there are none.

Found 0 features from 0 points
Found 0 features from 0 points
Found 0 features from 0 points
...
Found 0 features from 0 points
Found 0 features from 0 points
Found 0 features from 0 points

Checking to see if this is a characteristic of all potential bad trips and of any potential good trips.
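The good/bad split above is just a distance-band filter around the ~7 km spike. A minimal, self-contained sketch of the same logic on made-up data (the 6500/7500 m thresholds come from the histogram in this thread; the DataFrame here is a toy stand-in, not the user's data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for feb_7_confirmed_trip_df; only the distance column matters here.
trips = pd.DataFrame({"distance": [500.0, 6800.0, 7100.0, 7400.0, 9000.0, 14000.0]})

# Trips whose length falls inside the ~7 km spike are flagged as potentially bad.
bad_mask = np.logical_and(trips.distance > 6500, trips.distance < 7500)
potential_bad = trips[bad_mask]
potential_good = trips[~bad_mask]

print(len(potential_bad), len(potential_good))  # 3 3
```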

shankari commented 2 years ago

There are no location points for the bad trips.

>>> pd.Series([len(ts.get_data_df("background/location", time_query = esdat.get_time_query_for_trip_like_object(ecwct.Confirmedtrip(t))))
    for t in potential_bad_trips.to_dict(orient="records")]).unique()
array([0])

There are no location points for the good trips as well.

>>> pd.Series([len(ts.get_data_df("background/location", time_query = esdat.get_time_query_for_trip_like_object(ecwct.Confirmedtrip(t))))
    for t in potential_good_trips.to_dict(orient="records")]).unique()
array([0])

There are apparently no location points for the entire month of Feb.

>>> ts.get_data_df("background/location", time_query=estt.TimeQuery("data.ts", startTs=start_ts, endTs=end_ts))
(empty dataframe)

The last location point was also from December, 2021-12-08. Wait: maybe we stopped storing the values after December because we hit the query limit.

shankari commented 2 years ago

It also turned out that we hadn't filtered for the 7th correctly. After fixing this, we now have:

>>> (len(potential_good_trips), len(potential_bad_trips))
(15, 27)

But every single trip seems to be a straight line, although they don't always have the same endpoints. The main difference between the "good" and "bad" trips seems to be that the endpoints sometimes double back.

But given that they are straight lines, the distance between the endpoints and the distance of the trip are likely to be the same. Let's see if that helps.
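The endpoint-to-endpoint ("O-D") distance can be computed with a standard haversine. The codebase has its own helper (`ecc.calDistance`, used later in this thread), so the function below is only an illustrative sketch:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points."""
    R = 6371000.0  # mean earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# For a round trip the start and end essentially coincide, so the O-D distance
# is tiny even though the traveled distance may be several kilometers.
print(haversine_m(-105.0, 40.0, -105.0001, 40.0001))  # ~14 m
```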

shankari commented 2 years ago

Ah, they are straight lines there and back. The actual O-D distance, even for the "bad" trips, is very small:

>>> potential_bad_trips[["distance", "od_distance"]]
distance od_distance
7073.937531 1.853653e+01
6868.309384 1.127387e+01
7100.056180 1.170578e+01
6710.867451 1.207159e+01
7123.187586 1.223378e+01
7079.736764 2.990119e+01
6735.314398 3.331056e+00
7096.913321 3.156922e+00
7119.615165 5.456963e-01

image

Unfortunately, that means that we can't actually use the O-D distance alone, since a small O-D distance could happen legitimately for a round trip. But maybe for this user, for the immediate use case, it can be a good check?

shankari commented 2 years ago

Looking at the potential good trips, we have

image

Zooming in on the trips with od_distance below 3000, they are all round trips

image

shankari commented 2 years ago

So if we categorize further:

>>> potential_bad_in_good = potential_good_trips[potential_good_trips.od_distance < 3000]
>>> potential_good_in_good = potential_good_trips[potential_good_trips.od_distance > 3000]
>>> len(potential_bad_trips), len(potential_bad_in_good), len(potential_good_in_good)
(27, 11, 4)

Visualizing those 4 trips, we get what appear to be one-way trips. But we can probably start with this for now and let the user manually mark the 4/(27+11+4) ≈ 10% of bad one-way trips.

Let's see how many trips from the beginning of Feb would be affected.

shankari commented 2 years ago
start_ld = ecwld.LocalDate(year=2022, month=2, day=1)
end_ld = ecwld.LocalDate(year=2022, month=2, day=28)
all_jan_feb_confirmed_trip_df = ts.get_data_df("analysis/confirmed_trip", time_query=esttc.TimeComponentQuery("data.start_local_dt", start_ld, end_ld))
all_jan_feb_confirmed_trip_df["od_distance"] = all_jan_feb_confirmed_trip_df.apply(lambda r: ecc.calDistance(r.start_loc["coordinates"], r.end_loc["coordinates"], coordinates=False), axis=1)
all_feb_potential_bad_trips = all_jan_feb_confirmed_trip_df[all_jan_feb_confirmed_trip_df.od_distance < 100]
len(all_feb_potential_bad_trips), len(all_jan_feb_confirmed_trip_df)
Result: (58, 75)

The majority are from the 6th, 7th, and 8th; one is from the 1st. A scatter plot shows vertical lines at various distances.

image

shankari commented 2 years ago

Durations range from 1000 secs (1000/60 ≈ 17 mins) to 4000 secs (4000/60 ≈ 67 mins ≈ 1.1 hours).

image

No clear signal in speeds either

image

shankari commented 2 years ago

To recap, at this point we have a pretty good check (od_distance < 100 m). Any false negatives (the trip was spurious but we didn't catch it) can be handled by the user; this would be a max of 17 trips. Any false positives might be a problem, and we might want to come up with an additional check. This is likely to involve the actual location points.

Let's plot the trip from the 1st, since it is the most likely false positive (if one exists). It is indeed a false positive.
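The combined check can be sketched on a toy DataFrame. The thresholds (od_distance < 100 m, the 7500 m distance band, 10 m/s speed) are the ones developed in this thread; the data below is made up:

```python
import pandas as pd

# Toy confirmed-trip table: od_distance is start-to-end distance in meters,
# distance is traveled distance in meters, duration in seconds.
trips = pd.DataFrame({
    "od_distance": [12.0, 50.0, 2500.0, 8.0],
    "distance":    [7100.0, 14000.0, 7000.0, 200.0],
    "duration":    [1500.0, 2300.0, 900.0, 400.0],
})
trips["mean_speed"] = trips.distance / trips.duration

# Flag trips whose endpoints nearly coincide; a long, slow trip
# (likely a genuine round trip) is excluded to limit false positives.
flagged = trips.query("od_distance < 100 and not (distance > 7500 and mean_speed < 10)")
print(len(flagged))  # 2
```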

shankari commented 2 years ago

Checking the other fields, it is a lot more than 7k in distance. Let's plot the other trips with > 7k in distance and see if they are spurious.

shankari commented 2 years ago

So there are 8 trips > 7k in distance

index  start_local_dt_month start_local_dt_day end_local_dt_month end_local_dt_day duration distance od_distance mean_speed
3 2 1 2 1 2308.166839 14184.026960 1.678697e+01 6.145148
11 2 6 2 6 1481.346913 14066.098733 5.955038e-03 9.495479
35 2 7 2 7 1571.072589 13915.663279 2.087384e-01 8.857429
37 2 7 2 7 1552.858794 12220.480719 4.796685e-01 7.869666
44 2 7 2 7 2387.434461 20542.246691 7.900875e-10 8.604319
47 2 7 2 7 1806.244204 13965.648219 4.733766e-01 7.731872
51 2 7 2 7 3133.604467 14063.391410 7.117096e+00 4.487928
71 2 8 2 8 4443.580903 31850.950049 2.713432e+00 7.167856

On mapping them, the first and last entries (3 and 71) are valid round trips. The others are not.

Plotting the various trip level metrics, we don't see a clear separation between valid and invalid.

image image

shankari commented 2 years ago

Re-exported data for only the year 2022.

We now see transitions, and all the transitions for the 7th seem to be visit-only, without a corresponding geofence exit. That might be a potential discriminant.

fmt_time transition_name state_name
2022-02-07T00:24:46.908138-07:00 TransitionType.NOP State.WAITING_FOR_TRIP_START
2022-02-07T00:24:46.916240-07:00 TransitionType.VISIT_ENDED State.WAITING_FOR_TRIP_START
2022-02-07T00:35:28.095417-07:00 TransitionType.VISIT_STARTED State.ONGOING_TRIP
-- -- --
2022-02-07T00:37:06.316291-07:00 TransitionType.VISIT_ENDED State.WAITING_FOR_TRIP_START
2022-02-07T00:37:08.928151-07:00 TransitionType.VISIT_STARTED State.ONGOING_TRIP
-- -- --
2022-02-07T00:50:28.613245-07:00 TransitionType.VISIT_ENDED State.WAITING_FOR_TRIP_START
2022-02-07T00:50:29.307538-07:00 TransitionType.VISIT_STARTED State.ONGOING_TRIP
-- -- --
2022-02-07T01:03:43.421707-07:00 TransitionType.VISIT_ENDED State.WAITING_FOR_TRIP_START
2022-02-07T01:04:02.212470-07:00 TransitionType.VISIT_STARTED State.ONGOING_TRIP
-- -- --
2022-02-07T01:28:44.901669-07:00 TransitionType.VISIT_ENDED State.WAITING_FOR_TRIP_START
2022-02-07T01:29:51.414304-07:00 TransitionType.VISIT_STARTED State.ONGOING_TRIP
-- -- --
shankari commented 2 years ago

Re-running the rest of the analysis, we now have 79 trips, so it looks like the issue resolved itself after the 8th?

start_ld = ecwld.LocalDate(year=2022, month=2, day=1)
end_ld = ecwld.LocalDate(year=2022, month=2, day=28)
all_jan_feb_confirmed_trip_df = ts.get_data_df("analysis/confirmed_trip", time_query=esttc.TimeComponentQuery("data.start_local_dt", start_ld, end_ld))
all_jan_feb_confirmed_trip_df["od_distance"] = all_jan_feb_confirmed_trip_df.apply(lambda r: ecc.calDistance(r.start_loc["coordinates"], r.end_loc["coordinates"], coordinates=False), axis=1)
all_feb_potential_bad_trips = all_jan_feb_confirmed_trip_df[all_jan_feb_confirmed_trip_df.od_distance < 100]
len(all_feb_potential_bad_trips), len(all_jan_feb_confirmed_trip_df)
Result: (58, 79)
start_local_dt_day end_local_dt_day
8 8
8 8
8 8
8 8
8 8
8 8
11 11
12 12
12 12
13 13

Looking at these last four trips, one has a clearly defined trajectory. The others are little groups of points, similar to some trips on the 8th.

But the number of locations seems like a potential discriminator.

pd.Series([len(ts.get_data_df("background/location", time_query = esdat.get_time_query_for_trip_like_object(ecwct.Confirmedtrip(t))))
    for t in all_feb_potential_bad_trips.to_dict(orient="records")]).unique()
Result: array([2227,   32,   24,   21,   41,   22,   15,   16,   55,   19,   12,
         20,   14,   34,   23,   28,   44,   67,   39,   69,   25,   35,
         49,   68,   38,   46,   75, 4032])

pd.Series([len(ts.get_data_df("background/filtered_location", time_query = esdat.get_time_query_for_trip_like_object(ecwct.Confirmedtrip(t))))
    for t in all_feb_potential_bad_trips.to_dict(orient="records")]).unique()
Result: array([2227,   15,    7,   25,   14,    8,   31,    5,   11,    9,   13,
         20,    6,   38,   39,   28,   27,   22,   16,   37,   34, 4032])

We still have the same potentially bad trips that are actually good. Plotting this, we get

pd.Series([len(ts.get_data_df("background/filtered_location", time_query = esdat.get_time_query_for_trip_like_object(ecwct.Confirmedtrip(t))))
    for t in all_feb_potential_bad_trips_actually_good.to_dict(orient="records")]).plot(kind="bar")

image

So it looks like that will work!

shankari commented 2 years ago

Double checking by mapping some known bad trips from the morning of the 7th

bad_plot_map

The last few trips on the 8th and later include one trip that looks like that, and others that just look like a cluster of points at the destination.

Screen Shot 2022-02-14 at 4 05 38 PM

So the big gap / sparse points seem like a good check, at least for this user at this time. We need to think about whether we want to incorporate it into the regular pipeline.

Double checking...

potential_bad_trips["n_locations"] = potential_bad_trips.apply(lambda t: len(ts.get_data_df("background/filtered_location",
                time_query = esdat.get_time_query_for_trip_like_object(ecwct.Confirmedtrip(t)))), axis=1)
potential_bad_trips.distance/potential_bad_trips.n_locations

1     1010.562504
2      490.593527
3     1014.293740
4      447.391163
5      890.398448
7     1415.947353
8      612.301309
9      788.545925
10     889.951896
11     532.434733
13     468.722233
14     890.146060
15     525.736338
16    1182.026649
17     507.363236
18    1153.445392
19     473.915800
20     996.316469
21     529.151662
24     879.441002
26    1013.337456
28     985.787600
30     550.428395
31    1011.751036
34     989.617100
36    1009.464400
37     460.970361
dtype: float64

And for the mixed dataset

all_feb_potential_bad_trips_actually_good["n_locations"] = all_feb_potential_bad_trips_actually_good.apply(lambda t: len(ts.get_data_df("background/filtered_location",
                time_query = esdat.get_time_query_for_trip_like_object(ecwct.Confirmedtrip(t)))), axis=1)
all_feb_potential_bad_trips_actually_good.distance / all_feb_potential_bad_trips_actually_good.n_locations

3       6.369119
11    562.643949
35    993.975949
37    321.591598
44    760.823952
47    634.802192
51    878.961963
71      7.899541
dtype: float64

Note that our filter distance is supposed to be 1 meter. https://github.com/e-mission/e-mission-data-collection/blob/master/src/ios/Wrapper/LocationTrackingConfig.m#L26

{'is_duty_cycling': True, 'filter_distance': 1, 'simulate_user_interaction': False, 'accuracy_threshold': 200, 'filter_time': -1, 'geofence_radius': 100, 'ios_use_visit_notifications_for_detection': True, 'ios_use_remote_push_for_sync': True, 'accuracy': 100, 'trip_end_stationary_mins': 10, 'android_geofence_responsiveness': -1}

So a possible threshold could be 100x that: a density of > 100 m per point.
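The density check above (traveled meters per recorded point) can be sketched like this; the numbers echo the ratios seen in this thread, but the DataFrame is a toy:

```python
import pandas as pd

# Toy trips: traveled distance (m) and the number of filtered location points
# recorded during the trip.
trips = pd.DataFrame({
    "distance":    [14184.0, 7100.0, 6900.0],
    "n_locations": [2227,    7,      13],
})

# With a 1 m filter distance, a real trip should log many points; spurious
# straight-line trips have only a handful, so meters-per-point is huge.
trips["loc_density"] = trips.distance / trips.n_locations
print((trips.loc_density > 100).tolist())  # [False, True, True]
```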

shankari commented 2 years ago

To summarize, our check for an "invalid trip" is:

1. od_distance < 100 (the trip starts and ends at essentially the same point),
2. excluding trips with distance > 7500 and
3. mean_speed < 10 (which are likely to be genuine round trips),
4. location density (distance / n_locations) > 100 m per point.

Let's see how many of these show up for this user overall

all_confirmed_trip_df = ts.get_data_df("analysis/confirmed_trip")
all_confirmed_trip_df["od_distance"] = all_confirmed_trip_df.apply(lambda r: ecc.calDistance(r.start_loc["coordinates"], r.end_loc["coordinates"], coordinates=False), axis=1)
all_confirmed_trip_df["mean_speed"] = all_confirmed_trip_df.distance / all_confirmed_trip_df.duration
first_three_checks_overall = all_confirmed_trip_df.query("od_distance < 100 and not (distance > 7500 and mean_speed < 10)")
len(first_three_checks_overall), len(all_confirmed_trip_df), (len(first_three_checks_overall)/len(all_confirmed_trip_df))
first_three_checks_overall[["start_local_dt_month", "start_local_dt_day", "end_local_dt_month", "end_local_dt_day", "od_distance", "distance", "mean_speed"]]
start_local_dt_month start_local_dt_day end_local_dt_month end_local_dt_day od_distance distance mean_speed
1 1 1 1 6.986333e+01 156.678443 0.327779
1 1 1 1 1.740171e+01 1084.129455 0.798355
1 5 1 5 8.995684e+01 363.770437 0.478790
1 5 1 5 4.107239e-02 1990.706388 1.316896
1 5 1 5 5.988431e+01 1132.902350 3.496865
1 18 1 18 6.652086e+01 1597.639593 2.206089
2 6 2 6 1.850196e-01 7103.825879 1.944089
2 6 2 6 2.342991e+00 7039.310841 1.698978
2 6 2 6 0.000000e+00 7112.693053 1.702703
2 7 2 7 1.853653e+01 7073.937531 11.576851
2 7 2 7 1.127387e+01 6868.309384 27.608938
2 7 2 7 1.170578e+01 7100.056180 5.205115
2 7 2 7 1.207159e+01 6710.867451 24.672423
2 7 2 7 1.223378e+01 7123.187586 12.798619
2 7 2 7 3.101800e+01 20965.968319 15.490884
2 7 2 7 2.990119e+01 7079.736764 11.872148
2 7 2 7 3.331056e+00 6735.314398 17.452288
2 7 2 7 3.156922e+00 7096.913321 11.676411
2 7 2 7 5.456963e-01 7119.615165 4.834291
2 7 2 7 1.297290e+01 6921.651532 27.754705
2 7 2 7 1.340543e+01 13834.872888 12.779537
2 7 2 7 8.192925e+00 7030.833489 10.756582
2 7 2 7 9.255376e+00 7121.168479 5.306681
2 7 2 7 9.398279e+00 6834.572396 18.271643
2 7 2 7 1.310545e+00 7092.159891 5.376771
2 7 2 7 6.030290e+00 7103.085299 17.716728
2 7 2 7 1.227530e+01 6920.672352 12.350568
2 7 2 7 5.145222e-01 7108.737007 8.657235
2 7 2 7 1.862717e+00 6974.215285 17.912286
2 7 2 7 8.986348e+00 6878.971601 18.440336
2 7 2 7 2.661333e-09 7035.528018 11.337731
2 7 2 7 4.796685e-01 7093.362191 2.808319
2 7 2 7 1.556720e+00 21974.470798 15.933950
2 7 2 7 1.363708e+00 6900.513198 30.312781
2 7 2 7 6.431294e-01 13970.753169 11.931122
2 7 2 7 8.361405e-01 7155.569139 4.667499
2 7 2 7 9.374204e-07 7082.257252 11.380397
2 7 2 7 8.591278e-01 6312.254022 13.618016
2 7 2 7 1.332504e+00 6927.319698 6.867789
2 7 2 7 1.445264e-09 7066.250800 4.685370
2 7 2 7 2.455396e+01 6914.555419 39.392666
2 7 2 7 8.173925e+00 21090.861869 13.985404
2 8 2 8 2.209519e-01 7153.275185 15.427448
2 8 2 8 3.200963e+00 6740.749776 14.768179
2 8 2 8 8.386498e+00 6807.663936 16.221843
2 8 2 8 1.137082e+01 7067.945941 1.764125
2 8 2 8 1.990353e-09 7089.220503 1.893732
2 8 2 8 2.814614e+01 7107.139223 12.469921
2 8 2 8 2.996777e+01 7136.545569 4.744430
2 8 2 8 1.526732e+01 7088.501464 19.577850
2 8 2 8 1.440534e+01 7028.174815 4.781001
2 8 2 8 5.645250e-01 14152.376098 10.298519
2 8 2 8 6.583064e-08 20947.048458 11.481887
2 8 2 8 7.913714e-08 7017.704820 8.595545
2 8 2 8 1.583436e-04 19993.302120 17.437462
2 8 2 8 3.959867e-04 12421.697166 13.026074
shankari commented 2 years ago

Recomputing in a different way, we get the same result:

first_check_overall = all_confirmed_trip_df.query("od_distance < 100")
next_two_checks_good = first_check_overall.query("distance > 7500 and mean_speed < 10")
next_two_checks_good[["start_local_dt_month", "start_local_dt_day", "end_local_dt_month", "end_local_dt_day", "od_distance", "distance", "mean_speed"]]
start_local_dt_month start_local_dt_day end_local_dt_month end_local_dt_day od_distance distance mean_speed
1 4 1 4 5.124098e+01 11516.539891 8.191983
2 1 2 1 1.678697e+01 14184.026960 6.145148
2 6 2 6 5.955038e-03 14066.098733 9.495479
2 7 2 7 2.087384e-01 13915.663279 8.857429
2 7 2 7 4.796685e-01 12220.480719 7.869666
2 7 2 7 7.900875e-10 20542.246691 8.604319
2 7 2 7 4.733766e-01 13965.648219 7.731872
2 7 2 7 7.117096e+00 14063.391410 4.487928
2 8 2 8 2.713432e+00 31850.950049 7.167856
first_three_checks_overall = first_check_overall[np.logical_not(np.logical_and(first_check_overall.distance > 7500, first_check_overall.mean_speed < 10))]
len(first_check_overall), len(next_two_checks_good), len(first_three_checks_overall), len(all_confirmed_trip_df), (len(first_three_checks_overall)/len(all_confirmed_trip_df))

(65, 9, 56, 140, 0.4)

Visualizing the maps before 6th Feb, we get a bunch of valid trips. We need to add the density check as well.
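The mask-based recomputation above should match the `.query()` string exactly (it is just De Morgan's law written out with numpy). A tiny self-contained check on toy data:

```python
import numpy as np
import pandas as pd

# Toy data with one row that trips each branch of the condition.
df = pd.DataFrame({
    "od_distance": [10.0, 20.0, 500.0],
    "distance":    [8000.0, 7000.0, 9000.0],
    "mean_speed":  [5.0, 12.0, 8.0],
})

first_check = df.query("od_distance < 100")
# "not (A and B)" via .query() vs. an explicit numpy mask should agree.
via_query = first_check.query("not (distance > 7500 and mean_speed < 10)")
via_mask = first_check[np.logical_not(
    np.logical_and(first_check.distance > 7500, first_check.mean_speed < 10))]
print(via_query.equals(via_mask))  # True
```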

shankari commented 2 years ago

After adding the density check, it looks good.

first_three_checks_overall["n_locations"] = first_three_checks_overall.apply(lambda t: len(ts.get_data_df("background/filtered_location",
                time_query = esdat.get_time_query_for_trip_like_object(ecwct.Confirmedtrip(t)))), axis=1)
first_three_checks_overall["loc_density"] = first_three_checks_overall.distance / first_three_checks_overall.n_locations
all_four_checks = first_three_checks_overall[first_three_checks_overall.loc_density > 100]
len(first_check_overall), len(next_two_checks_good), len(first_three_checks_overall), len(all_four_checks), len(all_confirmed_trip_df), (len(first_three_checks_overall)/len(all_confirmed_trip_df)), (len(all_four_checks)/len(all_confirmed_trip_df))

Result: (65, 9, 56, 50, 140, 0.4, 0.35714285714285715)
start_local_dt_month start_local_dt_day end_local_dt_month end_local_dt_day od_distance distance mean_speed
2 6 2 6 1.850196e-01 7103.825879 1.944089
2 6 2 6 2.342991e+00 7039.310841 1.698978
2 6 2 6 0.000000e+00 7112.693053 1.702703
2 7 2 7 1.853653e+01 7073.937531 11.576851
2 7 2 7 1.127387e+01 6868.309384 27.608938
2 7 2 7 1.170578e+01 7100.056180 5.205115
2 7 2 7 1.207159e+01 6710.867451 24.672423
2 7 2 7 1.223378e+01 7123.187586 12.798619
2 7 2 7 3.101800e+01 20965.968319 15.490884
2 7 2 7 2.990119e+01 7079.736764 11.872148
2 7 2 7 3.331056e+00 6735.314398 17.452288
2 7 2 7 3.156922e+00 7096.913321 11.676411
2 7 2 7 5.456963e-01 7119.615165 4.834291
2 7 2 7 1.297290e+01 6921.651532 27.754705
2 7 2 7 1.340543e+01 13834.872888 12.779537
2 7 2 7 8.192925e+00 7030.833489 10.756582
2 7 2 7 9.255376e+00 7121.168479 5.306681
2 7 2 7 9.398279e+00 6834.572396 18.271643
2 7 2 7 1.310545e+00 7092.159891 5.376771
2 7 2 7 6.030290e+00 7103.085299 17.716728
2 7 2 7 1.227530e+01 6920.672352 12.350568
2 7 2 7 5.145222e-01 7108.737007 8.657235
2 7 2 7 1.862717e+00 6974.215285 17.912286
2 7 2 7 8.986348e+00 6878.971601 18.440336
2 7 2 7 2.661333e-09 7035.528018 11.337731
2 7 2 7 4.796685e-01 7093.362191 2.808319
2 7 2 7 1.556720e+00 21974.470798 15.933950
2 7 2 7 1.363708e+00 6900.513198 30.312781
2 7 2 7 6.431294e-01 13970.753169 11.931122
2 7 2 7 8.361405e-01 7155.569139 4.667499
2 7 2 7 9.374204e-07 7082.257252 11.380397
2 7 2 7 8.591278e-01 6312.254022 13.618016
2 7 2 7 1.332504e+00 6927.319698 6.867789
2 7 2 7 1.445264e-09 7066.250800 4.685370
2 7 2 7 2.455396e+01 6914.555419 39.392666
2 7 2 7 8.173925e+00 21090.861869 13.985404
2 8 2 8 2.209519e-01 7153.275185 15.427448
2 8 2 8 3.200963e+00 6740.749776 14.768179
2 8 2 8 8.386498e+00 6807.663936 16.221843
2 8 2 8 1.137082e+01 7067.945941 1.764125
2 8 2 8 1.990353e-09 7089.220503 1.893732
2 8 2 8 2.814614e+01 7107.139223 12.469921
2 8 2 8 2.996777e+01 7136.545569 4.744430
2 8 2 8 1.526732e+01 7088.501464 19.577850
2 8 2 8 1.440534e+01 7028.174815 4.781001
2 8 2 8 5.645250e-01 14152.376098 10.298519
2 8 2 8 6.583064e-08 20947.048458 11.481887
2 8 2 8 7.913714e-08 7017.704820 8.595545
2 8 2 8 1.583436e-04 19993.302120 17.437462
2 8 2 8 3.959867e-04 12421.697166 13.026074
shankari commented 2 years ago

Note also that the user does not have any motion activity data.

$ zgrep background /tmp/tmp/emission_ind_....gz | sort | uniq
            "key": "background/battery",
            "key": "background/filtered_location",
            "key": "background/location",

I wonder if that is the reason why our spurious trip detection code is not catching this automatically.

shankari commented 2 years ago

If we were incorporating this into the pipeline, we would reset the pipeline to before the 6th and then re-run. Since we are not planning to do that right now, we will instead insert the mode_confirm, purpose_confirm, ... objects through a script. Then when we run the pipeline again, the input matching will find the corresponding match.
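A purely illustrative sketch of what the inserted entries might look like. The `manual/mode_confirm` key names and the `start_ts`/`end_ts`/`label` fields are assumptions based on the labels quoted in this thread, not a verified schema; the real script must match the server's storage format:

```python
# Hypothetical shape only: key names and data fields are assumptions,
# not the confirmed e-mission user-input schema.
def make_user_inputs(trip):
    """Build the three label entries for one spurious trip (illustrative)."""
    common = {"start_ts": trip["start_ts"], "end_ts": trip["end_ts"]}
    return [
        {"key": "manual/mode_confirm",    "data": dict(common, label="not_a_trip")},
        {"key": "manual/purpose_confirm", "data": dict(common, label="not_a_trip")},
        {"key": "manual/replaced_mode",   "data": dict(common, label="not_a_trip")},
    ]

entries = make_user_inputs({"start_ts": 1644249600, "end_ts": 1644251400})
print([e["key"] for e in entries])
```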

Before inserting the entries, the user inputs are as below. It looks like the user confirmed several trips before stopping. Need to confirm with them if they are indeed inaccurate.

mode_confirm purpose_confirm replaced_mode
error not_accurate not_accurate
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike personal_med drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
pilot_ebike work drove_alone
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
not_a_trip error not_accurate
shankari commented 2 years ago

After manually inserting entries on a copy of the database and then re-running the pipeline, we get

mode_confirm purpose_confirm replaced_mode
not_a_trip not_a_trip not_accurate
not_a_trip not_a_trip drove_alone
not_a_trip not_a_trip drove_alone
not_a_trip not_a_trip drove_alone
not_a_trip not_a_trip drove_alone
not_a_trip not_a_trip drove_alone
not_a_trip not_a_trip drove_alone

After configuring the analysis pipeline to include the replaced mode, we get

mode_confirm purpose_confirm replaced_mode
not_a_trip not_a_trip not_a_trip
not_a_trip not_a_trip not_a_trip
not_a_trip not_a_trip not_a_trip
not_a_trip not_a_trip not_a_trip
not_a_trip not_a_trip not_a_trip
not_a_trip not_a_trip not_a_trip

We are now ready to change this on the production server once we get confirmation from the user that the trips are in fact spurious.