Feat/Segment trip time page

TTalex commented 1 year ago

Hey,

This feature adds a new page helping users compute average trip duration between two selected points.

I wanted to experiment with a simple level of service indicator based on e-mission data.

The use case comes from feedback from a local authority in France. They expressed a need for alternative ways of gathering travel time information that doesn't rely on buying data from historic actors.

I do believe that e-mission can be pretty good for this, since trip completeness isn't required to compute average durations. We can get pretty good results with low amounts of data.

The initial idea sparked a discussion around map matching. While results would probably be more accurate with map matching, simple curve-fitting, as done when creating analysis/recreated_locations entries, seems to already be performing well enough.

User interaction

Here is a quick demo of the new page (the database only has one user data, mine). segment_trip_time_demo

The user is asked to

Select a range around start and end points
Select a start point using the left map
Select an end point using the right map

Using this information, queries are made to fetch recreated_locations matching either points. Resulting data is then displayed.

The proof of concept includes the following tables:

Median segment duration by mode of transport : This is the average time taken from start to end, divided by mode of transportation
Median segment duration by mode and hour of the day (UTC): Same as before, also divided by hour. This helps to identify peak congestion times (usually during morning and evening commute for drivers)
Median segment duration by hour of the day (UTC): Same, without mode
Median segment duration by mode and month: This is especially helpful to measure the impact of holidays
Full trip data: Allows the user to build her own tables and visualizations

More useful stats could be added in the future, for example separating between weekdays and weekends.

A commonly used statistic is average vehicle speed. This requires information on true distance travelled. However, distance is complex to compute with recreated_locations, since we only fetch start and end points, we lose the distance information in intermediary points. This would require further queries to sum all intermediary points distances. The same is true for speeds. This is not complex, but could be heavy on the database / memory. In reality, the distance is most likely already known by the user, at least for the "usual" path.

New project requirements

I have added the dash_leaflet library because built-in mapbox plots aren't great:

Click events only work on markers, you cannot detect clicks on an empty map
Drawing circles with a range in meters is not easy. Markers circles are in pixels and ignore zoom, so the solution is to rebuild a polyline with 360 points making a circle... where leaflet has a built-in Circle method

Permissions

Three configurations are linked to this new page:

segment_trip_time: User can view this page. (default true)
segment_trip_time_full_trips: User can see the table containing non-aggregated data (default true)
segment_trip_time_min_users: Minimal number of distinct users in data required to display anything (value is a number, default 0). This parameter should help with guaranteeing some kind of anonymity, otherwise a user could target a specific house as a start point, and leak personal travel data that way.

Dev notes

Up until now, the admin-dashboard was querying data once at startup. This PR behaves differently, with database queries on user actions. The code includes a few comments on why this was done this way, and the performances implications.

Hope this is useful for someone else :)

shankari commented 1 year ago

@TTalex I think that this is a great feature, and super well documented! We just deployed a release, but are planning a new one in a couple of weeks for the API upgrade, and it would be great to include this as well. I'll review and merge this weekend.

shankari commented 1 year ago

FYI, map matching has been postponed in favor of helping finish up "count every trip"

TTalex commented 1 year ago

Quoting myself from the first message of this PR:

A commonly used statistic is average vehicle speed. This requires information on true distance travelled. However, distance is complex to compute with recreated_locations, since we only fetch start and end points, we lose the distance information in intermediary points. This would require further queries to sum all intermediary points distances. The same is true for speeds. This is not complex, but could be heavy on the database / memory. In reality, the distance is most likely already known by the user, at least for the "usual" path.

This could be solved by having a total_distance_from_start field in the recreated location. I wonder if this would be an interesting feature for other use cases ?

I believe it could be implemented in add_dist_heading_speed https://github.com/e-mission/e-mission-server/blob/f78a22b7735e3877f31b29a4f7029dbd182416d4/emission/analysis/intake/cleaning/location_smoothing.py#L71 as follows:

+ import itertools
# [...]
def add_dist_heading_speed(points_df):
    # type: (pandas.DataFrame) -> pandas.DataFrame
    """
    Returns a new dataframe with an added "speed" column.
    The speed column has the speed between each point and its previous point.
    The first row has a speed of zero.
    """
    point_list = [ad.AttrDict(row) for row in points_df.to_dict('records')]
    zipped_points_list = list(zip(point_list, point_list[1:]))

    distances = [pf.calDistance(p1, p2) for (p1, p2) in zipped_points_list]
    distances.insert(0, 0)
+   distances_from_start = list(itertools.accumulate(distances))
    speeds = [pf.calSpeed(p1, p2) for (p1, p2) in zipped_points_list]
    speeds.insert(0, 0)
    headings = [pf.calHeading(p1, p2) for (p1, p2) in zipped_points_list]
    headings.insert(0, 0)

    with_distances_df = pd.concat([points_df, pd.Series(distances, name="distance")], axis=1)
+   with_distances_from_start_df = pd.concat([with_distances_df, pd.Series(distances_from_start, name="distance_from_start")], axis=1)
-   with_speeds_df = pd.concat([with_distances_df, pd.Series(speeds, name="speed")], axis=1)
+   with_speeds_df = pd.concat([with_distances_from_start_df, pd.Series(speeds, name="speed")], axis=1)
    if "heading" in with_speeds_df.columns:
        with_speeds_df.drop("heading", axis=1, inplace=True)
    with_headings_df = pd.concat([with_speeds_df, pd.Series(headings, name="heading")], axis=1)
    return with_headings_df

Recreated locations would then look like the following examples: {metadata: {key: "analysis/recreated_location", [...]}, data: {idx: 0, distance: 0, distance_from_start:0, [...]}} {metadata: {key: "analysis/recreated_location", [...]}, data: {idx: 1, distance: 105, distance_from_start:105, [...]}} {metadata: {key: "analysis/recreated_location", [...]}, data: {idx: 2, distance: 65, distance_from_start:170, [...]}} {metadata: {key: "analysis/recreated_location", [...]}, data: {idx: 3, distance: 100, distance_from_start:270, [...]}}

Computing distance from the second point (idx 1) to the last one (idx 3) would only require both points, skipping fetching idx 2: 270-105=165

Maybe it's a bit too specific for this use case to induce a change to the Location model. (and might require a patch on existing database entries for consistency 😕)

shankari commented 11 months ago

@TTalex we have created an interface for the cleaned2inferredsections mapping. Can you switch to it? LMK if you are too busy, and one of us can handle it. https://github.com/e-mission/e-mission-server/pull/937

We can then change the implementation at will depending on the scalability vs. reuse tradeoff.

shankari commented 11 months ago

This could be solved by having a total_distance_from_start field in the recreated location. I wonder if this would be an interesting feature for other use cases ?

This is an interesting thought. Adding new entries to the data model and patching existing entries is work but fairly straightforward conceptually. If we can come up with a second use case that needs this functionality (maybe map matching), I am happy to include it. Not sure if we want to do a one-off change before that though...

TTalex commented 11 months ago

@TTalex we have created an interface for the cleaned2inferredsections mapping. Can you switch to it? LMK if you are too busy, and one of us can handle it. e-mission/e-mission-server#937

Sweet, thanks, I've made the swap.

TTalex commented 11 months ago

This is an interesting thought. Adding new entries to the data model and patching existing entries is work but fairly straightforward conceptually. If we can come up with a second use case that needs this functionality (maybe map matching), I am happy to include it. Not sure if we want to do a one-off change before that though...

I wouldn't change it either if I were you, that's why I didn't bother doing a PR :)

There might be some use cases in end users UI for it, where the change would induce slight performance improvements. For example, with point by point visualization such as this one:

Peek 2023-10-02 11-20

But I'm not confident it would be an improvement at all, since the full list of points has to be loaded anyway.

shankari commented 10 months ago

I tried testing it on some real-world data in Denver and only got 3 trips, which seems a bit low. Going to try it against my own data...

shankari commented 10 months ago

While testing against my own data, ran into the following error

Traceback (most recent call last):
  File "/usr/src/app/pages/segment_trip_time.py", line 181, in generate_content_on_endpoints_change
    mode_by_section_id = db_utils.query_inferred_sections_modes(
  File "/usr/src/app/utils/db_utils.py", line 247, in query_inferred_sections_modes
    return esds.cleaned2inferred_section_list(sections)
  File "/usr/src/app/emission/storage/decorations/section_queries.py", line 51, in cleaned2inferred_section_list
    matching_inferred_section = cleaned2inferred_section(section_userid.get('user_id'), section_userid.get('section'))
  File "/usr/src/app/emission/storage/decorations/section_queries.py", line 45, in cleaned2inferred_section
    curr_predicted_entry = _get_inference_entry_for_section(user_id, section_id, "analysis/inferred_section", "data.cleaned_section")
  File "/usr/src/app/emission/storage/decorations/section_queries.py", line 66, in _get_inference_entry_for_section
    assert len(ret_list) <= 1, "Found len(ret_list) = %d, expected <=1" % len(ret_list)
AssertionError: Found len(ret_list) = 807, expected <=1

shankari commented 10 months ago

Looking at the logs, we have

op-admin-dash-dashboard-1  | DEBUG:root:About to query {'metadata.key': 'analysis/inferred_section', 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), 'data.cleaned_section': ObjectId('643874ce88f9b4eda2beca67')}
op-admin-dash-dashboard-1  | DEBUG:root:About to query {'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('644e58ecb14cecd84298aae4')}
op-admin-dash-dashboard-1  | DEBUG:root:Found no inferred prediction, returning None
op-admin-dash-dashboard-1  | DEBUG:root:About to query {'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('64532ffeedc48e75d0268b04')}
op-admin-dash-dashboard-1  | DEBUG:root:Found no inferred prediction, returning None
op-admin-dash-dashboard-1  | DEBUG:root:About to query {'metadata.key': 'analysis/inferred_section', 'user_id': UUID('16c2d3cd-6d62-42dc-98df-6d927cd9a3c8'), 'data.cleaned_section': ObjectId('62db2032a6977e4c0214befe')}
op-admin-dash-dashboard-1  | DEBUG:root:About to query {'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('644df8edea199f1d0473e301')}

so it must be one of these sections

shankari commented 10 months ago

Bingo!

# ./e-mission-py.bash
Python 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32)
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import emission.core.get_database as edb
Connecting to database URL mongodb://db/openpath_stage
>>> from uuid import UUID
>>> from bson.objectid import ObjectId
>>> edb.get_analysis_timeseries_db().find({'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('644df8edea199f1d0473e301')})
<pymongo.cursor.Cursor object at 0x7f1e66313970>
>>> edb.get_analysis_timeseries_db().count_documents({'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('644df8edea199f1d0473e301')})
807
>>> edb.get_analysis_timeseries_db().count_documents({'metadata.key': 'analysis/inferred_section', 'user_id': UUID('16c2d3cd-6d62-42dc-98df-6d927cd9a3c8'), 'data.cleaned_section': ObjectId('62db2032a6977e4c0214befe')})
1

This is almost certainly due to https://github.com/e-mission/e-mission-docs/issues/927#issuecomment-1606541814

My current guess is that it might be due to multiple calls to overpass failing, but in that previous issue, I see:

That seemed weird, since there did not appear to be any errors while generating the mode inference.

shankari commented 10 months ago

It's really weird that there are still exactly 807 matching entries. Maybe we can spend a little time today to investigate (at least on the side)

shankari commented 10 months ago

After resetting the pipeline for that user, we get 30 trips. Which is not that much but is at least greater than 3. I am going to merge this for now, but also poke around a bit to see if this is actually correct.

shankari commented 10 months ago

After resetting and re-running the pipeline, we get 14k points at the start and 824 points at the end. But only 31 overlaps. I bet it is because of shorter segments. Let's try with a smaller segment of road so that it is more likely to be in the same section.

op-admin-dash-dashboard-1  | DEBUG:root:Found 14390 results
op-admin-dash-dashboard-1  | DEBUG:root:After de-duping, converted 14390 points to 14390

op-admin-dash-dashboard-1  | DEBUG:root:Found 824 results
op-admin-dash-dashboard-1  | DEBUG:root:After de-duping, converted 824 points to 824

shankari commented 10 months ago

As expected, there are more trips for a short segment although still not as many as we would like.

shankari commented 10 months ago

Ran into the duplicate entries for another user https://github.com/e-mission/e-mission-docs/issues/927#issuecomment-1784168435

We might want to write a check for this and run it on production before pushing it out.

Let's switch to open access for a bit to see if things are better.

achasmita commented 10 months ago

I got more trips when I selected bigger zone:

![Screen Shot 2023-11-08 at 3 35 44 PM](https://github.com/e-mission/op-admin-dashboard/assets/79387860/a97187b5-405d-43a3-b25b-d570392ba4e6) ![Screen Shot 2023-11-08 at 3 36 00 PM](https://github.com/e-mission/op-admin-dashboard/assets/79387860/6b8b88b5-3295-4d06-a1c2-19d16c8ecfc6) ![Screen Shot 2023-11-08 at 3 36 14 PM](https://github.com/e-mission/op-admin-dashboard/assets/79387860/cae206b6-2b3a-49be-9bf7-43c555e95c50)

shankari commented 10 months ago

@achasmita I don't see how big these zones are. How did you pick them? How do you validate that the number of trips is correct?

achasmita commented 9 months ago

@achasmita I don't see how big these zones are. How did you pick them? How do you validate that the number of trips is correct?

While selecting zone, I checked trip table to find the best areas based on latitude and longitude that had the most trips:

![Screen Shot 2023-11-18 at 9 02 05 PM](https://github.com/e-mission/op-admin-dashboard/assets/79387860/f979eab6-36c2-47e1-a316-35ba5c86d72c) ![Screen Shot 2023-11-18 at 9 21 20 PM](https://github.com/e-mission/op-admin-dashboard/assets/79387860/506eb91c-b650-4af1-859a-ff99a116cfde) ![Screen Shot 2023-11-18 at 9 21 35 PM](https://github.com/e-mission/op-admin-dashboard/assets/79387860/92737371-46aa-4792-b89c-5b8ecc4b6c33) > I also observed locs_matching_start amd locs_matching_end data before removing duplicates and after removing duplicates and once they are filtered: >> locs_matching_start before and after removing duplicates (left) and locs_matching_end before and after removing duplicates (right) ![image](https://github.com/e-mission/op-admin-dashboard/assets/79387860/3d5b7a53-da2c-44a5-ab96-7400edd2ea6c) - After selecting the start and end zone, duplicate sections are removed. It is keeping first occured section and removing the other occurences. >>data after merged and filtered (After removing duplicates and after filtering data for which [merged['idx_x'] < merged['idx_y']]) ![image](https://github.com/e-mission/op-admin-dashboard/assets/79387860/36af865d-9292-40d9-8d59-a54da7ffdede) After observing data what I can see now is: - Also if there is an overlap it will not include those section as both section will have same idx. - And manually verified this result: ![Screen Shot 2023-11-19 at 5 06 40 PM](https://github.com/e-mission/op-admin-dashboard/assets/79387860/9b4e510f-9430-4381-9e6e-9d1fd76147f5)

achasmita commented 9 months ago

I plotted some co-ordinates (25-30) from trip table manually by checking it on google map and it was giving the correct result for number of trip and I verified ObjectId and UserId.

Expanded version, dictated to @shankari

I picked 25-30 trips from the trip table that had similar start and end locations
I plotted these on Google Maps so visualize their spatial location
Then, I determined, using the visualization, what the polygons that cover them are
then, I drew those polygons on the trip time table
Then, I saw the same number of trips and they had the same ObjectId and UserId

This shows us that the trips that are within the start and end polygons are shown in the trip time table. It does not show us that the trip time table shows all trips with that start and end polygon.

Concretely, if the trips that you found were $t_1, t2,....t{30}$ and there were 5 other trips $t{31}...t{35}$ that were in the same polygons, but you didn't spot check them because they were on page 25, then you don't know that they were excluded.

My concern is not that the trip time table is inaccurate, but that it is incomplete.

achasmita commented 9 months ago

I tried exploring start and end zone with different segment size: -I was getting very few trips

I also explored data in both zone(start/end)
I was not able to figure out anything new.

shankari commented 9 months ago

@achasmita Thank you for adding additional examples with the results of your investigation. I have some more questions.

You said that you picked ~ 25-30 trips, but there are actually 105 trips displayed above. Can you explain this discrepancy?
I don't understand the follow-on.

locs_matching_start before and after removing duplicates (left) and locs_matching_end before and after removing duplicates (right) After selecting the start and end zone, duplicate sections are removed. It is keeping first occured section and removing the other occurences. But we first select the start and end zones - that is what allows us to find the locs_matching_start and locs_matching_end, right?

Can you also expand on what you did in "I also explored data in both zone(start/end)"? How did you explore the data, and what were the results?

achasmita commented 9 months ago

@achasmita Thank you for adding additional examples with the results of your investigation. I have some more questions.

You said that you picked ~ 25-30 trips, but there are actually 105 trips displayed above. Can you explain this discrepancy?

I don't understand the follow-on.

locs_matching_start before and after removing duplicates (left) and locs_matching_end before and after removing duplicates (right) After selecting the start and end zone, duplicate sections are removed. It is keeping first occured section and removing the other occurences. But we first select the start and end zones - that is what allows us to find the locs_matching_start and locs_matching_end, right?

Can you also expand on what you did in "I also explored data in both zone(start/end)"? How did you explore the data, and what were the results?

The above screeshot was just to make sure if i am selecting correct size of zone, I will find the other screenshot for 25-30 trips and post it soon.

For data, I printed tops 50 and bottom 50 data after removing duplicates and compared it with data in trip table to see if i can figure out if any data is missing.

shankari commented 9 months ago

@achasmita

For data, I printed tops 50 and bottom 50 data after removing duplicates and compared it with data in trip table to see if i can figure out if any data is missing.

Can you expand on this?

achasmita commented 9 months ago

@achasmita

For data, I printed tops 50 and bottom 50 data after removing duplicates and compared it with data in trip table to see if i can figure out if any data is missing.

Can you expand on this?

I printed data on locs_matching_start once the duplicates are excluded
I compared the result with data on trip table
As I selected start zone considering start coordinates (data.start_loc_coordinates) in trip table.

shankari commented 7 months ago

My main concern with this is we were getting very few trips displayed for basically any start/end combo.

I want to see "large number" of trips OR
I want some explanation of why they don't exist

@achasmita was not able to get (1) without making the polygons really large and wasn't able to come up with (2). The next steps that I was going to do was to look at locations where I anticipate having a lot of trips and then seeing why they don't show up in the list

Aa a concrete example, on staging, I would expect to see a lot of trips from my house to the library or to the grocery store nearby or to my kids' school. In particular, I would expect to see at least 100 trips from my house to the local school. Similarly, in the Denver area, you could see the locations that are hotspots in the heatmap and try to see if there are trips between them.

JGreenlee commented 6 months ago

While trying this branch, I initially got this error:

AttributeError: module 'emission.storage.decorations.section_queries' has no attribute 'cleaned2inferred_section_list'

I realized this is because this branch was using an old image of e-mission-server (shankari/e-mission-server:gis-based-mode-detection_2023-04-21--54-09). Which was probably before cleaned2inferred_section_list was added.

Updated to the most recent (shankari/e-mission-server:master_2024-02-10--19-38) and rebuilt.

Now there's a different error:

Traceback (most recent call last):
  File "/usr/src/app/pages/segment_trip_time.py", line 181, in generate_content_on_endpoints_change
    mode_by_section_id = db_utils.query_inferred_sections_modes(
  File "/usr/src/app/utils/db_utils.py", line 200, in query_inferred_sections_modes
    return esds.cleaned2inferred_section_list(sections)
  File "/usr/src/app/emission/storage/decorations/section_queries.py", line 51, in cleaned2inferred_section_list
    matching_inferred_section = cleaned2inferred_section(section_userid.get('user_id'), section_userid.get('section'))
  File "/usr/src/app/emission/storage/decorations/section_queries.py", line 45, in cleaned2inferred_section
    curr_predicted_entry = _get_inference_entry_for_section(user_id, section_id, "analysis/inferred_section", "data.cleaned_section")
  File "/usr/src/app/emission/storage/decorations/section_queries.py", line 66, in _get_inference_entry_for_section
    assert len(ret_list) <= 1, "Found len(ret_list) = %d, expected <=1" % len(ret_list)
AssertionError: Found len(ret_list) = 807, expected <=1

There are 807 inferred sections for one cleaned section??

I inspected the logs to see which UUID + section this is happening for. It happens on: DEBUG:root:About to query {'metadata.key': 'analysis/inferred_section', 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'data.cleaned_section': ObjectId('644df8edea199f1d0473e301')}

Sure enough, there are 807 inferred section entries for that UUID and that cleaned section. They must be duplicate entries because they are identical except for their _id

query = {
  'metadata.key': 'analysis/inferred_section',
  'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'),
  'data.cleaned_section': ObjectId('644df8edea199f1d0473e301')
}

r = edb.get_analysis_timeseries_db().find(query)
for i in r:
    print(i)

{'_id': ObjectId('644e6f67b14cecd84298f6b2'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644e7949cdabcb78bc676484'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644e88388464a359f04a7c74'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644e95d6fac3c75f1a08eb28'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644ea4395e9649cf426dd33d'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644eb16ab2270c7ba1ae53fc'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644ebf8b7c7a20f08cd4d7dc'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644ecddcb4d41475b8330bee'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644edbdf3e1640bffa2a6051'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644ee98a98049428510657bb'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644ef78671732fce1ba61e81'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644f05e5635c15953692eeb0'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
{'_id': ObjectId('644f13cf3a381d0218807744'), 'user_id': UUID('d83a43a1-df6b-42ed-986f-f5b5f6150221'), 'metadata': {'key': 'analysis/inferred_section', 'platform': 'server', 'write_ts': 1682831597.6901028, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 4, 'day': 29, 'hour': 22, 'minute': 13, 'second': 17, 'weekday': 5, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-04-29T22:13:17.690103-07:00'}, 'data': {'source': 'SmoothedHighConfidenceMotion', 'trip_id': ObjectId('644df8dbea199f1d0473e2ff'), 'start_ts': 1659712158.517581, 'start_local_dt': {'year': 2022, 'month': 8, 'day': 5, 'hour': 8, 'minute': 9, 'second': 18, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'start_fmt_time': '2022-08-05T08:09:18.517581-07:00', ...
...

I removed those duplicates and tried again, but there appears to be duplicates for all the other sections too.

I am unsure why there are so many duplicates. Do they exist in the dataset? Or did I create duplicates while I was loading in the dataset?

JGreenlee commented 6 months ago

I removed the duplicates manually with this script:

import emission.core.get_database as edb

# get cleaned sections
cleaned = edb.get_analysis_timeseries_db().find({
    'metadata.key': 'analysis/cleaned_section',
})

cleaned_ct = 0
for c in cleaned:
    cleaned_ct += 1
    query_inferred = {
      'metadata.key': 'analysis/inferred_section',
      'user_id': c['user_id'],
      'data.cleaned_section': c['_id']
    }

    first_inferred = edb.get_analysis_timeseries_db().find_one(query_inferred)
    if first_inferred is None:
        print(f"cleaned section {cleaned_ct} had no inferred sections")
        continue

    # remove all of those entries unless the ID is the first inferred section
    dedup_query = {
      'metadata.key': 'analysis/inferred_section',
      'user_id': c['user_id'],
      'data.cleaned_section': c['_id'],
      '_id': {'$ne': first_inferred['_id']}
    }
    delete_result = edb.get_analysis_timeseries_db().delete_many(dedup_query)
    print(f"removed {delete_result.deleted_count} duplicates from cleaned section {cleaned_ct}")

It took awhile to run.

Now I am finally able to test the Segment Trip Time page.

I would expect to see a lot of trips from my house to the library or to the grocery store nearby or to my kids' school. In particular, I would expect to see at least 100 trips from my house to the local school.

For trips from home to school, I found 222 trips spanning from July 2022 to December 2023. This seems to align with expectations because there are about 180 school days in 1 year.

The boxes I used were about the size of 1 block. I will follow up with smaller boxes

JGreenlee commented 6 months ago

Home to school

\ Home to school with boxes about the size of 1 block, got 222 trips

\ Home to school with boxes about half that size, 197 trips (still pretty good)

The trips span from July 2022 to December 2023. This seems to align with expectations because there are about 180 school days in 1 year. Based on the dates, it generally makes sense – large gaps are observed for summer break etc \

So I do think this is probably a pretty comprehensive measure of this repeated trip

Home to viola class

\ Home to viola class, using smaller boxes, 52 trips

\ This trip is observed almost every week, sometimes twice in the same day but not always

\ 1 "bicycling" trip to viola class which took 18 minutes - I think this was probably just mislabeled

Methodology for drawing boxes

I found these usage guidelines quite helpful and accurate: \

For reference, below are the heatmaps around those 3 areas of interest. I drew the boxes considering where the locus of activity seems to be for each area (and considering the guidelines above)

(I wonder if this tool could potentially be even more useful + easy to use if a heatmap was actually overlayed on the start/end selection area? I found myself switching back and forth often)

\ \ \

Conclusion

Based on this, the tool does appear to work as expected. It captured a fairly comprehensive, if not fully comprehensive, picture of the above recurring trips. I also briefly validated the tool against my own travel data. I think the instructions for usage are clear as well.

The only changes I might suggest would include a heatmap overlay to make it easier to identify places of activity while drawing boxes, and potentially a toggle to "swap" start and end locations. (I see a common use case where the user has seen the duration from A to B - now they want to see the duration from B to A, but they don't want to have to re-draw the boxes)

shankari commented 6 months ago

@JGreenlee thanks for the comprehensive review! Given the length of time that this has been pending, I will merge the changes now for the next round and we can address the UX improvements in a subsequent round.

JGreenlee commented 6 months ago

@shankari Great! ~~When this is merged, the Dockerfile must be updated with a more recent image of e-mission-server~~

While trying this branch, I initially got this error:
AttributeError: module 'emission.storage.decorations.section_queries' has no attribute 'cleaned2inferred_section_list'
I realized this is because this branch was using an old image of e-mission-server (shankari/e-mission-server:gis-based-mode-detection_2023-04-21--54-09). Which was probably before cleaned2inferred_section_list was added.

Updated to the most recent (shankari/e-mission-server:master_2024-02-10--19-38) and rebuilt.

JGreenlee commented 6 months ago

Actually it looks like you just did that a few days ago!

JGreenlee commented 5 months ago

I resolved the merge conflicts for this feature and updated it to observe the global filters (which were added since this feature was created).

apply date range & uuid filters to the 'segment trip time' page

The 'segment trip time' page was written before we had the global filters implemented. So we need to patch them in now to get this feature up to speed and merged.

Datepicker values and excluded uuids (if any) are passed through to the query_segments_crossing_endpoints function. Within this function is the DB call. We pass a time_query arg based on the selected dates + timezone. We also pass a query in the extra_query_list arg that will exclude any entries where the user_id is found in the excluded_uuids list.

@shankari These changes are on my branch https://github.com/JGreenlee/op-admin-dashboard/tree/segment_trip_time_resolved_conflicts

e-mission / op-admin-dashboard