✨ Incorporate the sensed/inferred/BLE sensed primary modes and survey results

shankari commented 1 month ago

…rip table

My original thought was that this would be a simple 2-3 line change where I would enable the cleaned/inferred mode summaries in the set of valid columns.

However, while actually implementing it, I ran into several issues:

we display the primary sensed mode in the public dashboard, so we can't just display the summary, we have to extract the primary mode
I then switched to using the primary mode computation from the public dashboard, only to discover that it needs the summary as an object. However, the previous implementation of this code used pd.json_normalize, which resulted in columns like

       'data.inferred_section_summary.distance.TRAIN',
       'data.inferred_section_summary.duration.TRAIN',
       'data.inferred_section_summary.count.TRAIN',
       'data.inferred_section_summary.distance.TRAM',
       'data.inferred_section_summary.duration.TRAM',
       'data.inferred_section_summary.count.TRAM'],

and not the expected

'inferred_section_summary': {'distance': {'CAR': ...}}

To workaround this, I switched to using get_data_df, as we do in the public dashboard. This gave us the correct objects but changed the names of the columns so that they no longer had the data prefix. This could lead to backwards compat issues with the excluded columns from the configs.

      "data_trips_columns_exclude": ["data.start_loc.coordinates", "data.end_loc.coordinates"],

So the current solution is:

use data_df
rename the columns using a dynamically generated mapper for backwards compat
find the primary mode in the summaries
enable the primary modes as valid columns

The rest of the code is untouched

Testing done: See screenshots in PR

Note further that the list of valid cols has data.start_local_dt and data.end_local_dt however, the actual columns are end_local_dt_day etc so are not renamed. I double checked, and I am not sure how this ever worked - pd.json_normalize also returns

data.end_local_dt.year
data.end_local_dt.month
data.end_local_dt.day
data.end_local_dt.hour

and I have verified that the columns don't show up on staging

@JGreenlee @louisg1337 here's another area to unify data handling.

Note finally that I also tried enabling the full summary objects, but they made the table very long and fairly confusing, so I removed them again. They can be enabled by editing utils/constants.py again

shankari commented 1 month ago

Staging (without changes):

With changes:

shankari commented 1 month ago

We can see both the primary mode and the labeled modes

shankari commented 1 month ago

Screenshot of the full summary

shankari commented 1 month ago

Note that this still has a regression, in that we don't use the user-friendly names, but the raw column names, such as data.end_fmt_time. We need to make sure that the mappings work.

shankari commented 1 month ago

Ah this is not a regression from this change - we see the same value on staging. Spending ~ 10 mins figuring out whether I can fix this...

shankari commented 1 month ago

This is also extremely messy. If we add additional log statements, we see

DEBUG:root:Before filtering, df columns are Index(['source', 'end_ts', 'end_fmt_time', 'end_loc', 'raw_trip', 'start_ts',
       'start_fmt_time', 'start_loc', 'duration', 'distance', 'start_place',
       'end_place', 'cleaned_trip', 'inferred_labels', 'inferred_trip',
       'expectation', 'confidence_threshold', 'expected_trip',
       'inferred_section_summary', 'cleaned_section_summary', 'user_input',
       'additions', 'start_local_dt_year', 'start_local_dt_month',
       'start_local_dt_day', 'start_local_dt_hour', 'start_local_dt_minute',
       'start_local_dt_second', 'start_local_dt_weekday',
       'start_local_dt_timezone', 'end_local_dt_year', 'end_local_dt_month',
       'end_local_dt_day', 'end_local_dt_hour', 'end_local_dt_minute',
       'end_local_dt_second', 'end_local_dt_weekday', 'end_local_dt_timezone',
       '_id', 'user_id', 'metadata_write_ts', 'confirmed_place'],
      dtype='object')
DEBUG:root:Rename mapping is {'start_local_dt': 'data.start_local_dt', 'start_fmt_time': 'data.start_fmt_time', 'end_local_dt': 'data.end_local_dt', 'end_fmt_time': 'data.end_fmt_time', 'duration': 'data.duration', 'duration_seconds': 'data.duration_seconds', 'distance': 'data.distance', 'distance_km': 'data.distance_km', 'distance_miles': 'data.distance_miles', 'distance_meters': 'data.distance_meters', 'start_loc.coordinates': 'data.start_loc.coordinates', 'end_loc.coordinates': 'data.end_loc.coordinates', 'primary_sensed_mode': 'data.primary_sensed_mode', 'primary_predicted_mode': 'data.primary_predicted_mode', 'primary_ble_sensed_mode': 'data.primary_ble_sensed_mode', 'mode_confirm': 'mode_confirm', 'purpose_confirm': 'purpose_confirm', 'replaced_mode': 'replaced_mode', 'user_id': 'user_id'}
DEBUG:root:After renaming columns, they are Index(['source', 'end_ts', 'data.end_fmt_time', 'end_loc', 'raw_trip',
       'start_ts', 'data.start_fmt_time', 'start_loc', 'data.duration',
       'data.distance', 'start_place', 'end_place', 'cleaned_trip',
       'inferred_labels', 'inferred_trip', 'expectation',
       'confidence_threshold', 'expected_trip', 'inferred_section_summary',
       'cleaned_section_summary', 'user_input', 'additions',
       'start_local_dt_year', 'start_local_dt_month', 'start_local_dt_day',
       'start_local_dt_hour', 'start_local_dt_minute', 'start_local_dt_second',
       'start_local_dt_weekday', 'start_local_dt_timezone',
       'end_local_dt_year', 'end_local_dt_month', 'end_local_dt_day',
       'end_local_dt_hour', 'end_local_dt_minute', 'end_local_dt_second',
       'end_local_dt_weekday', 'end_local_dt_timezone', '_id', 'user_id',
       'metadata_write_ts', 'confirmed_place'],
      dtype='object')
DEBUG:root:No BLE support found, not fleet version, ignoring...
DEBUG:root:allowed trip columns are {'replaced_mode', 'data.distance', 'data.end_fmt_time', 'data.duration_seconds', 'mode_confirm', 'user_id', 'purpose_confirm', 'data.start_local_dt', 'data.end_local_dt', 'data.primary_sensed_mode', 'data.primary_ble_sensed_mode', 'data.distance_km', 'data.distance_miles', 'data.duration', 'cleaned_section_summary', 'data.start_fmt_time', 'data.primary_predicted_mode', 'data.distance_meters'}
DEBUG:root:get_all_trip_columns: after additional columns, curr set is {'replaced_mode', 'data.end_local_dt', 'data.distance', 'data.end_fmt_time', 'data.primary_sensed_mode', 'data.duration_seconds', 'mode_confirm', 'user_id', 'data.primary_ble_sensed_mode', 'data.distance_km', 'data.distance_miles', 'data.duration', 'cleaned_section_summary', 'data.start_fmt_time', 'purpose_confirm', 'data.start_local_dt', 'data.start_loc.coordinates', 'data.end_loc.coordinates', 'data.primary_predicted_mode', 'data.distance_meters'}
DEBUG:root:Handling named_col={'label': 'trip_start_time_str', 'path': 'data.start_fmt_time'}
DEBUG:root:Handling named_col={'label': 'trip_end_time_str', 'path': 'data.end_fmt_time'}
DEBUG:root:Handling named_col={'label': 'start_coordinates', 'path': 'data.start_loc.coordinates'}
DEBUG:root:Handling named_col={'label': 'end_coordinates', 'path': 'data.end_loc.coordinates'}
DEBUG:root:After renaming to user friendly values, we get df.columns=Index(['replaced_mode', 'data.distance', 'data.end_fmt_time',
       'data.primary_sensed_mode', 'mode_confirm', 'user_id', 'data.duration',
       'cleaned_section_summary', 'data.start_fmt_time', 'purpose_confirm',
       'data.start_loc.coordinates', 'data.end_loc.coordinates',
       'data.primary_predicted_mode', 'trip_start_time_str',
       'trip_end_time_str', 'start_coordinates', 'end_coordinates'],
      dtype='object')
DEBUG:root:After filtering, the actual data is 0    2023-07-25T10:15:37.414862+10:00
1    2023-07-22T19:11:27.819006-07:00
2    2023-07-22T15:09:01.327622-07:00
3    2023-07-22T14:19:03.412194-07:00
4    2023-07-22T20:07:06.645749-07:00
Name: trip_start_time_str, dtype: object
INFO:werkzeug:192.168.65.1 - - [08/Jul/2024 00:36:32] "POST /_dash-update-component HTTP/1.1" 200 -
DEBUG:root:allowed trip columns are {'replaced_mode', 'data.distance', 'data.end_fmt_time', 'data.duration_seconds', 'mode_confirm', 'user_id', 'purpose_confirm', 'data.start_local_dt', 'data.end_local_dt', 'data.primary_sensed_mode', 'data.primary_ble_sensed_mode', 'data.distance_km', 'data.distance_miles', 'data.duration', 'cleaned_section_summary', 'data.start_fmt_time', 'data.primary_predicted_mode', 'data.distance_meters'}

So it looks like the issue is that the renamed cols are not allowed. But there's a bunch of interlocking get_XXX calls in the permissions module which is extremely messy and just crying out for a rewrite. I am fixing this with another hack (adding a call to get_required_named_trip_columns to get_allowed_trip_columns), but we should really clean up and simplify this logic.

I got that to work, but now we have two copies of the data

DEBUG:root:After renaming to user friendly values, we get df.columns=Index(['data.start_loc.coordinates', 'data.duration',
       'data.primary_sensed_mode', 'purpose_confirm', 'replaced_mode',
       'data.primary_predicted_mode', 'data.start_fmt_time',
       'data.end_loc.coordinates', 'data.distance', 'data.end_fmt_time',
       'user_id', 'cleaned_section_summary', 'mode_confirm',
       'trip_start_time_str', 'trip_end_time_str', 'start_coordinates',
       'end_coordinates'],
      dtype='object')
DEBUG:root:After filtering, the actual data is 0    2023-07-25T10:15:37.414862+10:00
1    2023-07-22T19:11:27.819006-07:00
2    2023-07-22T15:09:01.327622-07:00
3    2023-07-22T14:19:03.412194-07:00
4    2023-07-22T20:07:06.645749-07:00
Name: trip_start_time_str, dtype: object
INFO:werkzeug:192.168.65.1 - - [08/Jul/2024 00:53:22] "GET /_reload-hash HTTP/1.1" 200 -
INFO:werkzeug:192.168.65.1 - - [08/Jul/2024 00:53:22] "POST /_dash-update-component HTTP/1.1" 200 -
DEBUG:root:allowed trip columns are {'data.duration_seconds', 'data.duration', 'data.primary_sensed_mode', 'data.primary_ble_sensed_mode', 'replaced_mode', 'data.primary_predicted_mode', 'user_id', 'cleaned_section_summary', 'data.distance_miles', 'mode_confirm', 'data.start_local_dt', 'data.end_local_dt', 'purpose_confirm', 'trip_end_time_str', 'data.start_fmt_time', 'trip_start_time_str', 'data.distance', 'data.end_fmt_time', 'start_coordinates', 'end_coordinates', 'data.distance_km', 'data.distance_meters'}

I note also that after setting the value, we used to drop the old value, but apparently that was commented out https://github.com/e-mission/op-admin-dashboard/blob/3c5c2c51643b2a16aa4201318cb4ddfc930142cb/utils/db_utils.py#L89

I am not sure if this ever worked, and I don't want to muck around with this any more (this was already a last minute change to support a deployer). It did not work although somebody has refactored and removed by original get_display_columns, which would have been somewhat more meaningfully named https://github.com/e-mission/op-admin-dashboard/issues/29#issuecomment-1530902392

and although I have commented out the code to drop the paths, I don't see the paths and only the labels. This seems to be working (sort of) for now, so will defer a more principled fix until @AlirezaRa94 returns

Dropping this for now until somebody takes a giant weed whacker to this and makes it nice.

shankari commented 1 month ago

One final fix - right now, we have the mode/purpose in the list of valid columns, so this won't work for survey results, such as GSA GPG. Since I am in here anyway, I am now going to see if I can change that so have all the user input columns be added. Once that is done, I am going to declare victory on this PR and move on.

shankari commented 1 month ago

Testing screenshots showing both labels and survey responses in the trip table.

shankari commented 1 month ago

I declare these changes done. @JGreenlee @louisg1337 @abby-wheelis LMK if any of you wants to review this; otherwise I will plan on merging on Monday

e-mission / op-admin-dashboard

✨ Incorporate the sensed/inferred/BLE sensed primary modes and survey results #118