Patches: nan summaries breaking some deployments, filter land modes on underlying (not display) label

Abby-Wheelis commented 1 month ago

To address #140 by updating the way we filter the data from display mode to underlying mode

What I am now using is mode_confirm, which should work for now (all of the label sets we have now using the label "air", but I think using the baseMode from the label file, which would be AIR, but that change will be somewhat more complicated.

If I figure out why there are so many Unknown ble mode entries in dfc-fermata, I'll fix that in this PR as well.

Abby-Wheelis commented 1 month ago

Looking into what's happening with dfc-fermata by comparing May in the public and admin dashboards, and the results are the same. There are about 63% UNKNOWN trips on both dashboards for May, so the display is accurate.

We should keep looking into why there are so many trips that are not connected to a mode from a BLE beacon

Abby-Wheelis commented 1 week ago

Testing with open-access data for the patch to #142

Screenshot 2024-09-06 at 11 47 43 AM

Abby-Wheelis commented 1 week ago

So... air -> Air is not in our current csv way of translating the labels, so they get grouped as "Other"

Great advocacy for getting those labels unified as soon as we can.

I think the quickest fix would be to add "air -> Air" to the csv, flagging that we should get the labels unified (I would try to do that here, but I think #148 / #143 needs to be completed first to prevent mode chaos. The we can retire the csvs as soon as possible as a follow-on/clean-up to the unification of colors (which is proving to require more work with labels than we first thought)

Abby-Wheelis commented 1 week ago

@iantei feel free to chime in on what you think about my approach to add air->Air to the csv for now, with the plan of unifying the labels and filtering on the base mode as soon as feasible.

Abby-Wheelis commented 1 week ago

This fixes #140 and #142

iantei commented 1 week ago

I think the approach to add air->Air to the csv would cover up for the default mapping case, which uses the csv for the mapping. For the dynamic labels, currently we have usaid-laos-ev, godcgo-label-options label_options which have MODE.value = air, but caebike-label-options doesn't have MODE.value = air. Would the above change also cover for this program?

An alternative temporary solution, which I can also think about is to modify the merge_small_entries(labels, values) function to restrict the conversion of labels = air from air to Other. I think introducing new label on csv is also a good solution.

iantei commented 1 week ago

The code changes look good! Could you please test for the caebike-label-options program too?

Abby-Wheelis commented 1 week ago

@iantei since air is not an option in that program, I would expect any airplane trips to have a user-entered label and so they would get grouped as other, I'll test with that dataset just to be sure!

Abby-Wheelis commented 1 week ago

The two labeled charts look the same, despite the fact that there are quite a few trips that do seem to be "air" - but are user-custom labels. I wonder if ca-ebike would like to add the air mode? It would make their distance charts more representative (and not overwhelmed by "Other")

Abby-Wheelis commented 1 week ago

Counts are also largely "Other" - IIRC from the data, much of this might be testing related.

shankari commented 1 week ago

Can we get the breakdown of the "Other" to verify this?

At a high level, in a future fix, I think we should consider user labels if they represent a large enough proportion of the aggregate data. So if 40% of the data is "Other" and 90% of the "Other" is "Air", maybe we should just show "Air". But that is going to require a lot of judgement calls - what is "large enough proportion"? what if the text entered by users is subtly different? so let's punt on that for now.

Abby-Wheelis commented 1 week ago

Screenshot 2024-09-07 at 10 42 15 AM Screenshot 2024-09-07 at 10 37 36 AM

The labeled trips at least are not air. Labels not in the dynamic labels include - drove_alone, shared_ride, bus, and e_car_shared_ride - all of the data that I have in this dump is from test users, and I'm guessing a lot of it is from before they added their custom labels.

The air/HSR trips (39/79,000 miles) must be unlabeled.

shankari commented 1 week ago

all of the data that I have in this dump is from test users, and I'm guessing a lot of it is from before they added their custom labels.

Yup, makes sense.

Abby-Wheelis commented 1 week ago

So aside from the tabled point of

At a high level, in a future fix, I think we should consider user labels if they represent a large enough proportion of the aggregate data. So if 40% of the data is "Other" and 90% of the "Other" is "Air", maybe we should just show "Air". But that is going to require a lot of judgement calls - what is "large enough proportion"? what if the text entered by users is subtly different? so let's punt on that for now.

This seems to be working as expected, and can serve as a patch pending the more principled unification of labels, and usage of base modes (I added to the csv, and we still filter on mode_confirm, should update to baseMode when we get rid of the csvs)

shankari commented 6 days ago

I'm not going to insist on this right now, but in a cleanup change, I think it would be helpful to maybe have a unified layer that handles both dashboards. Let's wait until the carbon calculations are integrated into the public dashboard, though. If the unified layer is small and generic enough, it could even go into the server codebase, so that anybody else working with mongodumps (e.g. TSDC) could use it as well.

Abby-Wheelis commented 5 days ago

I have tested with marking the entries that are malformated as INVALID as we do on the admin dash, but if I do that, they get grouped as OTHER since INVALID is not a valid sensed mode, but later when we merge small entries, they get marked as Other, leaving both OTHER and Other on the chart. I see 3 paths forward with this:

1) Mark the data as UNKNOWN rather than INVALID (arguably, the mode is unknown, but because the data is invalid) 2) add INVALID to the list of valid modes - it might become a segment in the chart, and would flag to us and program admins that there is an issue with the data 3) Continue to allow INVALID to be marked for display as OTHER, but prevent OTHER and Other from appearing on the same chart (merge small entries into OTHER for sensed modes, not Other

I'll collect opinions at our meeting later today!

Abby-Wheelis commented 4 days ago

They should be marked as INVALID and I should fix OTHER vs Other

Abby-Wheelis commented 4 days ago

What the chart with invalid data will look like now: What it would look like if INVALID were OTHER (to test if existing OTHER gets pooled with the small entries)

e-mission / em-public-dashboard

Patches: nan summaries breaking some deployments, filter land modes on underlying (not display) label #141