Open Abby-Wheelis opened 1 month ago
Looking into what's happening with dfc-fermata
by comparing May in the public and admin dashboards, and the results are the same. There are about 63% UNKNOWN
trips on both dashboards for May, so the display is accurate.
We should keep looking into why there are so many trips that are not connected to a mode from a BLE beacon
Testing with open-access
data for the patch to #142
So... air -> Air is not in our current csv way of translating the labels, so they get grouped as "Other"
Great advocacy for getting those labels unified as soon as we can.
I think the quickest fix would be to add "air -> Air" to the csv, flagging that we should get the labels unified (I would try to do that here, but I think #148 / #143 needs to be completed first to prevent mode chaos. The we can retire the csvs as soon as possible as a follow-on/clean-up to the unification of colors (which is proving to require more work with labels than we first thought)
@iantei feel free to chime in on what you think about my approach to add air->Air to the csv for now, with the plan of unifying the labels and filtering on the base mode as soon as feasible.
This fixes #140 and #142
I think the approach to add air->Air to the csv would cover up for the default mapping case, which uses the csv for the mapping.
For the dynamic labels, currently we have usaid-laos-ev
, godcgo-label-options
label_options which have MODE.value = air
, but caebike-label-options
doesn't have MODE.value = air
.
Would the above change also cover for this program?
An alternative temporary solution, which I can also think about is to modify the merge_small_entries(labels, values)
function to restrict the conversion of labels = air
from air
to Other
.
I think introducing new label on csv is also a good solution.
The code changes look good! Could you please test for the caebike-label-options
program too?
@iantei since air
is not an option in that program, I would expect any airplane trips to have a user-entered label and so they would get grouped as other, I'll test with that dataset just to be sure!
The two labeled charts look the same, despite the fact that there are quite a few trips that do seem to be "air" - but are user-custom labels. I wonder if ca-ebike
would like to add the air mode? It would make their distance charts more representative (and not overwhelmed by "Other")
Counts are also largely "Other" - IIRC from the data, much of this might be testing related.
Can we get the breakdown of the "Other" to verify this?
At a high level, in a future fix, I think we should consider user labels if they represent a large enough proportion of the aggregate data. So if 40% of the data is "Other" and 90% of the "Other" is "Air", maybe we should just show "Air". But that is going to require a lot of judgement calls - what is "large enough proportion"? what if the text entered by users is subtly different? so let's punt on that for now.
The labeled trips at least are not air. Labels not in the dynamic labels include - drove_alone, shared_ride, bus, and e_car_shared_ride - all of the data that I have in this dump is from test users, and I'm guessing a lot of it is from before they added their custom labels.
The air/HSR trips (39/79,000 miles) must be unlabeled.
all of the data that I have in this dump is from test users, and I'm guessing a lot of it is from before they added their custom labels.
Yup, makes sense.
So aside from the tabled point of
At a high level, in a future fix, I think we should consider user labels if they represent a large enough proportion of the aggregate data. So if 40% of the data is "Other" and 90% of the "Other" is "Air", maybe we should just show "Air". But that is going to require a lot of judgement calls - what is "large enough proportion"? what if the text entered by users is subtly different? so let's punt on that for now.
This seems to be working as expected, and can serve as a patch pending the more principled unification of labels, and usage of base modes (I added to the csv, and we still filter on mode_confirm
, should update to baseMode
when we get rid of the csvs)
I'm not going to insist on this right now, but in a cleanup change, I think it would be helpful to maybe have a unified layer that handles both dashboards. Let's wait until the carbon calculations are integrated into the public dashboard, though. If the unified layer is small and generic enough, it could even go into the server codebase, so that anybody else working with mongodumps (e.g. TSDC) could use it as well.
I have tested with marking the entries that are malformated as INVALID
as we do on the admin dash, but if I do that, they get grouped as OTHER
since INVALID
is not a valid sensed mode, but later when we merge small entries, they get marked as Other
, leaving both OTHER
and Other
on the chart. I see 3 paths forward with this:
1) Mark the data as UNKNOWN
rather than INVALID
(arguably, the mode is unknown, but because the data is invalid)
2) add INVALID
to the list of valid modes - it might become a segment in the chart, and would flag to us and program admins that there is an issue with the data
3) Continue to allow INVALID
to be marked for display as OTHER
, but prevent OTHER
and Other
from appearing on the same chart (merge small entries into OTHER
for sensed modes, not Other
I'll collect opinions at our meeting later today!
They should be marked as INVALID
and I should fix OTHER
vs Other
What the chart with invalid data will look like now:
What it would look like if INVALID
were OTHER
(to test if existing OTHER
gets pooled with the small entries)
To address #140 by updating the way we filter the data from display mode to underlying mode
What I am now using is
mode_confirm
, which should work for now (all of the label sets we have now using the label "air", but I think using thebaseMode
from the label file, which would beAIR
, but that change will be somewhat more complicated.If I figure out why there are so many
Unknown
ble mode entries indfc-fermata
, I'll fix that in this PR as well.