GabrielKS opened this issue 3 years ago
First, these numbers are likely wrong: although I built the model with a radius of 500 (https://github.com/e-mission/e-mission-server/pull/829/files#diff-5281eebce1b462a2a39465cd785e4f36572ec618a3a79efbdfdf35fc508a9c90R64),
I forgot to change the radius for the prediction, which is still 100 (https://github.com/e-mission/e-mission-server/pull/829/files#diff-18a304dace1163481f6faf1cd707af237b2d6766b4508e73caeacb1f51056b48R65).
We should re-analyse after doing that.
To re-run after changing the radius locally, use https://github.com/e-mission/e-mission-server/pull/829#issuecomment-892186091
First, I don't think that the results on the labeled data are meaningful, because we are essentially testing the model on the training data. From the aggregate results, focusing only on the unlabeled trips, it seems like we want a threshold of something like 0.4? That would ensure that most trips that have inferred labels will show them instead of being converted to red labels, while still filtering out the very low-quality labels.
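To make the thresholding concrete, here is a minimal sketch (not the actual server code) of how a confidence cutoff could be applied to the inferred-label structure shown later in this thread, where each candidate is a dict with a 'labels' tuple and a probability 'p':

```python
# Minimal sketch of confidence thresholding; not the server's actual implementation.
# `inferred_labels` is assumed to be the list-of-dicts structure shown later in this
# thread, e.g. [{'labels': {...}, 'p': 0.333}, ...].
def passes_threshold(inferred_labels, threshold=0.4):
    """Return True if the most likely label tuple meets the confidence threshold."""
    if not inferred_labels:
        return False  # no inference at all, treated as probability 0
    top_p = max(entry["p"] for entry in inferred_labels)
    return top_p >= threshold
```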
Thinking out loud: we have a lot of unlabeled trips with no inferences, and, based on https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-887294253, there is significant variability in the labeling percentage across users, primarily determined by how many labeled trips we already have.
So we may want to briefly plot a per-user histogram, or focus only on users with more than 50-100 trips, because if people have too few trips, we can tell them that we won't be able to predict.
Or maybe have two different distributions for people with > 20% labeled vs. < 20% labeled, etc.
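A rough sketch of that per-user breakdown, assuming (hypothetically) a pandas DataFrame of confirmed trips with user_id and is_labeled columns; these are not the notebook's actual variable names:

```python
import pandas as pd

# `trips` is assumed to be a DataFrame of confirmed trips with hypothetical columns:
# user_id, and is_labeled (True if the user filled in all labels for the trip).
def per_user_labeling_stats(trips: pd.DataFrame) -> pd.DataFrame:
    stats = trips.groupby("user_id").agg(
        n_trips=("is_labeled", "size"),
        n_labeled=("is_labeled", "sum"),
    )
    stats["labeled_frac"] = stats["n_labeled"] / stats["n_trips"]
    return stats

# Example usage: histogram of labeling fraction for users with enough trips,
# and a split at the 20% labeled mark.
# stats = per_user_labeling_stats(trips)
# stats[stats.n_trips >= 50]["labeled_frac"].hist(bins=20)
# high = stats[stats.labeled_frac > 0.2]; low = stats[stats.labeled_frac <= 0.2]
```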
@GabrielKS one challenge with configuring 0.4 is the `same_mode` issue. We used to allow `same_mode` as an option for the replaced mode, and at least sometimes, the percentage goes below 0.4 because of the slight differences there - e.g. see below.
[{'labels': {'mode_confirm': 'drove_alone',
'purpose_confirm': 'home',
'replaced_mode': 'drove_alone'},
'p': 0.3333333333333333},
{'labels': {'mode_confirm': 'drove_alone',
'purpose_confirm': 'home',
'replaced_mode': 'same_mode'},
'p': 0.08333333333333333},
{'labels': {'mode_confirm': 'drove_alone',
'purpose_confirm': 'shopping',
'replaced_mode': 'drove_alone'},
'p': 0.16666666666666666},
{'labels': {'mode_confirm': 'drove_alone',
'purpose_confirm': 'shopping',
'replaced_mode': 'same_mode'},
'p': 0.08333333333333333},
{'labels': {'mode_confirm': 'shared_ride',
'purpose_confirm': 'home',
'replaced_mode': 'drove_alone'},
'p': 0.25},
{'labels': {'mode_confirm': 'shared_ride',
'purpose_confirm': 'shopping',
'replaced_mode': 'drove_alone'},
'p': 0.08333333333333333}]
Remapping `same_mode` to the same value as `mode_confirm` should solve that problem.
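A minimal sketch of that remapping, based only on the inferred-label structure in the example above (illustrative; the server has its own replaced-mode mapping code, linked further down in this thread):

```python
from collections import defaultdict

def remap_same_mode(inferred_labels):
    """Replace replaced_mode == 'same_mode' with the tuple's mode_confirm and merge
    any label tuples that become identical, summing their probabilities.
    Illustrative sketch only, based on the structure shown in the example above."""
    merged = defaultdict(float)
    for entry in inferred_labels:
        labels = dict(entry["labels"])  # copy so the input is not mutated
        if labels.get("replaced_mode") == "same_mode":
            labels["replaced_mode"] = labels["mode_confirm"]
        merged[tuple(sorted(labels.items()))] += entry["p"]
    return [{"labels": dict(key), "p": p} for key, p in merged.items()]
```

Applied to the example above, the two drove_alone/home entries (0.333 and 0.083) would merge into one with p ≈ 0.417, and similarly for drove_alone/shopping.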
I removed the `same_mode` label from the UI on June 5th (https://github.com/e-mission/e-mission-phone/commit/1806e6c72e9f5262c03bec230ce12971296ffd0f#diff-8f7e0cbf2ba6bd210c65bfcac14614c6fabfd3bd95b99f6d2974c615ddcef159), so none of the actual participants would ever have selected `same_mode`. But we should still handle it on the staging server before tuning.
The Jupyter Notebook I used for analysis is now pushed (without the results) to https://github.com/GabrielKS/e-mission-eval-private-data/tree/inference_confidence_analysis.
Redoing the inference myself with the existing 100m radius threshold, I get slightly different numbers, perhaps because there were trips added to the dataset after inference had been run:
Probability distribution of all fully labeled:
Probability: number of trips (percentage of trips)
{
1.000: 718 (37.20%)
0.900: 10 (0.52%)
0.875: 8 (0.41%)
0.833: 20 (1.04%)
0.714: 2 (0.10%)
0.667: 12 (0.62%)
0.571: 6 (0.31%)
0.556: 7 (0.36%)
0.500: 63 (3.26%)
0.429: 12 (0.62%)
0.407: 27 (1.40%)
0.400: 25 (1.30%)
0.367: 20 (1.04%)
0.333: 22 (1.14%)
0.300: 21 (1.09%)
0.286: 7 (0.36%)
0.235: 15 (0.78%)
0.000: 935 (48.45%)
}
Probability distribution of all fully unlabeled:
Probability: number of trips (percentage of trips)
{
1.000: 89 (3.40%)
0.900: 8 (0.31%)
0.875: 9 (0.34%)
0.833: 19 (0.73%)
0.714: 1 (0.04%)
0.556: 2 (0.08%)
0.500: 66 (2.52%)
0.429: 57 (2.18%)
0.400: 83 (3.17%)
0.333: 1 (0.04%)
0.286: 1 (0.04%)
0.000: 2284 (87.18%)
}
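As an aside, the bucketed distributions above can be reproduced with something along these lines (a hypothetical helper, not the notebook's actual code; it assumes each confirmed trip carries the inferred-label list shown earlier, with an empty list counted as probability 0):

```python
from collections import Counter

def probability_distribution(trips):
    """Print 'probability: number of trips (percentage)' buckets for the most
    likely inferred label tuple of each trip. `trips` is assumed to be an iterable
    of dicts with an 'inferred_labels' list in the format shown earlier."""
    top_ps = [
        max((entry["p"] for entry in trip.get("inferred_labels", [])), default=0.0)
        for trip in trips
    ]
    counts = Counter(round(p, 3) for p in top_ps)
    total = len(top_ps)
    for p in sorted(counts, reverse=True):
        print(f"{p:.3f}: {counts[p]} ({100 * counts[p] / total:.2f}%)")
```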
Changing the 100m radius threshold to 500m, I get significantly different results:
Probability distribution of all fully labeled:
Probability: number of trips (percentage of trips)
{
1.000: 751 (39.69%)
0.917: 12 (0.63%)
0.900: 10 (0.53%)
0.875: 8 (0.42%)
0.833: 24 (1.27%)
0.800: 10 (0.53%)
0.786: 14 (0.74%)
0.778: 9 (0.48%)
0.714: 7 (0.37%)
0.700: 10 (0.53%)
0.692: 13 (0.69%)
0.667: 45 (2.38%)
0.600: 15 (0.79%)
0.583: 23 (1.22%)
0.571: 6 (0.32%)
0.556: 27 (1.43%)
0.545: 11 (0.58%)
0.531: 23 (1.22%)
0.500: 179 (9.46%)
0.455: 11 (0.58%)
0.452: 42 (2.22%)
0.429: 21 (1.11%)
0.419: 42 (2.22%)
0.407: 27 (1.43%)
0.400: 30 (1.59%)
0.393: 159 (8.40%)
0.375: 8 (0.42%)
0.367: 30 (1.59%)
0.360: 25 (1.32%)
0.333: 78 (4.12%)
0.312: 32 (1.69%)
0.300: 40 (2.11%)
0.294: 17 (0.90%)
0.286: 7 (0.37%)
0.250: 58 (3.07%)
0.235: 15 (0.79%)
0.222: 6 (0.32%)
0.200: 18 (0.95%)
0.000: 29 (1.53%)
}
Probability distribution of all fully unlabeled:
Probability: number of trips (percentage of trips)
{
1.000: 172 (7.06%)
0.917: 1 (0.04%)
0.900: 8 (0.33%)
0.875: 9 (0.37%)
0.833: 20 (0.82%)
0.800: 4 (0.16%)
0.786: 17 (0.70%)
0.750: 3 (0.12%)
0.714: 11 (0.45%)
0.667: 2 (0.08%)
0.583: 3 (0.12%)
0.556: 12 (0.49%)
0.531: 8 (0.33%)
0.500: 53 (2.18%)
0.452: 6 (0.25%)
0.429: 66 (2.71%)
0.419: 76 (3.12%)
0.400: 102 (4.19%)
0.393: 2 (0.08%)
0.360: 5 (0.21%)
0.333: 75 (3.08%)
0.300: 1 (0.04%)
0.294: 4 (0.16%)
0.286: 1 (0.04%)
0.250: 35 (1.44%)
0.000: 1740 (71.43%)
}
Graphs for 500m:
Many (544) more unlabeled trips have some sort of inference now. There is no longer nearly so strong a cutoff at 0.4.
From this, it looks like the threshold should be 0.25, but then, we will basically not exclude anything. Do you get different results on a per-user basis, or only looking at users with lots of trips?
Regarding "But we should handle it on the staging server before tuning": the obvious fix would be to change the inputs in the database and re-run the pipeline. An alternate solution would be to fix it in the code, but that would require special-casing the handling of the replaced mode instead of working with user inputs generically.
There is existing code to map the `same_mode` replaced mode (https://github.com/e-mission/e-mission-server/pull/829/files#diff-c7ece2e6b65a06d6fd262e2ca047f676b4050b865a1b2a1b3f91a85b72ca5460R48), but unfortunately it is only called from the second round for now. And of course, it is specific to the replaced mode. We could re-introduce that for now instead of modifying the user inputs.
That middle graph, if we only consider users who have labeled at least 10 or 20 trips (15/39 = 38% of users; there are no users who have labeled between 10 and 20 trips):
If we only consider those who have labeled at least 50 trips (9/39 = 23% of users):
I don't see that this suggests any obvious way forward.
@GabrielKS a back-of-the-envelope estimate of the difference: https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-892420975
Some actual values are:
| | cluster_label | before_unique_combo_len | after_unique_combo_len | before_max_p | after_max_p |
|---|---|---|---|---|---|
| 25 | 0 | 10 | 9 | 0.423077 | 0.615385 |
| 26 | 1 | 5 | 4 | 0.545455 | 0.727273 |
| 27 | 2 | 6 | 4 | 0.4 | 0.5 |
| 29 | 4 | 6 | 5 | 0.166667 | 0.333333 |
| 32 | 7 | 2 | 1 | 0.666667 | 1 |
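Taking the last row as an example: before the remapping the cluster had two unique label combinations and a maximum probability of 2/3, consistent with two combinations with probabilities 2/3 and 1/3 that presumably differ only in the replaced_mode (drove_alone vs. same_mode); after the remapping they merge into a single combination with probability 1.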
@GabrielKS If the out-and-back errors are common, both on staging and on the real deployments, I think I might be able to come up with a way to fix at least that pattern automatically. But it will take me ~ 3-4 days with no time for anything else.
Detecting the pattern (as opposed to fixing it) is much easier - I've basically already implemented it.
I think it might be worthwhile to work this into the expectation code somehow, maybe to mark trips as "double check". Let's discuss at today's meeting.
The middle graph with a threshold of 50 trips, after the `same_mode` mapping was re-enabled:
Although this is going to make the threshold meaningless at this point, I think we should go with 25% as the threshold, i.e., show all the trips. This is because although the number of trips affected is low, the number of trips for which we have inferences at all is also low. I think it is more important to give people the sense that we're doing something than it is to be perfect on accuracy. We can always tune this after the first two weeks if needed, although analysing those results will be future work.
Just for the record, the values here don't seem to change much (https://github.com/e-mission/e-mission-docs/issues/656#issuecomment-893024116), but when I did the side-by-side comparison (with boxplots), I got a pretty significant change: https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-892724090
One difference between the two is that @GabrielKS is looking at the matched inferences, while I am looking at the clusters in the model. So maybe this is skewed by the fact that there aren't a lot of matches?
So using the actual inferences lets us look at what the impact on this set of users and this trip history would be, while looking at the clusters directly shows what could happen if we had better matching.
But that also seems to argue for something between 20% and 40%, so I am happy with 25%.
To be able to make an informed decision on what confidence threshold we should use in the staging test of the new Label UI, I did some analysis of what label inference confidence looked like in the staging data. I first chose an individual user known to label many of their trips. Looking only at labeled trips, I found that the most likely inferences generated (i.e., the label tuple with the highest probability in the inference data structure) fell into 7 buckets. Here, trips with an empty inference data structure were counted as having a probability of 0.
I then compared these stated probability values to the fraction of trips in each bucket for which the inference actually matched the user labels:
Presumably such a close correspondence here is due to how the clustering algorithm behaves, having been trained on this very data.
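A sketch of how that comparison can be computed, using hypothetical per-trip records (not the notebook's actual variable names): for each stated probability bucket, take the fraction of trips whose top inferred label tuple matched the user's labels.

```python
import pandas as pd

# Hypothetical per-trip records: the probability of the top inferred label tuple
# and whether that tuple matched the user's actual labels.
# records = pd.DataFrame({"top_p": [...], "matches_user_labels": [...]})
def calibration_by_bucket(records: pd.DataFrame) -> pd.DataFrame:
    """For each stated probability bucket, compute the empirical match rate."""
    return records.groupby("top_p").agg(
        n_trips=("matches_user_labels", "size"),
        match_rate=("matches_user_labels", "mean"),
    )
```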
I then did some analysis on all confirmed trips across all users in the staging dataset. 2046 of these confirmed trips were fully labeled by users and had the following inference probability distribution:
3155 confirmed trips were fully unlabeled by users and had the following inference probability distribution:
142 confirmed trips were partially labeled by users — i.e., the user filled in some of the labels for the trip but not all of them.
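For concreteness, the fully labeled / partially labeled / fully unlabeled split can be computed along these lines (a sketch; it assumes each confirmed trip has a user_input dict and that the three label types seen in the examples above are the full set):

```python
# The three label types seen in the examples above; assumed to be the full set.
EXPECTED_KEYS = {"mode_confirm", "purpose_confirm", "replaced_mode"}

def labeling_status(user_input: dict) -> str:
    """Classify a confirmed trip's user_input as fully labeled, partially labeled,
    or unlabeled. Sketch only."""
    filled = {k for k in EXPECTED_KEYS if user_input.get(k)}
    if filled == EXPECTED_KEYS:
        return "fully labeled"
    if filled:
        return "partially labeled"
    return "unlabeled"
```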
Here are some graphs visualizing this data:
From the graphs, we see that a significant fraction (85.6%) of the unlabeled trips have no inference at all, more so than for labeled trips (49.5%). There are also more labeled trips with 100% certainty (36.6%) than unlabeled (4.7%). However, aside from these endpoints, the trend is reversed — unlabeled trips tend to cluster towards the middle and upper end of the probability spectrum, whereas labeled trips are more evenly distributed.