e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest

Analysis of label inference confidence #656

Open GabrielKS opened 3 years ago

GabrielKS commented 3 years ago

To be able to make an informed decision on what confidence threshold we should use in the staging test of the new Label UI, I did some analysis of what label inference confidence looked like in the staging data. I first chose an individual user known to label many of their trips. Looking only at labeled trips, I found that the most likely inferences generated (i.e., the label tuple with the highest probability in the inference data structure) fell into 7 buckets. Here, trips with an empty inference data structure were counted as having a probability of 0.

Inferred p: number of trips
1.000: 154
0.667: 3
0.500: 6
0.407: 27
0.364: 12
0.286: 7
0.000: 67

I then compared these stated probability values to the fraction of trips in each bucket for which the inference actually matched the user labels:

Inferred p: fraction correct
1.000: 0.994
0.667: 0.667
0.500: 0.500
0.407: 0.407
0.364: 0.333
0.286: 0.286
0.000: 0.000

Presumably the correspondence here is so close because the clustering algorithm was trained on this same data.
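
For reference, here is a minimal sketch of the bucketing and accuracy check described above. It assumes each confirmed trip is available as a dict with an "inferred_labels" list (entries of the form {"labels": ..., "p": ...}) and a "user_labels" dict; these field names are illustrative, not necessarily the actual e-mission schema.

```python
from collections import defaultdict

def top_inference_p(inferred_labels):
    # An empty inference data structure counts as probability 0, as above.
    return max((entry["p"] for entry in inferred_labels), default=0.0)

def accuracy_by_bucket(labeled_trips):
    # Bucket labeled trips by the probability of their most likely inference,
    # then compute the fraction per bucket whose inference matches the user labels.
    counts = defaultdict(int)
    correct = defaultdict(int)
    for trip in labeled_trips:
        inferred = trip.get("inferred_labels", [])
        p = round(top_inference_p(inferred), 3)
        counts[p] += 1
        if inferred:
            best = max(inferred, key=lambda entry: entry["p"])
            if best["labels"] == trip["user_labels"]:
                correct[p] += 1
    return {p: (counts[p], correct[p] / counts[p])
            for p in sorted(counts, reverse=True)}
```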

I then did some analysis on all confirmed trips across all users in the staging dataset. 2046 of these confirmed trips were fully labeled by users and had the following inference probability distribution:

Inferred p: number of trips (percentage of trips)
1.000: 748  (36.56%)
0.900: 10   (0.49%)
0.875: 8    (0.39%)
0.833: 20   (0.98%)
0.714: 2    (0.10%)
0.667: 15   (0.73%)
0.600: 6    (0.29%)
0.571: 6    (0.29%)
0.556: 7    (0.34%)
0.500: 62   (3.03%)
0.429: 22   (1.08%)
0.407: 27   (1.32%)
0.400: 15   (0.73%)
0.364: 12   (0.59%)
0.333: 10   (0.49%)
0.308: 20   (0.98%)
0.286: 7    (0.34%)
0.269: 21   (1.03%)
0.235: 15   (0.73%)
0.000: 1013 (49.51%)

3155 confirmed trips were fully unlabeled by users and had the following inference probability distribution:

Inferred p: number of trips (percentage of trips)
1.000: 149  (4.72%)
0.900: 8    (0.25%)
0.875: 9    (0.29%)
0.833: 19   (0.60%)
0.714: 1    (0.03%)
0.600: 56   (1.77%)
0.556: 2    (0.06%)
0.500: 67   (2.12%)
0.429: 138  (4.37%)
0.400: 2    (0.06%)
0.333: 1    (0.03%)
0.286: 1    (0.03%)
0.000: 2702 (85.64%)

142 confirmed trips were partially labeled by users — i.e., the user filled in some of the labels for the trip but not all of them.

Here are some graphs visualizing this data (images omitted):
Probability distribution of most-likely inferences, full range
Probability distribution of most-likely inferences, excluding 0 and 1
Probability distribution of most-likely inferences, only 0 and 1

From the graphs, we see that a significant fraction (85.6%) of the unlabeled trips have no inference at all, more so than for labeled trips (49.5%). There are also more labeled trips with 100% certainty (36.6%) than unlabeled (4.7%). However, aside from these endpoints, the trend is reversed — unlabeled trips tend to cluster towards the middle and upper end of the probability spectrum, whereas labeled trips are more evenly distributed.

shankari commented 3 years ago

First, these numbers are likely wrong: although I built the model with a radius of 500 (https://github.com/e-mission/e-mission-server/pull/829/files#diff-5281eebce1b462a2a39465cd785e4f36572ec618a3a79efbdfdf35fc508a9c90R64), I forgot to change the radius for the prediction, which is still 100 (https://github.com/e-mission/e-mission-server/pull/829/files#diff-18a304dace1163481f6faf1cd707af237b2d6766b4508e73caeacb1f51056b48R65).

We should re-analyse after doing that.

shankari commented 3 years ago

To re-run after changing the radius locally, use https://github.com/e-mission/e-mission-server/pull/829#issuecomment-892186091

shankari commented 3 years ago

First, I don't think the results on the labeled data are meaningful, because we are essentially testing the model on its training data. From the aggregate results, focusing only on the unlabeled trips, it seems like we want a threshold of something like 0.4? That would ensure that most trips with inferred labels show them instead of being converted to red labels, while still filtering out the very low-quality inferences.
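
For concreteness, the thresholding under discussion amounts to something like the sketch below; the entry shape follows the inference data structure shown later in this thread, and this is not the actual UI code:

```python
CONFIDENCE_THRESHOLD = 0.4  # the value under discussion; purely illustrative

def label_tuple_to_display(inferred_labels):
    # Show the most likely inferred label tuple only if it clears the threshold;
    # otherwise return None, i.e. leave the trip as a "red label" for the user.
    if not inferred_labels:
        return None
    best = max(inferred_labels, key=lambda entry: entry["p"])
    return best["labels"] if best["p"] >= CONFIDENCE_THRESHOLD else None
```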

shankari commented 3 years ago

Thinking out loud: we have a lot of unlabeled trips with no inferences, and, based on https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-887294253, there is significant variability in the labeling % across users, primarily determined by how many labeled trips we already have.

So we may want to briefly plot a per-user histogram, or focus only on users with more than 50-100 trips, because if people have too few trips, we can tell them that we won't be able to predict.

Or maybe have two different distributions, one for people with > 20% of trips labeled and one for those with < 20% labeled, etc.
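
A minimal sketch of that split, assuming confirmed trips are grouped per user and using the same illustrative field names as the earlier sketch:

```python
def split_users_by_label_rate(trips_by_user, cutoff=0.2):
    # Partition users by the fraction of their confirmed trips that they labeled.
    # trips_by_user: dict mapping user id -> list of confirmed-trip dicts.
    above, below = [], []
    for user_id, trips in trips_by_user.items():
        labeled = sum(1 for trip in trips if trip.get("user_labels"))
        rate = labeled / len(trips) if trips else 0.0
        (above if rate > cutoff else below).append(user_id)
    return above, below
```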

shankari commented 3 years ago

@GabrielKS one challenge with configuring a threshold of 0.4 is the same_mode issue. We used to allow same_mode as an option for the replaced mode, and at least sometimes the percentage drops below 0.4 because of slight differences there - e.g., see below.

[{'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'home',
   'replaced_mode': 'drove_alone'},
  'p': 0.3333333333333333},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'home',
   'replaced_mode': 'same_mode'},
  'p': 0.08333333333333333},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'drove_alone'},
  'p': 0.16666666666666666},
 {'labels': {'mode_confirm': 'drove_alone',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'same_mode'},
  'p': 0.08333333333333333},
 {'labels': {'mode_confirm': 'shared_ride',
   'purpose_confirm': 'home',
   'replaced_mode': 'drove_alone'},
  'p': 0.25},
 {'labels': {'mode_confirm': 'shared_ride',
   'purpose_confirm': 'shopping',
   'replaced_mode': 'drove_alone'},
  'p': 0.08333333333333333}]

Remapping same_mode to the same value as mode_confirm should solve that problem.
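
A minimal sketch of what that remapping would do to the distribution above (the function name and the merging step are illustrative, not the server code):

```python
from collections import defaultdict

def remap_same_mode(inferred_labels):
    # Replace a replaced_mode of "same_mode" with the trip's mode_confirm value,
    # then merge the probabilities of label tuples that become identical.
    merged = defaultdict(float)
    for entry in inferred_labels:
        labels = dict(entry["labels"])
        if labels.get("replaced_mode") == "same_mode":
            labels["replaced_mode"] = labels["mode_confirm"]
        merged[tuple(sorted(labels.items()))] += entry["p"]
    return [{"labels": dict(key), "p": p} for key, p in merged.items()]

# Applied to the example above, the top tuple becomes
# drove_alone / home / drove_alone with p = 0.333... + 0.083... ≈ 0.417,
# which clears a 0.4 threshold.
```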

I removed the same_mode label from the UI on June 5th https://github.com/e-mission/e-mission-phone/commit/1806e6c72e9f5262c03bec230ce12971296ffd0f#diff-8f7e0cbf2ba6bd210c65bfcac14614c6fabfd3bd95b99f6d2974c615ddcef159

So none of the actual participants would ever have selected same_mode.

But we should handle it on the staging server before tuning.

GabrielKS commented 3 years ago

The Jupyter Notebook I used for analysis is now pushed (without the results) to https://github.com/GabrielKS/e-mission-eval-private-data/tree/inference_confidence_analysis.

GabrielKS commented 3 years ago

Redoing the inference myself with the existing 100m radius threshold, I get slightly different numbers, perhaps because there were trips added to the dataset after inference had been run:

Probability distribution of all fully labeled:
Probability: number of trips (percentage of trips)
{
  1.000: 718  (37.20%)
  0.900: 10   (0.52%)
  0.875: 8    (0.41%)
  0.833: 20   (1.04%)
  0.714: 2    (0.10%)
  0.667: 12   (0.62%)
  0.571: 6    (0.31%)
  0.556: 7    (0.36%)
  0.500: 63   (3.26%)
  0.429: 12   (0.62%)
  0.407: 27   (1.40%)
  0.400: 25   (1.30%)
  0.367: 20   (1.04%)
  0.333: 22   (1.14%)
  0.300: 21   (1.09%)
  0.286: 7    (0.36%)
  0.235: 15   (0.78%)
  0.000: 935  (48.45%)
}

Probability distribution of all fully unlabeled:
Probability: number of trips (percentage of trips)
{
  1.000: 89   (3.40%)
  0.900: 8    (0.31%)
  0.875: 9    (0.34%)
  0.833: 19   (0.73%)
  0.714: 1    (0.04%)
  0.556: 2    (0.08%)
  0.500: 66   (2.52%)
  0.429: 57   (2.18%)
  0.400: 83   (3.17%)
  0.333: 1    (0.04%)
  0.286: 1    (0.04%)
  0.000: 2284 (87.18%)
}

Changing the 100m radius threshold to 500m, I get significantly different results:

Probability distribution of all fully labeled:
Probability: number of trips (percentage of trips)
{
  1.000: 751  (39.69%)
  0.917: 12   (0.63%)
  0.900: 10   (0.53%)
  0.875: 8    (0.42%)
  0.833: 24   (1.27%)
  0.800: 10   (0.53%)
  0.786: 14   (0.74%)
  0.778: 9    (0.48%)
  0.714: 7    (0.37%)
  0.700: 10   (0.53%)
  0.692: 13   (0.69%)
  0.667: 45   (2.38%)
  0.600: 15   (0.79%)
  0.583: 23   (1.22%)
  0.571: 6    (0.32%)
  0.556: 27   (1.43%)
  0.545: 11   (0.58%)
  0.531: 23   (1.22%)
  0.500: 179  (9.46%)
  0.455: 11   (0.58%)
  0.452: 42   (2.22%)
  0.429: 21   (1.11%)
  0.419: 42   (2.22%)
  0.407: 27   (1.43%)
  0.400: 30   (1.59%)
  0.393: 159  (8.40%)
  0.375: 8    (0.42%)
  0.367: 30   (1.59%)
  0.360: 25   (1.32%)
  0.333: 78   (4.12%)
  0.312: 32   (1.69%)
  0.300: 40   (2.11%)
  0.294: 17   (0.90%)
  0.286: 7    (0.37%)
  0.250: 58   (3.07%)
  0.235: 15   (0.79%)
  0.222: 6    (0.32%)
  0.200: 18   (0.95%)
  0.000: 29   (1.53%)
}

Probability distribution of all fully unlabeled:
Probability: number of trips (percentage of trips)
{
  1.000: 172  (7.06%)
  0.917: 1    (0.04%)
  0.900: 8    (0.33%)
  0.875: 9    (0.37%)
  0.833: 20   (0.82%)
  0.800: 4    (0.16%)
  0.786: 17   (0.70%)
  0.750: 3    (0.12%)
  0.714: 11   (0.45%)
  0.667: 2    (0.08%)
  0.583: 3    (0.12%)
  0.556: 12   (0.49%)
  0.531: 8    (0.33%)
  0.500: 53   (2.18%)
  0.452: 6    (0.25%)
  0.429: 66   (2.71%)
  0.419: 76   (3.12%)
  0.400: 102  (4.19%)
  0.393: 2    (0.08%)
  0.360: 5    (0.21%)
  0.333: 75   (3.08%)
  0.300: 1    (0.04%)
  0.294: 4    (0.16%)
  0.286: 1    (0.04%)
  0.250: 35   (1.44%)
  0.000: 1740 (71.43%)
}

Graphs for 500m (images omitted):
Probability distribution of most-likely inferences, full range
Probability distribution of most-likely inferences, excluding 0 and 1
Probability distribution of most-likely inferences, only 0 and 1

Many more unlabeled trips (544 more) now have some sort of inference, and the cutoff at 0.4 is no longer nearly as sharp.

shankari commented 3 years ago

From this, it looks like the threshold should be 0.25, but then we would basically not exclude anything. Do you get different results on a per-user basis, or when looking only at users with lots of trips?

shankari commented 3 years ago

wrt "But we should handle it on the staging server before tuning":

The obvious fix would be to change the inputs in the database and re-run the pipeline. An alternate solution would be to fix it in the code, but that would require special-casing the handling of the replaced mode instead of working with user inputs generically.

shankari commented 3 years ago

There is existing code to map a same_mode replaced mode (https://github.com/e-mission/e-mission-server/pull/829/files#diff-c7ece2e6b65a06d6fd262e2ca047f676b4050b865a1b2a1b3f91a85b72ca5460R48), but unfortunately it is only called from the second round for now. And of course, it is specific to the replaced mode. We could re-introduce that for now instead of modifying the user inputs.

GabrielKS commented 3 years ago

Here is that middle graph if we only consider users who have labeled at least 10 or 20 trips (15/39 = 38% of users; there are no users who have labeled between 10 and 20 trips):
(graph: users who labeled at least 10 or 20 trips)

If we only consider those who have labeled at least 50 trips (9/39 = 23% of users):
(graph: users who labeled at least 50 trips)

I don't see that this suggests any obvious way forward.

shankari commented 3 years ago

@GabrielKS here is a back-of-the-envelope estimate of the difference: https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-892420975

Some actual values are:

    cluster_label  before_unique_combo_len  after_unique_combo_len  before_max_p  after_max_p
25              0                        10                       9      0.423077     0.615385
26              1                         5                       4      0.545455     0.727273
27              2                         6                       4      0.4          0.5
29              4                         6                       5      0.166667     0.333333
32              7                         2                       1      0.666667     1

shankari commented 3 years ago

@GabrielKS If the out-and-back errors are common, both on staging and on the real deployments, I think I might be able to come up with a way to fix at least that pattern automatically. But it will take me ~ 3-4 days with no time for anything else.

Detecting the pattern (as opposed to fixing it) is much easier - I've basically already implemented it.
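
For context, here is a minimal sketch of what detecting the geometric pattern might look like, assuming "out-and-back" means a pair of consecutive trips where the return roughly retraces the outbound trip; the field names and the 500m matching radius are assumptions, and this is not the implementation referred to above:

```python
import math

def haversine_m(loc1, loc2):
    # Great-circle distance in meters between (longitude, latitude) pairs.
    lon1, lat1, lon2, lat2 = map(math.radians, (*loc1, *loc2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def looks_like_out_and_back(trip_out, trip_back, radius_m=500):
    # True if trip_back starts roughly where trip_out ended and ends roughly
    # where trip_out started, within the matching radius.
    return (haversine_m(trip_out["end_loc"], trip_back["start_loc"]) <= radius_m
            and haversine_m(trip_out["start_loc"], trip_back["end_loc"]) <= radius_m)
```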

I think it might be worthwhile to work this into the expectation code somehow, maybe to mark trips as "double check". Let's discuss at today's meeting.

GabrielKS commented 3 years ago

The middle graph with a threshold of 50 trips, after the same_mode mapping was re-enabled: (image omitted)

shankari commented 3 years ago

Although this is going to make the threshold meaningless at this point, I think we should go with 25% as the threshold, i.e., show all the trips. This is because although the number of trips affected is low, the number of trips for which we have inferences at all is also low. I think it is more important to give people the sense that we are doing something than to be perfectly accurate. We can always tune this after the first two weeks if needed, although analysing those results will be future work.

shankari commented 3 years ago

Just for the record, the values here don't seem to change much: https://github.com/e-mission/e-mission-docs/issues/656#issuecomment-893024116

but when I did the side-by-side comparison (with boxplots), I got a pretty significant change: https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-892724090

One difference between the two is that @GabrielKS is looking at the matched inferences, while I am looking at the clusters in the model. So maybe this is skewed by the fact that there aren't a lot of matches?

shankari commented 3 years ago

So using the actual inferences allows us to look at what the impact on this set of users and this trip history would be. Looking at the clusters directly shows what could happen if we had better matching.

(image omitted)

But that also seems to argue for something between 20% and 40%, so I am happy with 25%.