@corinne-hcr @GabrielKS
Initial results with DBSCAN based trip clustering (https://github.com/e-mission/e-mission-eval-private-data/pull/28/commits/c7e8205353daba174833057c60af8d4b4df83844)
So we can guess that most users should be prompted less than ~50% of the time.
Definitely, the minipilot data was harder to work with, but its median is also below 40%. As expected, NREL is best, but staging is also respectable. I don't see any reason why we should only have been able to build models for three users on staging.
Comparison between DBSCAN and similarity for one user is complete: https://github.com/e-mission/e-mission-eval-private-data/pull/28/commits/18144ef4d1b839196f6b0e3dba3756f2789aacd6
Next step is to incorporate this back into the generalization across datasets and see if the results are generally relevant.
@corinne-hcr with https://github.com/corinne-hcr/e-mission-server/pull/2/commits/7a759905d5a4703cdd27fee14c766e21b06fc85c, https://github.com/corinne-hcr/e-mission-server/pull/2/commits/58a14a8e7d53cd616affaa52eed84a079e9cde0d and https://github.com/corinne-hcr/e-mission-server/pull/2/commits/d81d25b95a0d7e82c7f8b417b828375d3e08e22d
all the notebooks in this repo are runnable.
@corinne-hcr Tradeoffs for various combinations of similarity parameters and radii for the mini-pilot are done: https://github.com/e-mission/e-mission-eval-private-data/pull/28/commits/abf4f78b417de7190cd35b8e3e1877df938ee1be
IMHO, this shows the differences more clearly than the scatter plot.
Top: request_pct; Bottom: homogeneity score; L-R: 100m, 300m, 500m
I'm going to integrate this into the dataset comparison before poking around with this some more.
Other analyses I would like to do are:
Generalization results:
However, the cluster_trip_pct that I had defined earlier still shows a significant difference. I need to understand it better and figure out why it is different and why it doesn't capture the true metric.
I will briefly attempt to figure that out tomorrow morning, but based on these results, we can stick to a single level of "clustering", use a 500m radius, and not filter or delete bins. I will attempt to make the changes to the server code tomorrow, but if @corinne-hcr has finished her work, maybe that is the next task for her to tackle.
Some more multi-dataset results, including an exploration on the number of trips required to be meaningful, and an explanation of the cluster trip ratio v/s request pct discrepancy.
Results from: https://github.com/e-mission/e-mission-eval-private-data/pull/28/commits/31733c1ee00375f3ca7b795c2937179ae31d79ad
I was surprised that the homogeneity score of DBSCAN was so low, and then I realized that I was computing it incorrectly.
Basically, I was just passing in the labels from DBSCAN as the predicted labels, but all the noisy trips have the same label (-1), instead of separate labels, one for each noisy trip. This is likely the reason why the scores are lower.
For example, consider the case in which we have two clusters of length 2 each, and 4 single trip clusters. If all the single trip clusters are labeled with -1 for noise, we will end up with
>>> sm.homogeneity_score([1,1,2,2,3,4,5,6], [0,0,1,1,-1,-1,-1,-1])
0.5999999999999999
because it looks like the -1 predicted cluster actually munges entries from 4 different ground truth clusters.
If we replace them with unique cluster labels, we get a perfect score, as expected.
>>> sm.homogeneity_score([1,1,2,2,3,4,5,6], [0,0,1,1,2,3,4,5])
1.0
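For reference, here is a minimal sketch of that fix (my own illustration, not code from the PR): give each DBSCAN noise point its own unique label before scoring, so that single-trip clusters are not all lumped into one predicted cluster.

```python
from sklearn import metrics as sm

def relabel_noise(labels_pred):
    # Replace every DBSCAN noise label (-1) with a fresh, unique cluster label
    # so that each noisy trip counts as its own single-trip cluster.
    next_label = max(labels_pred) + 1
    relabeled = []
    for label in labels_pred:
        if label == -1:
            relabeled.append(next_label)
            next_label += 1
        else:
            relabeled.append(label)
    return relabeled

labels_true = [1, 1, 2, 2, 3, 4, 5, 6]
labels_pred = [0, 0, 1, 1, -1, -1, -1, -1]
print(sm.homogeneity_score(labels_true, relabel_noise(labels_pred)))  # 1.0
```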
I am almost certainly not going to use DBSCAN in the integrated pipeline, and that is my current priority, so I do not plan to fix this now. But if @corinne-hcr wants to write a paper, maybe she can fix it here?
@corinne-hcr to understand this PR, and to compare it with your prior work, I would use the following order:
(minipilot), this should be similar to your original exploration of the similarity code: https://github.com/e-mission/e-mission-eval-private-data/pull/28/commits/abf4f78b417de7190cd35b8e3e1877df938ee1be
You will not be able to absorb the results just by looking at the code. You need to check out this branch and actually run the notebooks in the commits above. The notebooks have inline explanations (they are "notebooks" after all) of what I'm trying to understand and what the graphs mean. Most of them are "unrolled" so that I first try one option and then another, so that you can see the evolution of the analysis.
For the last, multi-dataset notebook, you will need the combined dataset JSON file. I have shared that with you through OneDrive.
Please let me know if you have any high-level questions.
For the box plots titled `num labeled trip`, what do the y-axes represent? Are they request pct?
you should be able to see this from the code - e.g. https://github.com/e-mission/e-mission-eval-private-data/pull/28/files#diff-5b27f01eda7481b2844df59e55fffa030ca22df0c58f9796b8d1bb7ad13b1089R679
Again, you need to check out this branch and actually run the notebooks in the commits above. The notebooks have inline explanations (they are "notebooks" after all) of what I'm trying to understand and what the graphs mean. There are additional plots in the notebooks.
The plots are not designed to be published without modification - they were primarily designed for me to understand what was going on so I could figure out how to modify the evaluation pipeline. If you choose to use any of them, you will need to ensure that all the labels are in place and the font sizes are appropriate.
The nrel-lh dataset is from 4 NREL employees who voluntarily collected and labeled their data. The staging dataset is from ~ 30 program staff for the CEO full pilot who helped with testing the platform before deployment. We have/should have demographic information for all of those datasets as well from the onboarding survey.
@corinne-hcr from looking at your code (`get_first_label` and `score`), it looks like you calculate the homogeneity score only of the trips that are in the bins. So if I have 10 trips with the following bins/clusters `[[0,1,2,3],[4,5,6],[7],[8],[9]]` before cutoff, and the following bins/clusters `[[0,1,2,3],[4,5,6]]` after cutoff, I believe the labels_pred will be `[[0,0,0,0], [1,1,1]]`; is that correct?
But you compute the request percentage taking the full, non-cutoff list into account; you have a request for each of `[7],[8],[9]`.
Can you discuss the reason for that mismatch further? We should standardize on a really clear definition of the metric calculations because otherwise we don't know what the metrics mean!
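To make the mismatch concrete, here is a quick sketch (my illustration using the toy bins above, not code from the repo) of the two computations; the request percentage below follows the "one request per bin/cluster" definition used earlier in this thread.

```python
from sklearn.metrics import homogeneity_score

# Toy example from above: 10 trips, binned before and after the cutoff.
bins_all = [[0, 1, 2, 3], [4, 5, 6], [7], [8], [9]]
bins_above_cutoff = [[0, 1, 2, 3], [4, 5, 6]]
n_trips = 10

# h-score computed only over the trips that survive the cutoff.
labels_true = [0, 0, 0, 0, 1, 1, 1]  # assume ground truth agrees with the bins
labels_pred = [b for b, trips in enumerate(bins_above_cutoff) for _ in trips]  # [0,0,0,0,1,1,1]
print(homogeneity_score(labels_true, labels_pred))  # 1.0

# Request percentage computed over the full, non-cutoff bin list:
# one request per bin, including one each for [7], [8], [9].
print(len(bins_all) / n_trips)  # 0.5
```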
I actually raised that question in the first term, but at this point I don't remember your explanation clearly. Let's see if there are some records.
So I actually implemented both alternate metrics (below cutoff as single trip clusters, and drop trips below cutoff) https://github.com/e-mission/e-mission-eval-private-data/pull/28/commits/e45754599627f67bd60db33fdbac3a1e65c7af54
The first result is pretty much identical to no_cutoff; the second is pretty much identical to the old metric.
Also please see my discussion around h-score and request pct in the notebook (https://github.com/e-mission/e-mission-eval-private-data/pull/28/commits/e45754599627f67bd60db33fdbac3a1e65c7af54). Maybe we should compute the request_pct only on a split and not on the whole dataset.
@corinne-hcr @GabrielKS, before making the changes to the evaluation pipeline, I did some super hacky analysis to compare the old models with the old settings against my new, preferred settings. I had to copy over the `tour_model` directory into `tour_model_first_only` and make other cringeworthy changes so that I could run the two modules side by side.
Here are the results:
Due to a combination of the trip filtering, user validity checking, training on a split, and filtering the trips below the cutoff, the old model had few or no trips to train on. This is particularly true for the later entries, which are from the newer datasets. Sorry not sorry, I didn't have time to color code by dataset. This is a big deal for the location based clustering algorithms, which essentially take a nearest neighbor approach. If a particular bin is not included in the model, it will never be matched.
Note that with the old model, because we have so few trips in the model, we cannot even infer labels for already labeled trips. In contrast, with the new model, at least all existing trips are labeled.
| | program | valid_trip_count_new | unlabeled_predict_pct_old | unlabeled_predict_pct_new |
|---|---|---|---|---|
| 11 | minipilot | 0 | nan | 0 |
| 18 | nrel-lh | 6 | nan | 0 |
| 20 | nrel-lh | 0 | nan | 0 |
| 21 | nrel-lh | 17 | nan | 0.145038 |
| 35 | stage | 3 | nan | 0.0597015 |
| 36 | stage | 6 | nan | 0.176471 |
| 38 | stage | 0 | nan | 0 |
| 39 | stage | 28 | 0 | 0.117647 |
| 40 | stage | 0 | nan | 0 |
| 41 | stage | 9 | nan | 0.0814815 |
| 43 | stage | 0 | nan | 0 |
| 44 | stage | 0 | nan | 0 |
| 47 | stage | 1 | nan | 0 |
| 58 | stage | 48 | nan | 0.177778 |
| 59 | stage | 0 | nan | 0 |
| 62 | stage | 0 | nan | 0 |
I'm now going to move this code from the notebook into the server code. Will clean up the notebook code and submit on Wednesday.
I think I read through the notebooks, but there are still some questions.
https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-887125096
Which one is the `first result`? Compare with `no_cutoff` from which metric?
Which one is the `second result`?
The `Old implementation` is the one where you assign -1 and -2 for the noise and too_short trips, right?
The `Old implementation` is the one where you assign -1 and -2 for the noise and too_short trips, right?
Yes
Which one is the first result? Compare with no_cutoff from which metric? Which one is the second result?
From my comment
So I actually implemented both alternate metrics (below cutoff as single trip clusters, and drop trips below cutoff) The first result is pretty much identical to no_cutoff, the second is pretty much identical to the old metric
So: the first result = below cutoff as single trip clusters; the second result = drop trips below cutoff.
`same_mode` replaced mode mapping; x-axis: user; y-axis: number of affected clusters; total across users: 21 clusters
Given that there are only 15 users with substantial numbers of trips (https://github.com/e-mission/e-mission-docs/issues/656#issuecomment-892329913), this is not too shabby. Especially since there doesn't appear to be any downside.
Another viz for the same data.
In a full boxplot, the 1.0 max_p dominates, making it hard to see the difference.
But if we only pick the values that were not 1 (label_result_df.query("before_max_p < 1")), the difference is more visible.
For the record, the median line is not visible in after_max_p because the 25th and 50th percentiles are the same.
count 122.000000
mean 0.536492
std 0.169980
min 0.125000
25% 0.500000
50% 0.500000
75% 0.666667
max 1.000000
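For context, the summary above can be reproduced along these lines (a sketch with a synthetic stand-in for the notebook's label_result_df; the real frame comes from the analysis notebook):

```python
import pandas as pd

# Synthetic stand-in for the notebook's label_result_df (illustration only).
label_result_df = pd.DataFrame({
    "before_max_p": [1.0, 0.5, 0.25, 1.0, 0.5, 0.667],
    "after_max_p":  [1.0, 0.5, 0.50, 1.0, 0.5, 0.750],
})

# The 1.0 max_p values dominate the full boxplot, so filter them out first.
uncertain = label_result_df.query("before_max_p < 1")
print(uncertain["before_max_p"].describe())
uncertain[["before_max_p", "after_max_p"]].plot(kind="box")
```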
May I know which part is the explanation for using oursim?
you mean versus DBSCAN? As you can see from https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-885960132 the comparison between oursim and DBSCAN for one user is at https://github.com/e-mission/e-mission-eval-private-data/commit/18144ef4d1b839196f6b0e3dba3756f2789aacd6
Generalizing that to multiple datasets is in the standard `Explore_multiple_datasets` notebook.
@corinne-hcr @GabrielKS I've been talking with both of you separately on evaluation, but wanted to put my thoughts down here so that we could all be on the same page. I've been going over this with @corinne-hcr for a while, but I don't feel like we were coming to any consensus, and I wanted to spend some of my own time thinking about it.
Feedback highlighting inconsistencies appreciated!
Part of the reason I think we have been struggling with this is because we have been conflating evaluation of cluster quality and prediction quality.
When @corinne-hcr did her original evaluation, we used two metrics:
The reason for calling it the request % is that we assumed that we would have to ask the user once for each cluster and then assign that label forever going forward. In that case the number of unique clusters = number of requests.
The intuition is that these could capture the tradeoff inherent in clustering - larger clusters would reduce the number of requests but decrease the h-score, and smaller clusters would increase the homogeneity but at the cost of increased requests. I will note that the literature has a metric called "completeness score" which seems similar to the request % in terms of being better for larger clusters, but we felt that tying the metric more explicitly to our problem definition was a good idea.
However, with the modified system design, the request % is not necessarily a representation of the number of requests. Depending on the expectation configuration, we may ask users to label multiple trips for the same cluster. A better term, and one that I suggest we use going forward, is the cluster_to_trip_ratio or cluster_to_trip_pct, which is really what it is. We could also use compactness score, but I am concerned that then people will be confused with the completeness score 😄
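As a concrete (hypothetical) sketch of the two cluster-quality metrics described above:

```python
from sklearn.metrics import homogeneity_score

def cluster_to_trip_pct(cluster_labels):
    # Number of clusters divided by number of trips: lower means larger clusters,
    # which (under the old one-request-per-cluster assumption) means fewer requests.
    return len(set(cluster_labels)) / len(cluster_labels)

# Hypothetical example: 8 trips; the second cluster mixes two ground truth labels.
labels_true = [1, 1, 1, 1, 2, 2, 3, 3]
labels_pred = [0, 0, 0, 0, 1, 1, 1, 1]

print(homogeneity_score(labels_true, labels_pred))  # ~0.67: larger clusters hurt the h-score
print(cluster_to_trip_pct(labels_pred))             # 0.25: but the cluster_to_trip_pct improves
```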
Couple of notes about these metrics:
We may want to display these metrics to the user, along with the % of labeled trips, to give them an indication of how well the model is working for their travel - explainability!
The cluster quality is not the same as the prediction quality. We could have perfect clusters, but if they don't match, they are not useful. I can think of three metrics for evaluating prediction quality:
predicted_trip_pct with unlabeled trips: n_predicted / n_trips. This is not strictly required, since the standard accuracy metrics should also cover trips without predictions (they will not match). However, this is a way to generate a useful metric on a much larger and arguably more relevant dataset since we don't have to split off a separate test set.

Thanks for the clear explanation of homogeneity score vs. request percent/cluster to trip ratio; I was feeling like I needed that.
If we display these metrics to the user, we should spend some time thinking through how to do that without overwhelming them with technical information.
I think adjusting confidence by sample size could be an easy way to deal with the last issue described, and I agree it would be easy to write about concisely. I'm not entirely clear on what a second round of modeling would otherwise consist of, and it makes sense to save that for future work.
Discussed with @shankari, here are some questions and answers:
`precision_score` and `recall_score` matter. In the real world, y_true should be a dictionary.
Adding additional detail instead of a summary that will not make sense in 6 months without context:
Questions from @corinne-hcr:
Answers from @shankari
What do you mean by perfect clusters?
1.0 v-score. All trips with the same labels are in one and only one cluster
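For example (my own sketch), a clustering where each label maps to exactly one cluster gets a 1.0 v-score:

```python
from sklearn.metrics import v_measure_score

# Every set of trips with the same label forms exactly one cluster,
# so homogeneity and completeness are both 1.0, and so is the v-score.
labels_true = ["home->work", "home->work", "work->gym", "work->gym", "gym->home"]
labels_pred = [0, 0, 1, 1, 2]
print(v_measure_score(labels_true, labels_pred))  # 1.0
```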
How do you define prediction quality?
How well the model predicts trips that are not in the model
From @corinne-hcr
>>> from sklearn.metrics import precision_score
>>> y_true = ['car', 'bike', 'walk', 'car', 'bike', 'walk']
>>> y_pred = ['car', 'walk', 'bike', 'car', 'car', 'bike']
>>> precision_score(y_true, y_pred, average=None)
array([0. , 0.66666667, 0. ])

>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average=None)
array([0.66666667, 0. , 0. ])

>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [1, 3, 2, 1, 1, 2]
>>> precision_score(y_true, y_pred, average=None)
array([0. , 0.33333333, 1. , 0. ])
So, if we are using precision/recall score, we can only use the true label tuple, not the numeric labels assigned by the indices (we would probably have different numeric labels for the same label tuple).
From @shankari
I explicitly said that we should get the labels for each trip as the predicted labels. I am not sure what you mean by "numeric labels"; if you are talking about cluster labels, I also explicitly said that, for the prediction metrics, we don't care about clusters since that is an implementation detail of the model.
From @corinne-hcr
Some more questions.
"Note that trips without predictions will not match" Do you mean trips that cannot find a matching cluster will not have user labels? In that case, how to do with labels_pred?
"Presumably we would take the highest probability option as the true label." If we need to assign labels to be labels_true, why do we need to split the data? Are you saying we should take the highest probability labels as the labels_pred? I originally didn't focus much on the prediction part since I just return possible user labels and their p, but not really predict a specific user labels combination for the trip.
You mention 2 metrics for Evaluating prediction quality, I had a same idea as the 2nd one. Are we using both of them or just one of them?
From @shankari
in that case, labels_pred will be {}. So it won't match labels_true
that was awkwardly phrased. "Presumably we would take the highest probability option as the final predicted label"
I put this in the writeup "predicted_trip_pct with unlabeled trips: n_predicted / n_trips This is not strictly required, since the standard accuracy metrics should also cover trips without predictions (they will not match). However, this is a way to generate a useful metric on a much larger and arguably more relevant dataset since we don't have to split off a separate test set." What part of this is not clear?
From @corinne-hcr
>>> from sklearn.metrics import precision_score
>>> y_true = ['car', 'bike', 'walk', 'car', 'bike', 'walk']
>>> y_pred = ['{}', 'walk', 'bike', 'car', 'car', 'bike']
>>> precision_score(y_true, y_pred, average=None)
array([0. , 0.5, 0. , 0. ])
If the labels_pred is {}, the score is not correct. It is not giving the result as 1/3.
From @shankari
From @corinne-hcr
tp = 1
tp/(tp+fp), isn't it 1/3?
oh, it looks like 1/2
Oh, it seems the result is correct
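A quick check of that arithmetic (my sketch, not part of the original thread): for the 'car' class in the example above, there is one true positive (index 3) and one false positive (index 4, true label 'bike'), so the precision is 1/2.

```python
# Verify the 'car' precision by hand for the example with a '{}' placeholder prediction.
y_true = ['car', 'bike', 'walk', 'car', 'bike', 'walk']
y_pred = ['{}', 'walk', 'bike', 'car', 'car', 'bike']

tp = sum(1 for t, p in zip(y_true, y_pred) if p == 'car' and t == 'car')  # 1
fp = sum(1 for t, p in zip(y_true, y_pred) if p == 'car' and t != 'car')  # 1
print(tp / (tp + fp))  # 0.5, matching the 'car' entry of the precision_score output
```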
There are some more questions:
Responses:
The issues with mislabeled ground truth are true across the board. They apply not only to the h-score, but also to the precision and recall metrics. So our options are:
For which part? The proposal is to use the h-score for the cluster quality. Not sure why we would want to use unlabeled trips to evaluate the clusters. Can you elaborate?
"characterize the noise": we are saying that we have lots of noisy trips. characterizing the noise would involve showing the amount of noise, potentially using a chart
in the final comparison across datasets, I picked the `no_filter`, `no_cutoff` option. But there are other notebooks in which I explored those options as well. So I don't agree with the statement "the inappropriate point for including them for evaluation". We have to first evaluate those (as I did in the other notebooks) to make the argument for the final pick.
Also, the notebooks were an aid for me to convince myself about the settings for the final system. That doesn't preclude having a more extensive evaluation, even across all datasets, to put into a report or a peer reviewed publication.
Just to summarize the answer, please let me know if I understand it incorrectly:
For `no_filter`, `no_cutoff`, first show the comparison boxplot, then explain that keeping the noise enables us to have more predicted trips.
The comparison boxplot is something like https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-887125096. But currently the boxplots show the h-score in different situations. I am not sure if we need to use the boxplot for h-score or for request% or for cluster_to_trip_pct. I think we can use the boxplots that treat trips below cutoff as single trip clusters instead of putting all three situations in the paper. Treating trips below cutoff as single trip clusters meets our current design. Also, discussing different metrics (filter/no filter, cutoff/no cutoff) in one situation is clearer.
@corinne-hcr those boxplots are for the h-score, which are used to evaluate cluster quality. My point was that I don't think you can use them to evaluate prediction quality. The graphs to evaluate the prediction quality are https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-887294253 but I don't think I made boxplots for them.
I think we can use the boxplots that treat trips below cutoff as single trip clusters instead of putting all three situations in the paper. Treating trips below cutoff as single trip clusters meets our current design. Also, discussing different metrics (filter/no filter, cutoff/no cutoff) in one situation is clearer.
First, filter/no filter, etc are not different metrics. They are different configurations, and we want to use metrics such as the h-score and the cluster-to-trip-ratio to understand them. Second, can you clarify what you mean by "discussing metrics ... in one situation"? What would be the rough structure of such a discussion? To me, the boxplots (or similar graphs) are the easiest way of comparing the configurations, but I'm open to hearing other concrete suggestions!
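As a rough sketch (my illustration, with made-up numbers) of the kind of configuration comparison boxplot being discussed:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-user h-scores under two of the configurations being compared.
scores = pd.DataFrame({
    "config": ["no_filter_no_cutoff"] * 3 + ["filter_cutoff"] * 3,
    "h_score": [0.90, 0.85, 0.95, 0.70, 0.75, 0.80],
})

# One box per configuration makes the comparison easy to read.
scores.boxplot(column="h_score", by="config")
plt.suptitle("")
plt.show()
```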
Right. The graphs for prediction quality are only those you put in there. The boxplot for prediction is not made yet. I was saying that we had three situations: old implementation, treat below cutoff as single label clusters, drop below cutoff. I think we can just use the one that treats below cutoff as single label clusters. Under this situation, the boxplot shows the h-score or cluster-to-trip-ratio for filter/no filter, cutoff/no cutoff. Then we can say we decided to use no filter and no cutoff in order to keep more trips.
I just checked the way you compute the h-score. I don't think you have `na` in `labels_true` for the h-score tuple, but you use dropna before calculating the score. And we need to change the cluster labels for trips with `-1`. Could you check that again in case I made some mistake?
I don't think you have `na` in `labels_true` for the h-score tuple, but you use dropna before calculating the score

Using `dropna` doesn't hurt anything if we don't have any N/A. I added that because if we build separate models for the individual labels, we can have N/A. I think I poked at it a little bit in that notebook as well
https://github.com/e-mission/e-mission-eval-private-data/pull/28/files#diff-b53e99b317a902a7e95b56e64c541a6a68dafc348bef0ee5111d325dc0617bf1R677
but gave up exploring in detail due to lack of time.
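A minimal sketch of what that looks like (my illustration, not the notebook code): dropna is a no-op when every trip has a label, but it protects the h-score computation when per-label models leave some trips with N/A.

```python
import pandas as pd
from sklearn.metrics import homogeneity_score

# Hypothetical per-trip frame: with separate models for the individual labels
# (e.g. mode vs. purpose), some trips can end up with N/A in labels_true.
score_df = pd.DataFrame({
    "labels_true": ["home->work", "home->work", None, "work->home"],
    "labels_pred": [0, 0, 1, 2],
})

valid = score_df.dropna()  # no-op if there are no N/A values
print(homogeneity_score(valid["labels_true"], valid["labels_pred"]))  # 1.0 here
```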
I think we can just use the one that treats below cutoff as single label clusters.
I am not convinced by this because in that case, as I said in the notebook, there is effectively no difference in this metric between a similarity instance that drops trips below cutoff and one that does not. The metric is not meaningful to show anything important about the cluster quality.
Please see the results for re-introducing the same mode https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-892420975 and https://github.com/e-mission/e-mission-eval-private-data/pull/28#issuecomment-892724090
The related commit is https://github.com/corinne-hcr/e-mission-server/commit/97921a93de796cb44ac3ea571a6a25d8e0684e14
Last few commits before we close out this PR:
- `Compare user mode mapping effect with outputs.ipynb`: plots what happens if we change replaced mode = "same mode" to replaced mode = <actual mode>. The effect is not very high, and we have removed the "same mode" option now, so I am not sure how useful this is over the long term.
- `Explore sim usage (common trips -> labeling) unrolled-outputs.ipynb`: the existing "Explore sim usage unrolled" notebook with embedded outputs, to give people a sense of what the expected outputs are. Not sure why I have the outputs only for this notebook; can remove if we can reproduce.
- `Exploring basic datasets for model building validity*.ipynb`: comparing the effect of the old (`emission/analysis/modelling/tour_model`) and new (`emission/analysis/modelling/tour_model_first_only`) models. We have not used the old models in production for a while, but keeping these around for demonstrating the effect, and as a template for evaluating the next round of model improvements.

@hlu109 I have committed all the pending changes on my laptop and moved the obsolete analyses out. I am now merging this.
I would suggest running through my notebooks here to understand the analysis step by step.