Closed: corinne-hcr closed this 3 years ago
@corinne-hcr this works for me.
--- a/tour_model_eval/trips_in_bins_and_clusters.ipynb
+++ b/tour_model_eval/trips_in_bins_and_clusters.ipynb
@@ -234,7 +234,7 @@
"# cluster the data using k-means\n",
"# def cluster(data, bins)\n",
"# return feat.clusters, feat.labels, feat.data\n",
- "clusters,labels,feat_data = pipeline.cluster(bin_trips,bins)\n",
+ "clusters,labels,feat_data = pipeline.cluster(bin_trips,len(bins))\n",
"clusters"
]
},
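For readers following along, a minimal hypothetical sketch (not the real emission code) of why the second argument changed: per the commented signature `cluster(data, bins)`, the function expects the bin *count* (later used as `1.5 * bins` for `max_clusters`), not the bin list itself.

```python
# Hypothetical illustration of the fix: pass the bin count, not the bin list.
bins = [[9, 14, 23], [10, 15, 24], [5, 18, 77]]  # made-up bin contents

n_bins = len(bins)                 # what cluster(data, bins) expects
max_clusters = int(1.5 * n_bins)   # how the count is used downstream

print(n_bins, max_clusters)  # → 3 4
```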
@corinne-hcr I've made the change to call the cluster code. Do you need me to work on the plotting as well, or can you take it from here?
OK, I think I can take it from here
I have a high level comment on your code structure which you need to address as well. Let me know if I should add some sample code for that as well.
Otherwise, push early and often!
I have a question: how can you modify a file that you haven't merged? I can now change my file the way you do, but if I then commit the file, there will be duplicate modifications.
you can pull from your branch (`origin <branch>`)
I can't plot the cluster graph. I can't access the points through `pipeline.cluster`; I have to run `featurization.featurization(bin_trips)` to get the points. `pipeline.cluster` doesn't return `points`.
Also, I have questions about labels.
[1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 2, 2, 3, 3, 0, 0, 4, 4, 8, 8, 7, 7]
Since I don't see any other sort method, does that mean the five trips at the beginning belong to label 1?
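Assuming the labels are emitted in trip order (my reading of the featurization output, not something confirmed here), a quick stdlib check of which trips share a label:

```python
from collections import defaultdict

# label list copied from above; labels[i] is assumed to be the cluster
# assigned to the i-th trip, in input order
labels = [1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 2, 2, 3, 3,
          0, 0, 4, 4, 8, 8, 7, 7]

# group trip indices by their cluster label
by_label = defaultdict(list)
for trip_idx, label in enumerate(labels):
    by_label[label].append(trip_idx)

print(by_label[1])  # → [0, 1, 2, 3, 4]
```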
@corinne-hcr please check my commit. Both the bin and the cluster displays work, both before and after deleting novel (uncommon) bins. Please use this to figure out the difference between bins and clusters before the 10am meeting tomorrow. You may want to visualize for all users - I think that the user you have selected has nClusters == nBins
DEBUG:root:number of bins after filtering: 8
DEBUG:root:number of clusters: 8
Again, you are encouraged to add log statements to the underlying server code to understand it better.
I have some questions.
1. In "Trips from all bins", you call `similarity.similarity(trips, radius)` and `sim.bin_data()`. It shows
DEBUG:root:After removing trips that are points, there are 51 data points
But in "Trips from bins above the cutoff only", I called `pipeline.remove_noise(trips, radius)`. It shows
DEBUG:root:After removing trips that are points, there are 50 data points
The number of data points is different. Shouldn't it be the same number of data points?
2. After calling `pipeline.remove_noise(trips, radius)`, it shows
the new number of trips is 22
the cutoff point is 8
number of bins after filtering: 8
But in my previous code, I remember I had around 23 trips, the cutoff point was 9, and there were 9 bins. That also shows in `viz_similarity_unlabeled.py`: there are 9 bins above the cutoff point. I don't know how that affects the outcome.
@corinne-hcr
Great questions! I would encourage you to experiment with the code (potentially by adding additional log statements) or running individual code snippets to resolve the discrepancies.
wrt answers, I don't know either. Again, I want to reiterate that this is not a class. I don't know the answers. You have to find out the answers and tell me.
Outcome: I compared bins/clusters on all data and bins/clusters above the cutoff point separately. I also changed to a dataset with more data.

- `min_clusters` = 2: I found some materials online that also start with k = 2. I think clustering means dividing one big group into at least two clusters.
- `max_clusters`: she used `1.5 * bins`. I think that is an empirical setting, but based on the implementation, the final number of clusters never hits the max.
- `kmedoids`: I think we can delete it. We never use it in the implementation; I think she just left it as an option.

DEBUG:root:There are 88 bins before filtering
DEBUG:root:number of clusters: 33
Bins are based on the distance of start points and end points of two trips. Clusters are based on silhouette score.
I added some logs to the server code. Here is how the silhouette score determines the number of clusters.
So, we can see here (the original max she set was 2): if sil > max, max is set equal to sil. As the tests keep going, sil reaches a peak and then decreases. cluster = 33 is the peak of the sil score. That's how the sil score finds the best number of clusters for all the data.
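The selection loop described above can be sketched as follows. This is a stdlib-only stand-in: `sil_score` is a made-up function shaped to peak at k = 33 like the logged run, while the real code in `featurization.py` computes the actual silhouette score on a k-means fit.

```python
# Hypothetical stand-in for the silhouette loop in featurization.py:
# test every k from min_clusters to max_clusters, keep the k whose
# silhouette score is highest.
min_clusters = 2
n_bins = 88                                # "88 bins before filtering"
max_clusters = int(1.5 * n_bins)           # the empirical upper bound
r = max_clusters - min_clusters + 1        # number of k values tested

def sil_score(k):
    # made-up score peaking at k = 33, mimicking the logged run; the real
    # code computes a true silhouette score on the fitted clusters
    return 1.0 - abs(k - 33) / 100.0

best_k, best_sil = None, float("-inf")
for k in range(min_clusters, max_clusters + 1):
    sil = sil_score(k)
    if sil > best_sil:                     # "if sil > max, max = sil"
        best_sil, best_k = sil, k

print(r, best_k)  # → 131 33
```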
There is `r = max_clusters - min_clusters + 1` in the `cluster` function in `featurization.py`. Since `min_clusters` = 2 and `max_clusters` = `1.5 * bins`, `r` is the number of cluster counts tested, from 2 to `max_clusters`.

Here are the trip labels after clustering:
For bins
DEBUG:root:After removing trips that are points, there are 154 data points
DEBUG:root:number of bins before filtering: 88
DEBUG:root:the new number of trips is 52
DEBUG:root:the cutoff point is 6
DEBUG:root:number of bins after filtering: 6
DEBUG:root:The list of bins is [[9, 14, 23, 25, 32, 34, 39, 53, 67, 69, 71, 87, 92, 112, 114, 128, 149, 151, 153], [10, 15, 24, 26, 33, 35, 40, 58, 68, 70, 88, 93, 113, 115, 150, 152], [5, 18, 77, 80, 142], [11, 16, 74, 131], [22, 64, 109, 130], [52, 61, 111, 134]]
For clusters
Instead of following the original code (`min = bins`), I set `min = 0`, so I can see whether the sil score also finds the same number of clusters.
Here is what I have
DEBUG:root:number of clusters: 6
So, given the same set of trips, the sil score also finds the same number of clusters, and the trip label distribution is exactly the same as the list of bins (same order and same number of trips in each cluster/bin).

- Why not just use clustering? I think the problem is that I don't see a method to find common trips after clustering (if we put all the data into the analysis, how do we find the 6 common-trip clusters out of 33?). The elbow method in clustering just finds k for the given data. So her use of a cutoff point on bins can be considered a good way to find common trips.
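The cutoff idea can be sketched like this. The bin sizes are loosely modeled on the logged run above, and the cutoff is a hypothetical constant here; the real `similarity.py` derives it from the sorted bin sizes.

```python
# Hypothetical sketch of finding "common trips" via a bin-size cutoff.
# Sizes loosely follow the logged run: 6 large bins plus many singletons.
bins = [[0] * 19, [0] * 16, [0] * 5, [0] * 4, [0] * 4, [0] * 4] + [[0]] * 46

sizes = sorted((len(b) for b in bins), reverse=True)  # 19, 16, 5, 4, 4, 4, 1, ...
cutoff = 2  # stand-in value; the real pipeline computes this from an elbow
common_bins = [b for b in bins if len(b) > cutoff]

print(len(common_bins))  # → 6
```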
Next, I am going to test on different users and see how all the data turns out in bins and clusters. I will also visualize some of the bins and clusters.
@corinne-hcr this is very good. Couple of clarifications:
As you have outlined, the next tasks are:
@corinne-hcr "I think, the problem is, I don't see a method to find common trips after clustering(if we put all data into analysis, how to find 6 common trips clusters out of 33)."
Given this, does it matter that we perform the pipeline steps in the order of binning and then clustering? If binning and clustering are largely independent, would it work to cluster first, use the silhouette score to determine k, and choose that as the elbow?
Update: I tested on different users to see how all the data turns out in bins and clusters. I will put both the all-data and above-cutoff results here, since I noticed something we can discuss.
user1:
-all data
DEBUG:root:There are 110 bins before filtering
DEBUG:root:number of clusters: 14
-above cutoff only
DEBUG:root:number of bins after filtering: 16
DEBUG:root:number of clusters: 16
user2:
-all data
DEBUG:root:There are 37 bins before filtering
DEBUG:root:number of clusters: 8
-above cutoff only
DEBUG:root:number of bins after filtering: 9
DEBUG:root:number of clusters: 9
user3:
-all data
DEBUG:root:There are 143 bins before filtering
DEBUG:root:number of clusters: 62
-above cutoff only
DEBUG:root:number of bins after filtering: 14
DEBUG:root:number of clusters: 13
Note that there are actually 14 cluster labels
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 11, 11, 11, 11, 11, 11, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 10, 10, 10, 10, 7, 7, 7, 7, 4, 4, 4, 9, 9, 9, 2, 2, 2, 6, 6, 6, 12, 12, 12]
The sil score starts decreasing at 14 clusters
DEBUG:root:sil is 0.950153323026658
DEBUG:root:sil > max True
DEBUG:root:sil is 0.950153323026658, max is 0.9283151017642491
DEBUG:root:The new max is 0.950153323026658
DEBUG:root:testing 14 clusters
DEBUG:root:sil is 0.9422369415328851
user4:
-all data
DEBUG:root:There are 32 bins before filtering
DEBUG:root:number of clusters: 13
-above cutoff only
DEBUG:root:number of bins after filtering: 2
DEBUG:root:number of clusters: 2
user5:
-all data
DEBUG:root:There are 196 bins before filtering
DEBUG:root:number of clusters: 2
-above cutoff only
DEBUG:root:number of bins after filtering: 17
DEBUG:root:number of clusters: 17
user6:
-all data
DEBUG:root:There are 71 bins before filtering
DEBUG:root:number of clusters: 8
-above cutoff only
DEBUG:root:number of bins after filtering: 9
DEBUG:root:The list of bins is [[1, 3, 12, 17, 23, 25, 32, 39, 41, 44, 47, 72, 104, 108], [4, 16, 19, 24, 26, 33, 40, 43, 46, 48, 78, 103, 107, 110], [28, 45, 100, 101, 102, 105, 106, 109, 123], [27, 75, 79, 82, 99, 113, 122], [2, 14, 81, 83, 98], [13, 42, 80, 120], [7, 20, 30], [9, 35, 62], [73, 77, 114]]
DEBUG:root:number of clusters: 6
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 2, 2, 2]
I think we should plot these on a map to see why different bins are assigned the same labels (clusters). Same as user 3.
user7:
-all data
DEBUG:root:There are 88 bins before filtering
DEBUG:root:number of clusters: 33
-above cutoff only
DEBUG:root:number of bins after filtering: 6
DEBUG:root:number of clusters: 6
user8:
-all data
DEBUG:root:There are 268 bins before filtering
DEBUG:root:number of clusters: 4
-above cutoff only
DEBUG:root:number of bins after filtering: 46
DEBUG:root:number of clusters: 44
user9:
-all data
DEBUG:root:There are 164 bins before filtering
DEBUG:root:number of clusters: 87
-above cutoff only
DEBUG:root:number of bins after filtering: 27
DEBUG:root:number of clusters: 26
[0, 0, 0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 14, 14, 14, 14, 14, 14, 21, 21, 21, 21, 21, 13, 13, 13, 13, 13, 1, 1, 1, 1, 19, 19, 19, 19, 20, 20, 20, 24, 24, 24, 16, 16, 16, 3, 3, 3, 17, 17, 5, 5, 4, 4, 18, 18, 9, 9, 15, 15, 6, 6, 23, 23, 10, 10, 25, 25, 22, 22, 11, 11, 8, 8]
See label `6`: same thing here
user 10:
-all data
DEBUG:root:There are 102 bins before filtering
DEBUG:root:number of clusters: 35
-above cutoff only
DEBUG:root:number of bins after filtering: 21
DEBUG:root:number of clusters: 21
user11:
-all data
DEBUG:root:There are 132 bins before filtering
DEBUG:root:number of clusters: 27
-above cutoff only
DEBUG:root:number of bins after filtering: 14
DEBUG:root:The list of bins is [[12, 22, 49, 51, 53, 63, 72, 175], [13, 25, 38, 54, 73, 139, 144], [24, 35, 37, 90, 127, 143, 172], [7, 17, 27, 29, 118, 165], [78, 111, 133, 150, 201], [8, 28, 119, 166], [23, 36, 52, 126], [32, 87, 132, 146], [56, 92, 97, 117], [112, 134, 151, 202], [113, 182, 189, 195], [26, 140, 145], [50, 62, 64], [131, 154, 161]]
DEBUG:root:number of clusters: 8
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 5, 5, 5, 5, 6, 6, 6, 6, 0, 0, 0, 0, 2, 2, 2, 2, 4, 4, 4, 1, 1, 1, 7, 7, 7]
user12: skipped
user13:
-all data
DEBUG:root:There are 303 bins before filtering
DEBUG:root:number of clusters: 185
-above cutoff only
DEBUG:root:number of bins after filtering: 11
DEBUG:root:number of clusters: 11
As we can see from the output above, bins are more sensitive than clusters. If we directly put all the data into clustering and then use the elbow method, we may not find the common trips as expected. Of course, I need to plot some special cases where bins are significantly different from clusters.
Second concern: for data above the cutoff, some examples show that the `bin_trips` clusters are fewer than the number of bins. I need to plot those on a map to see whether binning wrongly puts similar trips into different bins, or clustering incorrectly labels the trips. (Maybe I will just need to plot the trips that have the same label but are in different bins.)
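One way to find exactly those trips before plotting (a stdlib sketch with made-up `bins` and `labels`, not the real pipeline output): map each trip to its bin, then collect, for each cluster label, the set of bins its trips came from.

```python
# Made-up stand-ins: bin -> trip indices, and trip index -> cluster label
bins = [[0, 1, 2], [3, 4], [5, 6]]
labels = [1, 1, 1, 1, 1, 0, 0]

# invert the bin list: trip index -> bin index
trip_to_bin = {t: b for b, trips in enumerate(bins) for t in trips}

# for each cluster label, which bins do its trips come from?
label_to_bins = {}
for trip, label in enumerate(labels):
    label_to_bins.setdefault(label, set()).add(trip_to_bin[trip])

# labels spanning more than one bin are the ones worth plotting on a map
suspect = {lbl for lbl, bs in label_to_bins.items() if len(bs) > 1}
print(suspect)  # → {1}
```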
Note that I changed `min = len(bins)` to `min = 0` in "clusters above the cutoff only". Naomi used `min = len(bins)`; with that setting, the bins and clusters above the cutoff will be the same.
all that sounds good. My only other suggestion would be to capture the values here {all trips: bins, cluster, above cutoff: bins, cluster} in a data structure (maybe a simple data frame) and plot it. We can then include that plot in the report along with the maps that visualize the discrepancies (user3, user6,...)
also, don't forget to push up your changes to the PR after that is done!
Do you mean using a pandas data frame to show the values ({all trips: bins, cluster; above cutoff: bins, cluster})? Do you need the notebook now to see the difference? I haven't made many modifications. Maybe I'll push it later when I'm done with the map?
Yes, you should push after you collate the values and plot maps for the outliers
"pandas data frame to show the values": you can even plot them using `DataFrame.plot`!
Partial update: the difference between trips in the same bin and trips in the same cluster
DEBUG:root:There are 37 bins before filtering
DEBUG:root:number of clusters: 8
Here is the map for bins
Here is the map for clusters.
As we can see, in the cluster (label 1), the trips are largely different (red lines). However, those trips are in different bins. On all data, bins make more sense than clusters.
DEBUG:root:number of bins after filtering: 14
DEBUG:root:number of clusters: 8
Here is the bins list
DEBUG:root:The list of bins is [[12, 22, 49, 51, 53, 63, 72, 175], [13, 25, 38, 54, 73, 139, 144], [24, 35, 37, 90, 127, 143, 172], [7, 17, 27, 29, 118, 165], [78, 111, 133, 150, 201], [8, 28, 119, 166], [23, 36, 52, 126], [32, 87, 132, 146], [56, 92, 97, 117], [112, 134, 151, 202], [113, 182, 189, 195], [26, 140, 145], [50, 62, 64], [131, 154, 161]]
I will compare the trips in the first 3 bins and the last 2 bins with trips in label 1
Here is the trips from the first 3 bins
Here is the trips from the last 2 bins
Here is the label list
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 3, 3, 3, 3, 3, 6, 6, 6, 6, 1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2, 2, 0, 0, 0, 0, 4, 4, 4, 1, 1, 1, 7, 7, 7]
Here is the trips in label 1
Update: the graph of bins and clusters (from data above the cutoff):
Actually, let me try and run the output myself before I merge
I mentioned there are some differences in clustering (min = 0 vs. min = len(bins)). You don't have to run it right now. I will update it in a minute.
When I try to run the `viz_bins_clusters_above_cutoff` notebook, I get an error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-0f3d0ef3449a> in <module>
----> 1 bin_trips, bins = pipeline.remove_noise(trips, radius)
~/e-mission/e-mission-server/emission/analysis/modelling/tour_model/cluster_pipeline.py in remove_noise(data, radius)
54 sim.bin_data()
55 logging.debug('number of bins before filtering: %d' % len(sim.bins))
---> 56 sim.delete_bins()
57 logging.debug('number of bins after filtering: %d' % len(sim.bins))
58 return sim.newdata, sim.bins
~/e-mission/e-mission-server/emission/analysis/modelling/tour_model/similarity.py in delete_bins(self)
97 def delete_bins(self):
98 self.calc_cutoff_bins()
---> 99 for i in range(len(self.bins) - num):
100 self.bins.pop()
101 newdata = []
NameError: name 'num' is not defined
It looks like https://github.com/e-mission/e-mission-server/pull/792 introduced a regression in the cluster pipeline.
Ah looks like it is fixed in https://github.com/e-mission/e-mission-server/pull/794/files Thanks @corinne-hcr
I can in fact run the notebook through. Here's the graph that I get
I get this, too. I can't get the one on all data. It never stops.
All clusters also finished running for me
| all bins | all clusters |
|---|---|
| 92 | 17 |
| 37 | 8 |
| 160 | 47 |
| 36 | 17 |
| 206 | 2 |
| 76 | 8 |
| 97 | 32 |
| 311 | 4 |
| 171 | 4 |
| 122 | 2 |
| 132 | 27 |
| 324 | 192 |
Can you screenshot the full data graph? (with the index) thanks!!!
so it looks like another difference between the bin and cluster code is the time taken to run them. If you can add timing statements (https://stackoverflow.com/questions/29280470/what-is-timeit-in-python), I can re-run the notebooks on my laptop and determine the actual difference.
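The suggested timing statements could look like this (a sketch; `run_binning` is a placeholder for the real `sim.bin_data()` / `pipeline.cluster` calls):

```python
import logging
import timeit

logging.basicConfig(level=logging.DEBUG)

def run_binning():
    # placeholder workload standing in for the real binning/clustering call
    return sum(i * i for i in range(10_000))

# wrap each pipeline stage in default_timer() calls and log elapsed seconds
start = timeit.default_timer()
run_binning()
elapsed = timeit.default_timer() - start
logging.debug("binning took %.3f seconds", elapsed)
```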
Can you screenshot the full data graph? (with the index) thanks!!!
isn't that just https://github.com/e-mission/e-mission-eval-private-data/pull/16#issuecomment-778003311
I also added the dataframe in https://github.com/e-mission/e-mission-eval-private-data/pull/16#issuecomment-778005050
I think it should look like
that is not the full data, but only above the cutoff, right?
Yes, I can only get the cutoff version on my laptop. But for all data, I should have set up a similar data frame. Did you see a similar thing for all data? It should be a complete data frame with an index (user name).
| | above cutoff bins | above cutoff clusters (min = len(bins)) | above cutoff clusters (min = 0) |
|---|---|---|---|
| user1 | 17 | 17 | 17 |
| user2 | 9 | 9 | 9 |
| user3 | 16 | 16 | 15 |
| user4 | 2 | 2 | 2 |
| user5 | 20 | 20 | 19 |
| user6 | 6 | 6 | 3 |
| user7 | 7 | 7 | 7 |
| user8 | 28 | 28 | 26 |
| user9 | 27 | 27 | 26 |
| user10 | 25 | 25 | 24 |
| user11 | 14 | 16 | 8 |
| user13 | 14 | 14 | 14 |
Did you see the similar thing on all data? Should be a complete data frame with index(user name)
The dataframe for all data is at https://github.com/e-mission/e-mission-eval-private-data/pull/16#issuecomment-778005050 It is true that the index didn't get copy-pasted, but it is in the same order as the cutoff: user1 ... user13, with user12 missing
This is better than a screenshot because you can copy-paste it and get a dataframe back without having to type out all the numbers :)
{'all bins': {'user1': 92,
'user2': 37,
'user3': 160,
'user4': 36,
'user5': 206,
'user6': 76,
'user7': 97,
'user8': 311,
'user9': 171,
'user10': 122,
'user11': 132,
'user13': 324},
'all clusters': {'user1': 17,
'user2': 8,
'user3': 47,
'user4': 17,
'user5': 2,
'user6': 8,
'user7': 32,
'user8': 4,
'user9': 4,
'user10': 2,
'user11': 27,
'user13': 192}}
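For reference, the pasted dict really does round-trip into a DataFrame (abbreviated to three users here), and `DataFrame.plot` would then give the grouped chart suggested earlier:

```python
import pandas as pd

# abbreviated copy of the dict above; the full dict pastes in the same way
counts = {
    "all bins": {"user1": 92, "user2": 37, "user13": 324},
    "all clusters": {"user1": 17, "user2": 8, "user13": 192},
}
df = pd.DataFrame(counts)
# df.plot(kind="bar")  # requires matplotlib; one pair of bars per user

print(df.loc["user13", "all clusters"])  # → 192
```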
OK thanks!
Once you address the `fit_bounds`, I am ready to approve and merge.
@corinne-hcr merged. I squashed before merging to avoid lots of back and forth on the code redesign
Here is the bin trips on the map.
Here is the problem when running `pipeline.cluster`