Closed: corinne-hcr closed this 3 years ago
@corinne-hcr this works for me.
--- a/tour_model_eval/trips_in_bins_and_clusters.ipynb
+++ b/tour_model_eval/trips_in_bins_and_clusters.ipynb
@@ -234,7 +234,7 @@
"# cluster the data using k-means\n",
"# def cluster(data, bins)\n",
"# return feat.clusters, feat.labels, feat.data\n",
- "clusters,labels,feat_data = pipeline.cluster(bin_trips,bins)\n",
+ "clusters,labels,feat_data = pipeline.cluster(bin_trips,len(bins))\n",
"clusters"
]
},
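For readers following along, a minimal hypothetical sketch (not the real emission code) of why the second argument changed: per the commented signature `cluster(data, bins)`, the function expects the bin *count* (later used as `1.5 * bins` for `max_clusters`), not the bin list itself.

```python
# Hypothetical illustration of the fix: pass the bin count, not the bin list.
bins = [[9, 14, 23], [10, 15, 24], [5, 18, 77]]  # made-up bin contents

n_bins = len(bins)                 # what cluster(data, bins) expects
max_clusters = int(1.5 * n_bins)   # how the count is used downstream

print(n_bins, max_clusters)  # → 3 4
```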
@corinne-hcr I've made the change to call the cluster code. Do you need me to work on the plotting as well, or can you take it from here?
OK, I think I can take it from here
I have a high level comment on your code structure which you need to address as well. Let me know if I should add some sample code for that as well.
Otherwise, push early and often!
I have a question: how can you modify a file that you haven't merged? I can now change my file the way you do, but if I then commit the file, there will be duplicate modifications.
you can pull from your branch (`origin <branch>`)
I can't plot the cluster graph. I can't access the points through `pipeline.cluster`; I have to run `featurization.featurization(bin_trips)` to get the points. `pipeline.cluster` doesn't return `points`.
Also, I have questions about labels.
[1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 2, 2, 3, 3, 0, 0, 4, 4, 8, 8, 7, 7]
Since I don't see any other sort method, does that mean the five trips at the beginning belong to label 1?
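Assuming the labels are emitted in trip order (my reading of the featurization output, not something confirmed here), a quick stdlib check of which trips share a label:

```python
from collections import defaultdict

# label list copied from above; labels[i] is assumed to be the cluster
# assigned to the i-th trip, in input order
labels = [1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 2, 2, 3, 3,
          0, 0, 4, 4, 8, 8, 7, 7]

# group trip indices by their cluster label
by_label = defaultdict(list)
for trip_idx, label in enumerate(labels):
    by_label[label].append(trip_idx)

print(by_label[1])  # → [0, 1, 2, 3, 4]
```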
@corinne-hcr please check my commit. Both the bin and the cluster displays work, both before and after deleting novel (uncommon) bins. Please use this to figure out the difference between bins and clusters before the 10am meeting tomorrow. You may want to visualize for all users - I think that the user you have selected has nClusters == nBins
DEBUG:root:number of bins after filtering: 8
DEBUG:root:number of clusters: 8
Again, you are encouraged to add log statements to the underlying server code to understand it better.
I have some questions.
1. In "Trips from all bins", you call `similarity.similarity(trips, radius)` and `sim.bin_data()`. It shows
DEBUG:root:After removing trips that are points, there are 51 data points
But in "Trips from bins above the cutoff only", I called `pipeline.remove_noise(trips, radius)`. It shows
DEBUG:root:After removing trips that are points, there are 50 data points
The number of data points is different. Shouldn't it be the same number of data points?
2. After calling `pipeline.remove_noise(trips, radius)`, it shows
the new number of trips is 22
the cutoff point is 8
number of bins after filtering: 8
But in my previous code, I remember I had around 23 trips, the cutoff point was 9, and there were 9 bins. That also shows in `viz_similarity_unlabeled.py`: there are 9 bins above the cutoff point. I don't know how that affects the outcome.
@corinne-hcr
Great questions! I would encourage you to experiment with the code (potentially by adding additional log statements) or running individual code snippets to resolve the discrepancies.
wrt answers, I don't know either. Again, I want to reiterate that this is not a class. I don't know the answers. You have to find out the answers and tell me.
Outcome: I compared bins/clusters on all data and bins/clusters above the cutoff point separately. I also changed to a dataset with more data.

- `min_clusters` = 2: I found some materials online that also start with k = 2. I think clustering means dividing one big group into at least two clusters.
- `max_clusters`: she used `1.5 * bins`. I think that is an empirical setting, but based on the implementation, the final number of clusters never hits the max.
- `kmedoids`: I think we can delete it. We never use it in the implementation; I think she just left it as an option.

DEBUG:root:There are 88 bins before filtering
DEBUG:root:number of clusters: 33
Bins are based on the distance of start points and end points of two trips. Clusters are based on silhouette score.
I added some logs to the server code. Here is how the silhouette score determines the number of clusters.
So, we can see here (the original max she set was 2): if sil > max, max is set equal to sil. As the tests keep going, sil reaches a peak and then decreases. cluster = 33 is the peak of the sil score. That's how the sil score finds the best number of clusters for all the data.
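The selection loop described above can be sketched as follows. This is a stdlib-only stand-in: `sil_score` is a made-up function shaped to peak at k = 33 like the logged run, while the real code in `featurization.py` computes the actual silhouette score on a k-means fit.

```python
# Hypothetical stand-in for the silhouette loop in featurization.py:
# test every k from min_clusters to max_clusters, keep the k whose
# silhouette score is highest.
min_clusters = 2
n_bins = 88                                # "88 bins before filtering"
max_clusters = int(1.5 * n_bins)           # the empirical upper bound
r = max_clusters - min_clusters + 1        # number of k values tested

def sil_score(k):
    # made-up score peaking at k = 33, mimicking the logged run; the real
    # code computes a true silhouette score on the fitted clusters
    return 1.0 - abs(k - 33) / 100.0

best_k, best_sil = None, float("-inf")
for k in range(min_clusters, max_clusters + 1):
    sil = sil_score(k)
    if sil > best_sil:                     # "if sil > max, max = sil"
        best_sil, best_k = sil, k

print(r, best_k)  # → 131 33
```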
There is `r = max_clusters - min_clusters + 1` in the `cluster` function in `featurization.py`. Since `min_clusters` = 2 and `max_clusters` = `1.5 * bins`, `r` is the number of cluster counts tested, from 2 to `max_clusters`.

Here are the trip labels after clustering:
For bins
DEBUG:root:After removing trips that are points, there are 154 data points
DEBUG:root:number of bins before filtering: 88
DEBUG:root:the new number of trips is 52
DEBUG:root:the cutoff point is 6
DEBUG:root:number of bins after filtering: 6
DEBUG:root:The list of bins is [[9, 14, 23, 25, 32, 34, 39, 53, 67, 69, 71, 87, 92, 112, 114, 128, 149, 151, 153], [10, 15, 24, 26, 33, 35, 40, 58, 68, 70, 88, 93, 113, 115, 150, 152], [5, 18, 77, 80, 142], [11, 16, 74, 131], [22, 64, 109, 130], [52, 61, 111, 134]]
For clusters
Instead of following the original code (`min = bins`), I set `min = 0`, so I can see whether the sil score also finds the same number of clusters.
Here is what I have
DEBUG:root:number of clusters: 6
So, given the same set of trips, the sil score also finds the same number of clusters, and the trip label distribution is exactly the same as the list of bins (same order and same number of trips in each cluster/bin).

- Why not just use clustering? I think the problem is that I don't see a method to find common trips after clustering (if we put all the data into the analysis, how do we find the 6 common-trip clusters out of 33?). The elbow method in clustering just finds k for the given data. So her use of a cutoff point on bins can be considered a good way to find common trips.
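The cutoff idea can be sketched like this. The bin sizes are loosely modeled on the logged run above, and the cutoff is a hypothetical constant here; the real `similarity.py` derives it from the sorted bin sizes.

```python
# Hypothetical sketch of finding "common trips" via a bin-size cutoff.
# Sizes loosely follow the logged run: 6 large bins plus many singletons.
bins = [[0] * 19, [0] * 16, [0] * 5, [0] * 4, [0] * 4, [0] * 4] + [[0]] * 46

sizes = sorted((len(b) for b in bins), reverse=True)  # 19, 16, 5, 4, 4, 4, 1, ...
cutoff = 2  # stand-in value; the real pipeline computes this from an elbow
common_bins = [b for b in bins if len(b) > cutoff]

print(len(common_bins))  # → 6
```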
Next, I am going to test on different users and see how all the data turns out in bins and clusters. I will also visualize some of the bins and clusters.
@corinne-hcr this is very good. Couple of clarifications:
As you have outlined, the next tasks are:
@corinne-hcr "I think, the problem is, I don't see a method to find common trips after clustering(if we put all data into analysis, how to find 6 common trips clusters out of 33)."
Given this, does it matter that we perform the pipeline steps in the order of binning and then clustering? If binning and clustering are largely independent, would it work to cluster first, use the silhouette score to determine k, and choose that as the elbow?
Update: I tested on different users to see how all the data turns out in bins and clusters. I will put both the all-data and above-cutoff results here, since I noticed something we can discuss.
user1:
-all data
DEBUG:root:There are 110 bins before filtering
DEBUG:root:number of clusters: 14
-above cutoff only
DEBUG:root:number of bins after filtering: 16
DEBUG:root:number of clusters: 16
user2:
-all data
DEBUG:root:There are 37 bins before filtering
DEBUG:root:number of clusters: 8
-above cutoff only
DEBUG:root:number of bins after filtering: 9
DEBUG:root:number of clusters: 9
user3:
-all data
DEBUG:root:There are 143 bins before filtering
DEBUG:root:number of clusters: 62
-above cutoff only
DEBUG:root:number of bins after filtering: 14
DEBUG:root:number of clusters: 13
Note that there are actually 14 cluster labels
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 11, 11, 11, 11, 11, 11, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 10, 10, 10, 10, 7, 7, 7, 7, 4, 4, 4, 9, 9, 9, 2, 2, 2, 6, 6, 6, 12, 12, 12]
The sil score starts decreasing at 14 clusters
DEBUG:root:sil is 0.950153323026658
DEBUG:root:sil > max True
DEBUG:root:sil is 0.950153323026658, max is 0.9283151017642491
DEBUG:root:The new max is 0.950153323026658
DEBUG:root:testing 14 clusters
DEBUG:root:sil is 0.9422369415328851
user4:
-all data
DEBUG:root:There are 32 bins before filtering
DEBUG:root:number of clusters: 13
-above cutoff only
DEBUG:root:number of bins after filtering: 2
DEBUG:root:number of clusters: 2
user5:
-all data
DEBUG:root:There are 196 bins before filtering
DEBUG:root:number of clusters: 2
-above cutoff only
DEBUG:root:number of bins after filtering: 17
DEBUG:root:number of clusters: 17
user6:
-all data
DEBUG:root:There are 71 bins before filtering
DEBUG:root:number of clusters: 8
-above cutoff only
DEBUG:root:number of bins after filtering: 9
DEBUG:root:The list of bins is [[1, 3, 12, 17, 23, 25, 32, 39, 41, 44, 47, 72, 104, 108], [4, 16, 19, 24, 26, 33, 40, 43, 46, 48, 78, 103, 107, 110], [28, 45, 100, 101, 102, 105, 106, 109, 123], [27, 75, 79, 82, 99, 113, 122], [2, 14, 81, 83, 98], [13, 42, 80, 120], [7, 20, 30], [9, 35, 62], [73, 77, 114]]
DEBUG:root:number of clusters: 6
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 2, 2, 2]
I think we should plot these on a map to see why different bins are assigned the same labels (clusters). Same as user 3.
user7:
-all data
DEBUG:root:There are 88 bins before filtering
DEBUG:root:number of clusters: 33
-above cutoff only
DEBUG:root:number of bins after filtering: 6
DEBUG:root:number of clusters: 6
user8:
-all data
DEBUG:root:There are 268 bins before filtering
DEBUG:root:number of clusters: 4
-above cutoff only
DEBUG:root:number of bins after filtering: 46
DEBUG:root:number of clusters: 44
user9:
-all data
DEBUG:root:There are 164 bins before filtering
DEBUG:root:number of clusters: 87
-above cutoff only
DEBUG:root:number of bins after filtering: 27
DEBUG:root:number of clusters: 26
[0, 0, 0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 14, 14, 14, 14, 14, 14, 21, 21, 21, 21, 21, 13, 13, 13, 13, 13, 1, 1, 1, 1, 19, 19, 19, 19, 20, 20, 20, 24, 24, 24, 16, 16, 16, 3, 3, 3, 17, 17, 5, 5, 4, 4, 18, 18, 9, 9, 15, 15, 6, 6, 23, 23, 10, 10, 25, 25, 22, 22, 11, 11, 8, 8]
See label `6`: same thing here
user 10:
-all data
DEBUG:root:There are 102 bins before filtering
DEBUG:root:number of clusters: 35
-above cutoff only
DEBUG:root:number of bins after filtering: 21
DEBUG:root:number of clusters: 21
user11:
-all data
DEBUG:root:There are 132 bins before filtering
DEBUG:root:number of clusters: 27
-above cutoff only
DEBUG:root:number of bins after filtering: 14
DEBUG:root:The list of bins is [[12, 22, 49, 51, 53, 63, 72, 175], [13, 25, 38, 54, 73, 139, 144], [24, 35, 37, 90, 127, 143, 172], [7, 17, 27, 29, 118, 165], [78, 111, 133, 150, 201], [8, 28, 119, 166], [23, 36, 52, 126], [32, 87, 132, 146], [56, 92, 97, 117], [112, 134, 151, 202], [113, 182, 189, 195], [26, 140, 145], [50, 62, 64], [131, 154, 161]]
DEBUG:root:number of clusters: 8
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 5, 5, 5, 5, 6, 6, 6, 6, 0, 0, 0, 0, 2, 2, 2, 2, 4, 4, 4, 1, 1, 1, 7, 7, 7]
user12: skipped
user13:
-all data
DEBUG:root:There are 303 bins before filtering
DEBUG:root:number of clusters: 185
-above cutoff only
DEBUG:root:number of bins after filtering: 11
DEBUG:root:number of clusters: 11
As we can see from the output above, bins are more sensitive than clusters. If we directly put all the data into clustering and then use the elbow method, we may not find the common trips as expected. Of course, I need to plot some special cases where bins are significantly different from clusters.
Second concern: for data above the cutoff, some examples show that the `bin_trips` clusters are fewer than the number of bins. I need to plot those on a map to see whether binning wrongly puts similar trips into different bins, or clustering incorrectly labels the trips. (Maybe I will just need to plot the trips that have the same label but are in different bins.)
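One way to find exactly those trips before plotting (a stdlib sketch with made-up `bins` and `labels`, not the real pipeline output): map each trip to its bin, then collect, for each cluster label, the set of bins its trips came from.

```python
# Made-up stand-ins: bin -> trip indices, and trip index -> cluster label
bins = [[0, 1, 2], [3, 4], [5, 6]]
labels = [1, 1, 1, 1, 1, 0, 0]

# invert the bin list: trip index -> bin index
trip_to_bin = {t: b for b, trips in enumerate(bins) for t in trips}

# for each cluster label, which bins do its trips come from?
label_to_bins = {}
for trip, label in enumerate(labels):
    label_to_bins.setdefault(label, set()).add(trip_to_bin[trip])

# labels spanning more than one bin are the ones worth plotting on a map
suspect = {lbl for lbl, bs in label_to_bins.items() if len(bs) > 1}
print(suspect)  # → {1}
```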
Note that I changed `min = len(bins)` to `min = 0` in "clusters above the cutoff only". Naomi used `min = len(bins)`; with that setting, the bins and clusters above the cutoff will be the same.
all that sounds good. My only other suggestion would be to capture the values here {all trips: bins, cluster, above cutoff: bins, cluster} in a data structure (maybe a simple data frame) and plot it. We can then include that plot in the report along with the maps that visualize the discrepancies (user3, user6,...)
also, don't forget to push up your changes to the PR after that is done!
Do you mean using a pandas data frame to show the values ({all trips: bins, cluster; above cutoff: bins, cluster})? Do you need the notebook now to see the difference? I haven't made many modifications. Maybe I'll push it later when I'm done with the map?
Yes, you should push after you collate the values and plot maps for the outliers
"pandas data frame to show the values": you can even plot them using `DataFrame.plot`!
Partial update: the difference between trips in the same bin and trips in the same cluster
DEBUG:root:There are 37 bins before filtering
DEBUG:root:number of clusters: 8
Here is the map for bins
Here is the map for clusters.
As we can see, in the cluster (label 1), the trips are largely different (red lines). However, those trips are in different bins. On all data, bins make more sense than clusters.
DEBUG:root:number of bins after filtering: 14
DEBUG:root:number of clusters: 8
Here is the bins list
DEBUG:root:The list of bins is [[12, 22, 49, 51, 53, 63, 72, 175], [13, 25, 38, 54, 73, 139, 144], [24, 35, 37, 90, 127, 143, 172], [7, 17, 27, 29, 118, 165], [78, 111, 133, 150, 201], [8, 28, 119, 166], [23, 36, 52, 126], [32, 87, 132, 146], [56, 92, 97, 117], [112, 134, 151, 202], [113, 182, 189, 195], [26, 140, 145], [50, 62, 64], [131, 154, 161]]
I will compare the trips in the first 3 bins and the last 2 bins with trips in label 1
Here is the trips from the first 3 bins
Here is the trips from the last 2 bins
Here is the label list
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 3, 3, 3, 3, 3, 6, 6, 6, 6, 1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2, 2, 0, 0, 0, 0, 4, 4, 4, 1, 1, 1, 7, 7, 7]
Here is the trips in label 1
Update: the graph of bins and clusters (from data above the cutoff):
Actually, let me try and run the output myself before I merge
I mentioned there are some differences in clustering (min = 0 vs. min = len(bins)). You don't have to run it right now. I will update it in a minute.
When I try to run the `viz_bins_clusters_above_cutoff` notebook, I get an error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-0f3d0ef3449a> in <module>
----> 1 bin_trips, bins = pipeline.remove_noise(trips, radius)
~/e-mission/e-mission-server/emission/analysis/modelling/tour_model/cluster_pipeline.py in remove_noise(data, radius)
54 sim.bin_data()
55 logging.debug('number of bins before filtering: %d' % len(sim.bins))
---> 56 sim.delete_bins()
57 logging.debug('number of bins after filtering: %d' % len(sim.bins))
58 return sim.newdata, sim.bins
~/e-mission/e-mission-server/emission/analysis/modelling/tour_model/similarity.py in delete_bins(self)
97 def delete_bins(self):
98 self.calc_cutoff_bins()
---> 99 for i in range(len(self.bins) - num):
100 self.bins.pop()
101 newdata = []
NameError: name 'num' is not defined
It looks like https://github.com/e-mission/e-mission-server/pull/792 introduced a regression in the cluster pipeline.
Ah looks like it is fixed in https://github.com/e-mission/e-mission-server/pull/794/files Thanks @corinne-hcr
I can in fact run the notebook through. Here's the graph that I get
I get this, too. I can't get the one on all data. It never stops.
All clusters also finished running for me
| all bins | all clusters |
|---|---|
| 92 | 17 |
| 37 | 8 |
| 160 | 47 |
| 36 | 17 |
| 206 | 2 |
| 76 | 8 |
| 97 | 32 |
| 311 | 4 |
| 171 | 4 |
| 122 | 2 |
| 132 | 27 |
| 324 | 192 |
Can you screenshot the full data graph? (with the index) thanks!!!
so it looks like another difference between the bin and cluster code is the time taken to run them. If you can add timing statements (https://stackoverflow.com/questions/29280470/what-is-timeit-in-python), I can re-run the notebooks on my laptop and determine the actual difference.
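The suggested timing statements could look like this (a sketch; `run_binning` is a placeholder for the real `sim.bin_data()` / `pipeline.cluster` calls):

```python
import logging
import timeit

logging.basicConfig(level=logging.DEBUG)

def run_binning():
    # placeholder workload standing in for the real binning/clustering call
    return sum(i * i for i in range(10_000))

# wrap each pipeline stage in default_timer() calls and log elapsed seconds
start = timeit.default_timer()
run_binning()
elapsed = timeit.default_timer() - start
logging.debug("binning took %.3f seconds", elapsed)
```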
Can you screenshot the full data graph? (with the index) thanks!!!
isn't that just https://github.com/e-mission/e-mission-eval-private-data/pull/16#issuecomment-778003311
I also added the dataframe in https://github.com/e-mission/e-mission-eval-private-data/pull/16#issuecomment-778005050
I think it should look like
that is not the full data, but only above the cutoff, right?
Yes, I can only get the cutoff version on my laptop. But for all data, I should have set up a similar data frame. Did you see a similar thing for all data? It should be a complete data frame with an index (user name).
| | above cutoff bins | above cutoff clusters (min = len(bins)) | above cutoff clusters (min = 0) |
|---|---|---|---|
| user1 | 17 | 17 | 17 |
| user2 | 9 | 9 | 9 |
| user3 | 16 | 16 | 15 |
| user4 | 2 | 2 | 2 |
| user5 | 20 | 20 | 19 |
| user6 | 6 | 6 | 3 |
| user7 | 7 | 7 | 7 |
| user8 | 28 | 28 | 26 |
| user9 | 27 | 27 | 26 |
| user10 | 25 | 25 | 24 |
| user11 | 14 | 16 | 8 |
| user13 | 14 | 14 | 14 |
Did you see the similar thing on all data? Should be a complete data frame with index(user name)
The dataframe for all data is at https://github.com/e-mission/e-mission-eval-private-data/pull/16#issuecomment-778005050 It is true that the index didn't get copy-pasted, but it is in the same order as the cutoff: user1 ... user13, with user12 missing
This is better than a screenshot because you can copy-paste it and get a dataframe back without having to type out all the numbers :)
{'all bins': {'user1': 92,
'user2': 37,
'user3': 160,
'user4': 36,
'user5': 206,
'user6': 76,
'user7': 97,
'user8': 311,
'user9': 171,
'user10': 122,
'user11': 132,
'user13': 324},
'all clusters': {'user1': 17,
'user2': 8,
'user3': 47,
'user4': 17,
'user5': 2,
'user6': 8,
'user7': 32,
'user8': 4,
'user9': 4,
'user10': 2,
'user11': 27,
'user13': 192}}
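For reference, the pasted dict really does round-trip into a DataFrame (abbreviated to three users here), and `DataFrame.plot` would then give the grouped chart suggested earlier:

```python
import pandas as pd

# abbreviated copy of the dict above; the full dict pastes in the same way
counts = {
    "all bins": {"user1": 92, "user2": 37, "user13": 324},
    "all clusters": {"user1": 17, "user2": 8, "user13": 192},
}
df = pd.DataFrame(counts)
# df.plot(kind="bar")  # requires matplotlib; one pair of bars per user

print(df.loc["user13", "all clusters"])  # → 192
```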
OK thanks!
Once you address the `fit_bounds`, I am ready to approve and merge.
@corinne-hcr merged. I squashed before merging to avoid lots of back and forth on the code redesign
Here is the bin trips on the map.
Here is the problem when running `pipeline.cluster`