Evaluate the current clustering based on start/end place alone

shankari commented 3 years ago

[x] Visualize the existing evaluation plots in a notebook
[x] Call the existing evaluation does not use labels (uses silhouette score) and plot the results
[x] Determine metrics for evaluating common place clustering using labels
[x] Implement the metrics for accuracy and user intervention
[ ] Visualize the implemented metrics using notebooks
[x] Tune the common place clustering for tradeoffs between accuracy and user intervention-
[x] Generate final settings

corinne-hcr commented 3 years ago

Shankari, I have many questions about the tasks. I don't understand fully during the meeting.

1.As for querying the user about novel trips, is it right after the clustering the common places to form common trips? Though I might not be able to get to that point, I am curious about that.

2.Are these two tasks similar to what Naomi did in her research? I saw the steps in the tasks are kind of similar to the paper. Hers get to tour graph, mine is common trips. It looks like a step before tour graph.

3.According to our internship plan, 4/15 is the deadline for the poster. So, it sounds like most of the work has to be done before that day. A feasible schedule is fine. I hope I can finish the two tasks, but I don't have an idea about what I can do and what I can't do base on the steps. I still have questions about them. That means, for now I don't know what I am going to face. I might need to get to the specific step to figure it out. Due to lack of experience, I need more guides.

4.For the first task, Evaluate the current common place clustering

(1)Move the existing evaluation that does not use labels (uses silhouette score) out of computation code

I want to know where to begin with? I always feel confused about which file to look at. Should I look at all files in tour_model? It looks like I need to have cluster knowledge before starting it.

(2)I don't understand. Can you expand it?

(3)What do you mean user intervention here?

(5)What do you mean tradeoffs here?

I think the steps are vague to me. I need more information to see what to do and how I can catch up on the gap.

shankari commented 3 years ago

1.As for querying the user about novel trips, is it right after the clustering the common places to form common trips? Though I might not be able to get to that point, I am curious about that.

No, the plan is that we will cluster common places, and then we will cluster the trips between those places to pre-generate common trip "buckets". A common trip may be "going from home to work on the #32 bus. Another one may be "going from home to work on the F light rail". As new trips come in, we will check them against the existing buckets and if they don't fit into the bucket, we will query the user.

You will only work on the first part (creating the buckets), which is the ML/analysis-heavy part. I (or maybe a summer intern) will implement the part where we match the incoming trips. That is integrating the ML algorithm into a system, and I don't think you will be able to finish it in this semester.

2.Are these two tasks similar to what Naomi did in her research? I saw the steps in the tasks are kind of similar to the paper. Hers get to tour graph, mine is common trips. It looks like a step before tour graph.

Naomi did the first task, but she put all trips between a pair of common places into the same bucket. The second task is differentiating between different ways of traveling between the same pair - i.e. putting the trips into more fine-grained buckets

wrt 3. I structured this as two separate and largely independent tasks so that we would have a backup plan if you are not able to finish both. If you only finish the first task, the project will be "evaluation of clustering algorithm to determine common places using labeled and unlabeled metrics". If you finish both tasks, the project will be "Using machine learning to classify common and novel trips from longitudinal labeled travel patterns"

4.For the first task, Evaluate the current common place clustering

(1)Move the existing evaluation that does not use labels (uses silhouette score) out of computation code I want to know where to begin with? I always feel confused about which file to look at. Should I look at all files in tour_model? It looks like I need to have cluster knowledge before starting it.

$ grep -r score emission/analysis/modelling/
...
emission/analysis/modelling//tour_model/cluster_groundtruth.py:from sklearn.metrics.cluster import homogeneity_score, completeness_score
emission/analysis/modelling//tour_model/cluster_groundtruth.py:    b = homogeneity_score(colors, labels)
emission/analysis/modelling//tour_model/cluster_groundtruth.py:    c = completeness_score(colors, labels)
emission/analysis/modelling//tour_model/featurization.py:                sil = metrics.silhouette_score(numpy.array(self.points), numpy.array(self.labels))
emission/analysis/modelling//tour_model/featurization.py:                sil = metrics.silhouette_score(numpy.array(self.points), self.labels)
emission/analysis/modelling//tour_model/featurization.py:        logging.debug('silhouette score is %s' % self.sil)

You should move the score calculation out into its own file since the score is not required while featurizing. I will move out the scoring from either featurization or cluster_groundtruth as an example for you to follow.

It looks like I need to have cluster knowledge before starting it.

All of them evaluate the machine learning technique of "clustering". sklearn (which is what they use) is a python machine learning library with a lot of machine learning algorithms (e.g. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)

(2)I don't understand. Can you expand it?

We can visualize the evaluation along a couple of axes:

first, we can and should visualize the clusters created as a sanity check of how they work. There are no formal metrics for this at this time, although we will come back to this after we have the metrics in place.
- sklearn has many examples of how to visualize the input dataset into clusters or classes. Think of the places as points in 2-D space, which is what sklearn uses for the visualization as well
- Naomi also has code for this (the map_clusters_by_groundtruth function), although it uses pygmaps and saves the output to a file.
- If we want to re-use that, we should move that out to a notebook and switch to folium so that we can visualize more easily and interactively.
next, we should compute the scores and generate a distribution of the scores with the current default settings
next, we should tune the settings and visualize the distributions of scores, ideally multiple plots in one graph

This will show us what value of the settings works well for the existing metrics

(3)What do you mean user intervention here?

How many novel places there are. Assuming we need to ask the user for every trip to a novel place, how often would be interrupt the user?

(5)What do you mean tradeoffs here?

Between accuracy and user attention. If we ask the user for every trip, we are guaranteed to be 100$ accurate but with a large user burden. If we ask the user for no trips and only use the automated inference that Ishrat is working on, we would have no user burden but would not get all the data we want. We want to find a nice balance between them.

shankari commented 3 years ago

If we check the cluster pipeline, remove_noise which calls similarity.similarity happens before featurization. So let's convert that first.

shankari commented 3 years ago

@corinne-hcr We're going to move the viz code into the e-mission-eval-private-data repo to avoid cluttering up the server. Since we are clearly not going to monorepo any time soon.

First, we fix the setup on that repo (https://github.com/e-mission/e-mission-eval-private-data/pull/12, https://github.com/e-mission/e-mission-eval-private-data/pull/13)

Next, we pull out the similarity viz code (https://github.com/e-mission/e-mission-eval-private-data/pull/14) from (https://github.com/e-mission/e-mission-server/pull/791)

shankari commented 3 years ago

Note that my recollection of the prior work was a little faulty. Naomi did not cluster places directly. Instead, she clustered trips, but only based on the start and end points. So the two stages are now:

evaluate current clustering based on only start/end points
add other features including start/end time, trajectory etc.

I have fixed the title of this bug to match.

shankari commented 3 years ago

@corinne-hcr from the visualization, the current cutoff point seems a bit off to me. Can you change this to plot graphs for the other users as well, and see if this is a common theme? If so, we may need to fiddle with the cutoff detection code.

Thanks! Also, feel free to post questions if there is anything about the PR that you don't understand

corinne-hcr commented 3 years ago

plot graphs for all users https://github.com/e-mission/e-mission-eval-private-data/pull/15 Here is the graph for all users.

shankari commented 3 years ago

Great. So now let's plot the cutoffs for multiple users from the CanBikeCO dataset and see visually whether the cutoff point seems reasonable. If not, we may want to see why the cutoff point is not reasonable and to tweak the settings/implementation accordingly.

shankari commented 3 years ago

After this, we will use featurization and then representatives. That is pretty much all.

corinne-hcr commented 3 years ago

Great. So now let's plot the cutoffs for multiple users from the CanBikeCO dataset and see visually whether the cutoff point seems reasonable. If not, we may want to see why the cutoff point is not reasonable and to tweak the settings/implementation accordingly.

Here is the updated graph for multiple users. plot graphs for all users #15

shankari commented 3 years ago

Next, we should evaluate these bins and see whether they make sense. Before we go down the path of tweaking the cluster radius, though, I want to understand the difference between the binning output and the clustering output. If we have already binned trips based on start and end, what does the clustering do?

I think we should also be able to plot the various bins and clusters for a specific UUID.

@corinne-hcr how long will it take you to:

outline the difference between bins and clusters
plot the trips in each of them (which will help you with the previous task)

Note that there is already an implementation of map_clusters in emission/analysis/modelling/tour_model/featurization.py. You just need to change it from google maps to folium.

shankari commented 3 years ago

@corinne-hcr do you have an estimate of how long it will take you to complete these tasks?

corinne-hcr commented 3 years ago

Before Tuesday?

shankari commented 3 years ago

@corinne-hcr I would actually like you to time bound each of the tasks and and the end of the time bound, post a PR of whatever you have for me to finish up. I do want you to get familiar with the code, but I also need to make sure that we are actually making progress.

So how about one day each:

one day to plot the trips in the bins and clusters (due Monday at 10am PST)
one day to elucidate the difference between them (due Tuesday at 10am PST)

corinne-hcr commented 3 years ago

That sounds good. I will try my best to finish it. If I face some problem, it might take a bit longer than expected, but I will let you know what I have by then.

shankari commented 3 years ago

@corinne-hcr is waiting for @shankari to run the clustering code from a notebook so we can evaluate it While waiting, @corinne-hcr moves on to the next task: "Call the existing evaluation that does not use labels (uses silhouette score) and plot the results"

You are encouraged to add additional logging statements to the server code. In order to see log debug statements in the notebook, you need to enable logging, similar to https://stackoverflow.com/questions/18786912/get-output-from-the-logging-module-in-ipython-notebook

corinne-hcr commented 3 years ago

Here is the link of visualization of bins and clusters. https://github.com/e-mission/e-mission-eval-private-data/pull/16

shankari commented 3 years ago

Next set of changes:

read up on precision, recall and F-score
- @corinne-hcr's idea on accuracy % of matching labels is precision
- these metrics may be trickier for clustering, e.g. https://stats.stackexchange.com/questions/15158/precision-and-recall-for-clustering)
compute precision, recall and F-score for both bins and clusters
- note that you will be able to do this only for a subset of the trips which do have labels
- you will need to figure out how this works with multiple labels

shankari commented 3 years ago

Determine metrics for evaluating common place clustering using labels
- (DONE) Determined accuracy for bins
- Extend to precision, recall and F-score
Use metrics on other scenarios
- bins above cutoff
- clusters above-cutoff using clustering algorithm that does not depend on bins
Implement the metrics for accuracy and user intervention
- (DONE) Implemented accuracy (precision)
- Need to implement the other metrics
- Need to come up with metrics for user intervention
  - This is likely to be as simple as how many trips per month in below cut-off bins
Visualize the implemented metrics using notebooks
- we already have the accuracy, can put into a graph
- once we get the number of trips, we can also put that into a graph

shankari commented 3 years ago

We can consider this problem through both a clustering and a classification lens. Although the prior work has primarily included clustering, we can also now that we have labeled trips, consider the trips as belonging to one of several classes and use the standard classification precision and recall metrics.

I believe that the reason this doesn't work is because of the featurization of the trips. For classification ML methods, we typically need a feature matrix with one entry per data point and feature. For trip start/end, though, it is not clear that putting the raw lat/lon into the feature matrix will work; it doesn't take noise into account. What we really need is the distance from other trips but that is a 2-D (trip x trip) matrix in itself.

Which is why we are using clustering now. When we move to the second phase of the project, we can and probably should use more standard classification algorithms, or at least we should try them as well.

In that case, we would follow a two step approach in which we first cluster/bin the trips, and then put the bin or cluster number into the feature matrix. This should work as long as the bin numbers or cluster numbers are stable; if they are not, we may need to use the centroid or a geo region or just use clustering after all.

shankari commented 3 years ago

While waiting for the F-score determination, let us continue exploring the data and tuning it based on the current precision/accuracy metric.

While tuning, we typically experiment with different featurization options and different input tuning parameters to the ML algorithms. In our case:

different tuning parameters are:
- all bins v/s only above cutoff
- similarity distance (currently 300)
different algorithm types are:
- binning
- clustering

We only have one featurization which is the distance between (start, end) pairs

As part of tuning, you will want to run for all combinations for the tuning parameters and see if there are any paths that are noticeably better than others. Please note that these results will go into the final paper, at least in summary form.

Note that scikit-learn has built-in techniques for evaluation/validation - e.g. https://scikit-learn.org/stable/modules/model_evaluation.html

Our evaluation may not fit as well into their examples, but it might if you try hard enough. Alternatively, we can implement a parallel method. But evaluation/validation is a key component to model selection.

shankari commented 3 years ago

Since we are using clusters, but want to calculate precision and recall, we should look at algorithms and evaluation metrics for labeled clusters. That is likely to be closest to what we want.

shankari commented 3 years ago

Although we can manually compute the precision and recall and F score for labeled clusters, sklearn includes built-in metrics that work with labeled clusters (https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation)

These include metrics 2.3.10.1. to 2.3.10.4.

In general, we should use existing metrics rather than re-implementing new ones because it is better to not reinvent the wheel.

I think we should use the existing labeled cluster metrics instead of implementing F-score from scratch. The V-score seems to be pretty close to what we want, and it is already implemented!

F-score is much better known than the labeled cluster metrics, so that's what came to my mind first. I apologize for any confusion!

As I pointed out earlier, you could see if we can also re-use the model evaluation methods from scikit-learn for the tuning.

corinne-hcr commented 3 years ago

The precision on all data is here

corinne-hcr commented 3 years ago

query times subplots and evaluation code for v-score

e-mission / e-mission-docs

Evaluate the current clustering based on start/end place alone #605