Add a notebook to explore various parameters for the radius

shankari commented 3 years ago

Since we have labeled data, instead of starting with the locations, clustering, and then finding a homogeneity score, we start with clusters that we expect to have good coherence properties, and experiment with various radii.

Through this ad-hoc experimental approach, which includes comparing the points on a map and making a judgement call on "I know it when I see it", we come up with a radius of 500 meters.

This gives fairly coherent clusters when filtering by purpose before clustering: the percentage of a user's trips that fall in some cluster ranges between 40% and 90%.

It gives even better clusters when clustering without filtering by purpose: the percentage of a user's trips that fall in some cluster is consistently above 80%. This indicates that there are clusters with mixed purpose labels. So either there really is a lot of ambiguity in the dataset, or people just made a lot of errors while labeling.

We can identify users with what appear to be a lot of labeling errors and follow up with them if possible. But even the 40% at the low end of the purpose-filtered range is not too shabby.
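To make the "% of trips in some cluster" metric concrete, here is a minimal sketch on toy data. It is not the notebook's code: the greedy radius binning, the `min_size=2` threshold, and the names `haversine_m`, `count_in_clusters`, `pct_in_clusters`, and `pct_in_clusters_by_purpose` are all assumptions made for illustration. It compares clustering all of a user's trip locations together against clustering each purpose label separately.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_in_clusters(points, radius_m, min_size=2):
    """Greedy binning (an assumed stand-in for the real clustering):
    each point joins the first bin whose first point is within radius_m,
    otherwise it starts a new bin. Returns how many points end up in
    bins with at least min_size members."""
    bins = []
    for p in points:
        for b in bins:
            if haversine_m(b[0], p) <= radius_m:
                b.append(p)
                break
        else:
            bins.append([p])
    return sum(len(b) for b in bins if len(b) >= min_size)


def pct_in_clusters(trips, radius_m):
    """Percentage of trips in some cluster, ignoring purpose labels."""
    locs = [loc for _, loc in trips]
    return 100.0 * count_in_clusters(locs, radius_m) / len(locs)


def pct_in_clusters_by_purpose(trips, radius_m):
    """Percentage of trips in some cluster when each purpose is clustered separately."""
    by_purpose = defaultdict(list)
    for purpose, loc in trips:
        by_purpose[purpose].append(loc)
    clustered = sum(count_in_clusters(locs, radius_m) for locs in by_purpose.values())
    return 100.0 * clustered / len(trips)


# Toy (purpose, (lat, lon)) trips for one hypothetical user.
# 0.001 degrees of latitude is roughly 111 meters.
toy_trips = [
    ("home", (37.0000, -122.0000)),
    ("home", (37.0010, -122.0000)),
    ("shopping", (37.0020, -122.0000)),
    ("work", (37.1000, -122.0000)),
]

for radius in (100, 150, 500):
    print(radius,
          pct_in_clusters(toy_trips, radius),
          pct_in_clusters_by_purpose(toy_trips, radius))
```

In this toy case the 500 m radius puts the "shopping" trip into the same cluster as the two "home" trips when purpose is ignored, but leaves it unclustered when grouping by purpose first. That is the same kind of gap between the two percentages described above.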

Next steps:

- run this on the other datasets for comparison
- evaluate the first round of clustering with a radius of 500

shankari commented 3 years ago

% of trips that are in clusters without any grouping:

- [figure] Scatter plot of purpose vs. validity (color is for the user)
- [figure] Summary at the user level

% of trips that are in clusters when first grouping by the purpose label:

- [figure] Scatter plot of purpose vs. validity (color is for the user)
- [figure] Summary at the user level

shankari commented 3 years ago

@corinne-hcr @GabrielKS, FYI. I plan to run a similar analysis, focusing on the selected radius of 500 meters, on the other datasets tomorrow, and also see what changes with respect to clustering with radius = 500.

shankari commented 3 years ago

Seeing no reviews, I plan to merge this now.

e-mission / e-mission-eval-private-data