shankari closed this 3 years ago
@corinne-hcr @GabrielKS, FYI. Tomorrow I plan to run a similar analysis on the other datasets, focusing on the selected radius of 500 meters, and also see what changes with respect to clustering with radius = 500.
Seeing no reviews, I plan to merge this now.
Since we have labeled data, instead of starting with the locations, clustering, and then finding a homogeneity score, we start with clusters that we expect to have good coherence properties, and experiment with various radii.
Through this ad hoc experimental approach, which includes comparing the points on a map and making an "I know it when I see it" judgment call, we arrive at a radius of 500 meters.
This gives fairly coherent clusters when filtering by purpose before clustering: the percentage of each user's trips that fall in some cluster ranges between 40% and 90%.
It gives even better clusters when clustering without filtering by purpose: the percentage of each user's trips that fall in some cluster is consistently above 80%. This indicates that there are clusters with mixed purpose labels. So either there really is a lot of ambiguity in the dataset, or people just made a lot of errors while labeling.
We can identify users with what appear to be a lot of labeling errors and follow up with them if possible. But 40% is not too shabby.
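To make the comparison above concrete, here is a minimal sketch of the two measurements: the percentage of trips that fall in some cluster when clustering each purpose separately versus clustering all trips together. This is not the code in this PR. The greedy radius grouping, the function names (`cluster_by_radius`, `pct_in_some_cluster`, `compare`), the `(lat, lon, purpose)` trip tuples, and the rule that a cluster needs at least two trips to count are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cluster_by_radius(points, radius_m=500):
    """Greedy grouping (an assumed stand-in for the real clustering):
    a point joins the first cluster whose first member is within radius_m,
    otherwise it starts a new cluster. Returns one label per point."""
    anchors, labels = [], []
    for p in points:
        for idx, anchor in enumerate(anchors):
            if haversine_m(p, anchor) <= radius_m:
                labels.append(idx)
                break
        else:
            anchors.append(p)
            labels.append(len(anchors) - 1)
    return labels


def pct_in_some_cluster(labels, min_size=2):
    """Percentage of trips whose cluster has at least min_size members.
    min_size=2 is an assumption about what "in some cluster" means."""
    if not labels:
        return 0.0
    sizes = Counter(labels)
    clustered = sum(1 for label in labels if sizes[label] >= min_size)
    return 100.0 * clustered / len(labels)


def compare(trips, radius_m=500):
    """trips: list of (lat, lon, purpose). Returns (filtered_pct, unfiltered_pct)."""
    # Without filtering: cluster every trip together, ignoring purpose.
    unfiltered = cluster_by_radius([(lat, lon) for lat, lon, _ in trips], radius_m)
    # With filtering: cluster each purpose separately, keeping labels distinct.
    by_purpose = defaultdict(list)
    for lat, lon, purpose in trips:
        by_purpose[purpose].append((lat, lon))
    filtered = []
    for purpose, pts in by_purpose.items():
        filtered += [(purpose, label) for label in cluster_by_radius(pts, radius_m)]
    return pct_in_some_cluster(filtered), pct_in_some_cluster(unfiltered)


# Made-up trips for one hypothetical user; not data from this analysis.
trips = [
    (37.0000, -122.0, "work"),
    (37.0010, -122.0, "work"),
    (37.0005, -122.0, "shopping"),  # same place as the work trips, different label
    (37.0500, -122.0, "home"),
    (37.0505, -122.0, "home"),
    (37.2000, -122.0, "gym"),
]
filtered_pct, unfiltered_pct = compare(trips)
print(f"filtered by purpose: {filtered_pct:.1f}%, unfiltered: {unfiltered_pct:.1f}%")
```

In this toy input the "shopping" trip sits within 500 meters of the "work" trips, so it only lands in a cluster when purpose is ignored. That is the same kind of mixed-purpose cluster that would push the unfiltered percentage above the filtered one.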
Next steps: