Question about integration with DataMapPlot

TutteInstitute / evoc

Embedding Vector Oriented Clustering

BSD 2-Clause "Simplified" License

Question about integration with DataMapPlot #7

Closed zilch42 closed 2 months ago

zilch42 commented 3 months ago

Great work with this package, I'm just starting to experiment with it. Very Exciting!

Just wondering about plugging the clustered data into DataMapPlot. Will UMAP (or other) still be required to reduce higher dim vectors down to 2D to supply to data_map_coords separately? Or can evoc supply that too? Just thinking if evoc is doing some of what UMAP does anyway, is there some efficiency by not recalculating the dimension reduction separately? Or is it better for the user to have discrete control over the coordinates for the visualization?

Thanks

zilch42 commented 3 months ago

One further comment on this: ideally the visual placement of the points would align as closely as possible with how the points group together within the clusters. So if the following...

UMAP(
    n_neighbors=15,
    n_components=2,
    min_dist=0.0,
    metric='cosine'
)

...is going to be as close as possible visually to whatever is happening inside this...

EVoC(
    n_neighbors = 15,
)

... then great. But if EVoC is grouping things differently from how UMAP would, then it might be useful to also have coordinates coming out of EVoC.

lmcinnes commented 3 months ago

In practice you will want a separate UMAP run for feeding into DataMapPlot unfortunately. EVoC has a very custom approach to the effective dimension reduction step, and the results will be quite bad for visualization purposes (but work very well for clustering purposes). In general the clustering should align very well with UMAP results (with occasional stray points here and there), especially with the parameter choices you have above (although you can likely vary min_dist to something larger without much loss).
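As a rough illustration of that workflow, here is a minimal sketch: EVoC clusters the original high-dimensional vectors, a separate UMAP run produces the 2D layout, and both are handed to DataMapPlot. The function names `names_from_labels` and `evoc_data_map` are hypothetical, and the sketch assumes EVoC marks noise points with -1 (the HDBSCAN convention) and that DataMapPlot treats the string "Unlabelled" as its noise label.

```python
def names_from_labels(cluster_labels, noise_name="Unlabelled"):
    """Turn integer cluster ids into string labels for plotting.

    Assumption: noise points carry the id -1, as in HDBSCAN.
    """
    return [
        noise_name if int(c) < 0 else f"Cluster {int(c)}"
        for c in cluster_labels
    ]


def evoc_data_map(embeddings):
    """Cluster with EVoC, lay out with a separate UMAP run, then plot."""
    # Heavy imports live inside the function so the helper above
    # can be used (and tested) without them installed.
    import numpy as np
    import umap
    import datamapplot
    from evoc import EVoC

    # Clustering is done on the original high-dimensional vectors.
    cluster_labels = EVoC(n_neighbors=15).fit_predict(embeddings)

    # The 2D coordinates come from an independent UMAP run,
    # using the parameters quoted earlier in the thread.
    coords = umap.UMAP(
        n_neighbors=15,
        n_components=2,
        min_dist=0.0,
        metric="cosine",
    ).fit_transform(embeddings)

    # DataMapPlot expects one string label per point.
    labels = np.array(names_from_labels(cluster_labels))
    fig, ax = datamapplot.create_plot(coords, labels)
    return fig, cluster_labels, coords
```

The neighbor graph does get computed twice here, once by each library, which is the cost of keeping the clustering and the visualization independent.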

If you really need very good alignment with the UMAP result, it is likely best to actually do your clustering on the UMAP output (and not with EVoC, which will do very weird things to that).
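For that second route, a sketch might look like the following. The thread does not say which clusterer to run on the UMAP output; HDBSCAN from scikit-learn (version 1.3 or later) is an assumption here, as are the hypothetical function name `cluster_on_umap_layout` and the `min_cluster_size` default.

```python
def cluster_on_umap_layout(embeddings, min_cluster_size=25):
    """Reduce to 2D with UMAP, then cluster those 2D coordinates.

    Assumption: HDBSCAN stands in for "a clusterer that is not EVoC";
    min_cluster_size=25 is an arbitrary illustrative value.
    """
    # Imports are local so the module loads without these packages.
    import umap
    from sklearn.cluster import HDBSCAN

    # Same UMAP parameters as quoted earlier in the thread.
    coords = umap.UMAP(
        n_neighbors=15,
        n_components=2,
        min_dist=0.0,
        metric="cosine",
    ).fit_transform(embeddings)

    # Clusters are found directly in the 2D layout, so they match
    # what is drawn on the map by construction.
    cluster_labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(coords)
    return coords, cluster_labels
```

Because the labels are derived from the same coordinates that get plotted, the visual groups and the clusters cannot disagree, at the price of clustering on a lossy 2D projection.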

zilch42 commented 2 months ago

No worries, thanks!