MDAnalysis / mdaencore

Ensemble overlap comparison software for molecular data.
http://www.mdanalysis.org/mdaencore/
GNU General Public License v2.0

DBSCAN does not make outliers clear #33

Open lilyminium opened 4 years ago

lilyminium commented 4 years ago

Expected behavior

DBSCAN is a clustering method that can identify outliers. I expect these outliers to be clearly indicated in some way. I also expect that outliers are treated properly in similarity measures.

Actual behavior

The method implemented in MDAnalysis makes the outlier group (label == -1) look like an actual cluster (real cluster labels start from 0). This doesn't ultimately matter, though, as encode_centroid_info drops the label information anyway.

https://github.com/MDAnalysis/mdanalysis/blob/9bcf6f4c118e1ea137e8514bd60cbd1cd1972062/package/MDAnalysis/analysis/encore/clustering/ClusteringMethod.py#L300-L306

Also, calling the first frame in each cluster the centroid, without stating this very clearly in the docs, seems like a bad idea. It also gives the outlier group a centroid.

Finally, ClusterCollection does not keep the cluster labels. This makes it hard to look for special (i.e. negative) cluster labels.
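To illustrate what "clearly indicated" could look like, here is a minimal sketch that keeps the original labels and separates the noise points from the real clusters. It assumes a flat list of per-frame labels in which -1 marks noise (the convention used by scikit-learn's DBSCAN). The helper `group_frames_by_label` is hypothetical and not part of MDAnalysis or mdaencore.

```python
from collections import defaultdict

NOISE_LABEL = -1  # label that scikit-learn's DBSCAN assigns to noise points


def group_frames_by_label(labels):
    """Split per-frame labels into real clusters and outliers.

    Hypothetical helper for illustration only. Returns a dict mapping
    each non-noise label to its frame indices, plus a separate list of
    the frame indices labelled as noise.
    """
    clusters = defaultdict(list)
    outliers = []
    for frame, label in enumerate(labels):
        if label == NOISE_LABEL:
            outliers.append(frame)  # kept apart, never given a centroid
        else:
            clusters[label].append(frame)
    return dict(clusters), outliers


labels = [0, 0, -1, 1, 1, -1, 0]
clusters, outliers = group_frames_by_label(labels)
print(clusters)  # {0: [0, 1, 6], 1: [3, 4]}
print(outliers)  # [2, 5]
```

Because the dictionary keys are the original labels, a caller can still look for special (negative) labels, and the outliers never appear as a cluster with a representative frame.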

Current version of MDAnalysis

Possible fix

Easy option

More work option

lilyminium commented 3 years ago

This also causes problems for ensemble similarity analysis, because the outlier "cluster" is treated like a real cluster. If a conformation in trajectory A and a conformation in trajectory B both fall in the outlier cluster, that is counted as a point of similarity, when in reality these conformations should be considered unrelated.
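A toy calculation may make the effect concrete. The sketch below is not the encore implementation. It assumes a similarity score based on the Jensen-Shannon divergence between per-ensemble cluster populations, and it compares two ways of handling noise: pooling all -1 frames into one shared "cluster", or treating every outlier as its own singleton. The function names and the `ensemble_id` parameter are hypothetical.

```python
import math
from collections import Counter

NOISE_LABEL = -1  # noise label, following scikit-learn's DBSCAN convention


def cluster_populations(labels, ensemble_id, noise_as_singletons):
    """Fraction of an ensemble's frames in each cluster (hypothetical helper).

    If noise_as_singletons is True, each outlier gets a key unique to
    its ensemble and frame, so outliers never match one another.
    """
    keys = []
    for frame, label in enumerate(labels):
        if label == NOISE_LABEL and noise_as_singletons:
            keys.append((ensemble_id, frame))
        else:
            keys.append(label)
    counts = Counter(keys)
    return {key: count / len(labels) for key, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            total += 0.5 * pk * math.log(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log(qk / mk)
    return total


# Two ensembles that share no real cluster; half of each is noise.
labels_a = [0, 0, -1, -1]
labels_b = [1, 1, -1, -1]

pooled = js_divergence(
    cluster_populations(labels_a, "A", noise_as_singletons=False),
    cluster_populations(labels_b, "B", noise_as_singletons=False),
)
separate = js_divergence(
    cluster_populations(labels_a, "A", noise_as_singletons=True),
    cluster_populations(labels_b, "B", noise_as_singletons=True),
)
print(pooled < separate)  # True
```

In this example the two ensembles have no real cluster in common, yet pooling the outliers gives half the maximum divergence (0.5 ln 2 instead of ln 2), so they look artificially similar. Treating outliers as singletons is only one possible choice; dropping them before the comparison is another.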

mtiberti commented 12 months ago

We will update the documentation and code to add a warning when DBSCAN is used, so that users are aware of this issue.