Open lilyminium opened 4 years ago
This also results in issues for ensemble similarity analysis. The outlier "cluster" is treated like a real cluster. Therefore, if a conformation in trajectory A is in the outlier cluster and a conformation in trajectory B is in the outlier cluster, it is treated as a point of similarity -- in reality these conformations should be unrelated.
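To make the effect concrete, here is a minimal sketch. It assumes the similarity measure is a Jensen-Shannon divergence between per-trajectory cluster populations; the helper names (`populations`, `js_divergence`, `split_outliers`) are hypothetical and are not MDAnalysis API. It compares treating label -1 as one shared cluster against treating every outlier frame as its own singleton.

```python
import math
from collections import Counter

OUTLIER = -1  # DBSCAN's conventional noise label


def populations(labels, keys):
    """Fraction of frames carrying each label, in the order given by keys."""
    counts = Counter(labels)
    return [counts.get(k, 0) / len(labels) for k in keys]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def split_outliers(labels, prefix):
    """Hypothetical fix: give each outlier frame a unique label."""
    return [(prefix, i) if lab == OUTLIER else lab
            for i, lab in enumerate(labels)]


# Toy labels: the two trajectories share nothing except outlier frames.
traj_a = [0, 0, -1, -1]
traj_b = [1, 1, -1, -1]

# Naive: the outlier group counts as a cluster common to both trajectories.
keys = sorted(set(traj_a) | set(traj_b))
naive = js_divergence(populations(traj_a, keys), populations(traj_b, keys))

# Strict: outliers never match each other, so the supports are disjoint.
a_split = split_outliers(traj_a, "A")
b_split = split_outliers(traj_b, "B")
keys_split = list(set(a_split) | set(b_split))
strict = js_divergence(populations(a_split, keys_split),
                       populations(b_split, keys_split))

print(naive, strict)
```

In this toy case the naive divergence is lower than the strict one, i.e. the trajectories look more similar than they should purely because both contain outliers.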
We will update the documentation and code to add a warning when DBSCAN is used, so that users are aware of this issue.
Expected behavior
DBSCAN is a clustering method that can identify outliers. I expect these outliers to be clearly indicated in some way. I also expect that outliers are treated properly in similarity measures.
Actual behavior
The method implemented in MDAnalysis makes the outlier group (label==-1) look like an actual cluster (labels start from 0), although this doesn't ultimately matter, as encode_centroid_info drops label information anyway: https://github.com/MDAnalysis/mdanalysis/blob/9bcf6f4c118e1ea137e8514bd60cbd1cd1972062/package/MDAnalysis/analysis/encore/clustering/ClusteringMethod.py#L300-L306
Also, calling the first frame in the cluster the centroid without mentioning this very clearly in the docs seems like a bad idea. This also gives the outlier group a centroid.
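A rough illustration of both points, in plain Python rather than the linked MDAnalysis code: shifting the labels so they start from 0 makes the outlier group indistinguishable from a cluster, and picking the first frame per label hands that group a "centroid". The function name `first_frame_representatives` and the toy labels are assumptions made for this sketch.

```python
def first_frame_representatives(labels):
    """Map each label to the index of the first frame that carries it."""
    reps = {}
    for frame, label in enumerate(labels):
        reps.setdefault(label, frame)
    return reps


# Toy DBSCAN-style output: -1 marks outlier frames.
labels = [0, 0, -1, 1, -1, 1]

# Shifting by +1 so labels start from 0 hides the special meaning of -1.
shifted = [label + 1 for label in labels]

# The outlier group (now label 0) receives a representative like any cluster.
reps = first_frame_representatives(shifted)

# Keeping the raw labels lets the outlier group be excluded explicitly.
real_reps = {label: frame
             for label, frame in first_frame_representatives(labels).items()
             if label != -1}

print(shifted, reps, real_reps)
```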
Finally, ClusterCollection does not keep the cluster labels. This makes it hard to look for special (i.e. negative) cluster labels.
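For discussion, a minimal sketch of a container that keeps the raw labels and warns about the outlier group. `LabeledClusters` is a hypothetical stand-in written for this issue; it does not reflect the actual ClusterCollection interface.

```python
import warnings


class LabeledClusters:
    """Hypothetical container that keeps raw labels and flags outliers."""

    OUTLIER = -1  # DBSCAN's conventional noise label

    def __init__(self, labels):
        self.labels = list(labels)
        if self.OUTLIER in self.labels:
            warnings.warn(
                "Label -1 marks DBSCAN outliers; it is not a real cluster.",
                UserWarning,
            )

    @property
    def clusters(self):
        """Frame indices grouped by label, excluding the outlier group."""
        groups = {}
        for frame, label in enumerate(self.labels):
            if label != self.OUTLIER:
                groups.setdefault(label, []).append(frame)
        return groups

    @property
    def outliers(self):
        """Frame indices that DBSCAN labelled as noise."""
        return [frame for frame, label in enumerate(self.labels)
                if label == self.OUTLIER]


# Record the warning instead of printing it, to show that it is raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    collection = LabeledClusters([0, -1, 0, 1, -1])

print(collection.clusters, collection.outliers, len(caught))
```

Keeping `labels` as a plain attribute makes it easy to look for negative labels, while `clusters` and `outliers` keep the two kinds of groups apart.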
Current version of MDAnalysis
MDAnalysis version (python -c "import MDAnalysis as mda; print(mda.__version__)"): 0.20.2-dev
Python version (python -V)?
Possible fix
Easy option
ClusterCollection
) and add a warning that it's not a real cluster
More work option