Possible divergence between matlab-HYDRA and pyHYDRA results

anbai106 / mlni

Machine Learning in NeuroImaging (MLNI) is a python package that performs various tasks using neuroimaging data.

https://anbai106.github.io/mlni/

MIT License


Possible divergence between matlab-HYDRA and pyHYDRA results #1

Closed IoannaSkampardoni closed 3 years ago

IoannaSkampardoni commented 3 years ago

Hi, I used pyHYDRA to derive subgroups in a sample of cognitively impaired subjects. The input feature file consists of residualized ROI volumes after adjustment for age and intracranial volume. I set k_min=2, k_max=3 and cv_repetition=20. For both k=2 and k=3, I always get one subgroup that is substantially smaller than the others. In addition, the subgroups and the ARIs differ from those obtained with matlab-HYDRA. I attach a plot showing the residualized hippocampal volume (residual=original-adjusted) for each subgroup (P1, P2, P3) vs the control group (CN) for matlab-HYDRA and pyHYDRA, as well as the subgroups' sizes and ARIs. Are those differences expected? Thank you.

[Attached figure: pyHYDRA vs matlab HYDRA]

anbai106 commented 3 years ago

Thank you for this report.

The discrepancy between the Matlab and Python versions is due to the standardization method, i.e., min-max or z-score. In the original Matlab implementation, we used the z-score, which is more robust against outliers.

However, when I implemented pyHYDRA, I switched to min-max standardization, which scales all features to similar ranges. I did this because I found that the min-max method is faster for the binary classification pipeline. Please refer to this commit: @c6c7c35.
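To make the difference between the two standardization methods concrete, here is a minimal sketch using scikit-learn's `MinMaxScaler` and `StandardScaler`. It is not pyHYDRA's own preprocessing code; the feature matrix is synthetic, and the single extreme value is an assumption added only to show how the two scalings treat a column that contains an outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a subjects x ROI feature matrix (not real data).
X = rng.normal(loc=0.0, scale=1.0, size=(100, 3))
X[0, 0] = 25.0  # assumed outlier in the first feature

# Min-max scaling: each column is mapped onto the interval [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# z-score: each column gets zero mean and unit standard deviation.
X_zscore = StandardScaler().fit_transform(X)

# Spread of each column after scaling.
std_minmax = X_minmax.std(axis=0)
std_zscore = X_zscore.std(axis=0)
print("min-max column std:", std_minmax)
print("z-score column std:", std_zscore)
```

After min-max scaling, the column containing the extreme value has a smaller spread than the other columns, because the outlier alone defines its range. After z-scoring, every column has unit variance. The two scalings therefore weight features differently, which is one way the downstream subgroups can end up differing.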

Could you please run git pull locally to get the newest version, and rerun your Matlab versus pyHYDRA experiments? Note that the Matlab version by default (with k=-20) runs 20-fold CV, which is rarely used because each fold then contains far fewer test subjects. In any case, HYDRA is not really doing CV there; the CV serves to introduce data variance across folds. For a fair comparison, please also perform 20-fold CV for pyHYDRA. By default, pyHYDRA performs repeated hold-out CV with 20 repetitions.

Thank you

IoannaSkampardoni commented 3 years ago

Thank you for the explanation! I reran pyHYDRA using z-score instead of min-max scaling, and the subtypes and ARIs look similar to those derived using the matlab HYDRA.
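One way to quantify how similar two sets of subtype assignments are is the adjusted Rand index (ARI) between them, which ignores how the subgroups happen to be numbered. The sketch below uses scikit-learn's `adjusted_rand_score`; the label vectors are made up for illustration and are not the assignments from this analysis.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical subtype labels for the same eight subjects (illustrative only).
labels_matlab = [1, 1, 1, 2, 2, 2, 1, 2]

# Same partition, but the subgroup names are swapped.
labels_python_same = [2, 2, 2, 1, 1, 1, 2, 1]

# One subject (index 2) assigned to the other subgroup.
labels_python_diff = [2, 2, 1, 1, 1, 1, 2, 1]

ari_same = adjusted_rand_score(labels_matlab, labels_python_same)
ari_diff = adjusted_rand_score(labels_matlab, labels_python_diff)
print("ARI, identical partition:", ari_same)
print("ARI, one subject moved:", ari_diff)
```

An ARI of 1 means the two partitions are identical up to relabeling; values below 1 indicate disagreement on some subjects.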

Regarding the CV, pyHYDRA uses cv_strategy == 'hold_out' by default, which corresponds to sklearn's StratifiedShuffleSplit. If I understand correctly, the matlab HYDRA uses simple k-fold (corresponding to sklearn's StratifiedKFold, without re-shuffling and re-splitting in each iteration). Therefore, to make the two HYDRA versions comparable, I need to select cv_strategy == 'k_fold' in pyHYDRA. Is that correct? Thank you!

anbai106 commented 3 years ago

Yes, that's correct.
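For reference, the two strategies discussed above can be contrasted directly with the scikit-learn splitters named in the question. This is a sketch on synthetic labels, not pyHYDRA's internal code; the group sizes and the hold-out test size of 20 subjects are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Synthetic labels: 60 controls (0) and 40 patients (1); not real data.
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # feature values do not affect the splits

n_splits = 20

# Repeated hold-out: every repetition draws a fresh random stratified split.
# test_size=20 is an assumed value, not pyHYDRA's setting.
hold_out = StratifiedShuffleSplit(n_splits=n_splits, test_size=20, random_state=0)

# k-fold without shuffling: the data are partitioned once into 20 folds.
k_fold = StratifiedKFold(n_splits=n_splits)

hold_out_tests = [test for _, test in hold_out.split(X, y)]
k_fold_tests = [test for _, test in k_fold.split(X, y)]

# How often each subject lands in a test set under each strategy.
k_fold_counts = np.bincount(np.concatenate(k_fold_tests), minlength=len(y))
hold_out_counts = np.bincount(np.concatenate(hold_out_tests), minlength=len(y))

print("k-fold test-set sizes:", [len(t) for t in k_fold_tests])
print("hold-out test-set sizes:", [len(t) for t in hold_out_tests])
print("k-fold: times each subject is tested:", np.unique(k_fold_counts))
print("hold-out: times each subject is tested:", np.unique(hold_out_counts))
```

With 20-fold CV on 100 subjects, each test fold holds only 5 subjects and every subject is tested exactly once. With repeated hold-out, test sets overlap across repetitions, so individual subjects can be tested several times or not at all.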

Thanks

anbai106 commented 3 years ago

This issue has been resolved. Please reopen if necessary.

The comparison between the Matlab version https://github.com/evarol/HYDRA and pyHYDRA has been verified in this issue; the two lead to similar subtyping results, as expected.

Moreover, the two versions have been compared previously. For SCZ, pyHYDRA was applied in the OHBM abstract: https://www.researchgate.net/publication/346965816_Multi-scale_feature_reduction_and_semi-supervised_learning_for_parsing_neuroanatomical_heterogeneity. It reproduces the two subtypes reported in the Brain paper https://academic.oup.com/brain/article/143/3/1027/5758311?login=true, which used the Matlab version on exactly the same study population.