aj-grant / navmix


[Question] Effect of sample size #3

Open mglev1n opened 9 months ago

mglev1n commented 9 months ago

The authors of a recent preprint using ClustImpute to cluster diabetes-associated variants based on associations with cardiometabolic traits recommend using a sample-size-adjusted Z-score for clustering ($Z = \beta / (\sqrt{N} \cdot SE)$) to "provide more uniform weighting": https://www.medrxiv.org/content/10.1101/2023.03.31.23287839v1. While the NAvMix paper suggests the impact of sample size should be relatively modest, empirically this additional transformation does seem to influence the number and composition of the clusters that are identified when effective sample sizes vary substantially across traits. I'm curious whether this additional transformation may be reasonable as an alternative primary analysis or as a sensitivity analysis?

aj-grant commented 8 months ago

Thanks for the question! I think whether and how to standardise the inputs into a clustering method will generally best be decided on a case-by-case basis. If there are substantial differences in sample sizes between traits, then with non-adjusted Z-scores you will be less likely to form clusters that are distinguished by the traits with smaller sample sizes. This could be desirable or not, depending on how conservative you wish to be (e.g., smaller sample sizes mean increased uncertainty in the estimates, and you may wish to put more weight on traits with lower uncertainty when clustering). In our simulation study in the NAvMix paper, we looked at scenarios where sample sizes varied and didn't see substantial differences in overall performance. But I think looking at the alternative transformation as a sensitivity analysis for a particular setting, and trying to understand any major differences, could definitely be a useful thing to do. Hope that helps!
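
For anyone wanting to try the sensitivity analysis discussed in this thread, here is a minimal Python sketch of the two transformations, assuming the summary statistics are held as variants-by-traits arrays. It is not part of navmix: the function names and all the numbers are made up for illustration, and the adjusted score simply follows the formula quoted in the question, $Z = \beta / (\sqrt{N} \cdot SE)$, with one sample size per trait.

```python
import numpy as np


def z_scores(beta, se):
    """Non-adjusted Z-scores: beta / SE, element-wise."""
    return np.asarray(beta, dtype=float) / np.asarray(se, dtype=float)


def sample_size_adjusted_z(beta, se, n):
    """Sample-size-adjusted Z-scores: beta / (sqrt(N) * SE).

    beta, se: arrays of shape (variants, traits).
    n: array of shape (traits,), one (effective) sample size per trait.
    """
    beta = np.asarray(beta, dtype=float)
    se = np.asarray(se, dtype=float)
    n = np.asarray(n, dtype=float)
    # sqrt(n) has shape (traits,) and broadcasts across the variant rows.
    return beta / (np.sqrt(n) * se)


# Hypothetical inputs: 3 variants (rows) x 2 traits (columns).
beta = np.array([[0.10, 0.20],
                 [0.05, -0.10],
                 [-0.02, 0.30]])
se = np.array([[0.01, 0.05],
               [0.01, 0.05],
               [0.01, 0.05]])
n = np.array([400_000, 16_000])  # trait 1 has a much larger sample size

z_raw = z_scores(beta, se)
z_adj = sample_size_adjusted_z(beta, se, n)

print("Non-adjusted Z-scores:\n", z_raw)
print("Sample-size-adjusted Z-scores:\n", z_adj)
```

Each matrix could then be passed to the clustering method in turn, and the resulting number and composition of clusters compared. Whether the adjusted version is appropriate remains a case-by-case decision, as noted above.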