Alternative to scree plot and score plot for K selection?

jpfontenelle commented 3 years ago

Hello everyone.

I have a question that is similar to previous issues posted here that are already closed.

Using pcadapt to identify the optimal number of principal components (K) is part of a pipeline of simulations I am running. Scree plots and score plots work well; it is "easy" to choose K based on the graphical representation.

However, my simulations have many replicates, which would mean thousands of scree/score plots to inspect.

I wonder if there is any threshold that could be used to "select" K values without the need of the graphical interface.

I ran into an approach called Angle Distribution of Loading Subspaces (ADLS), which seems promising. However, my math skills are not good enough to implement it for a pcadapt object.

Along that line of thought, I wonder if I could use the differences between consecutive singular values. For example, would it be adequate to compute pcadapt.obj$singular.values[i] - pcadapt.obj$singular.values[i+1], pcadapt.obj$singular.values[i+1] - pcadapt.obj$singular.values[i+2], etc., and treat the point where the difference drops below some threshold as the "elbow" of the scree plot? Or is that too far off?
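To make the idea concrete, here is a minimal sketch of that rule in base R. The function name choose_k_by_gap and the cutoff tol are hypothetical, not part of pcadapt, and the default cutoff is an arbitrary value that would need tuning against replicates where K is known. The sketch takes K to be the number of components before the point from which every consecutive difference stays below the cutoff, and returns NA when no such flat tail exists.

```r
# Hypothetical helper (not part of pcadapt): choose K from a decreasing
# vector of singular values by looking for the start of the flat tail.
# `tol` is an assumed, arbitrary cutoff on consecutive differences.
choose_k_by_gap <- function(sv, tol = 0.05) {
  stopifnot(is.numeric(sv), length(sv) >= 2)

  # gaps[i] = sv[i] - sv[i + 1]
  gaps <- -diff(sv)
  below <- gaps < tol

  # tail_flat[i] is TRUE when gaps[i] and all later gaps are below tol
  tail_flat <- rev(cumprod(rev(below))) == 1

  # No flat tail within the computed components: flag instead of guessing
  if (!any(tail_flat)) return(NA_integer_)

  # The flat tail starts at component i, so keep the i - 1 components before it
  as.integer(which(tail_flat)[1] - 1)
}

# Made-up singular values standing in for pcadapt.obj$singular.values
sv_example <- c(0.90, 0.60, 0.30, 0.12, 0.11, 0.10, 0.095)
k_example <- choose_k_by_gap(sv_example)
print(k_example)
```

In the pipeline this would be called on the singular values of each replicate, computed with a K large enough that the flat tail is actually reached. An NA result would mark replicates that still need a manual look.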

Any ideas?

Thank you very much

privefl commented 3 years ago

Yes, visualizing both the scree plot and the score plots is the way I would recommend for choosing K.

Choosing K programmatically is a hard problem. When p > n, I would recommend reading this paper, whose method is implemented in the R package hdpca. I've tried it in the past; it doesn't give the perfect K, but often something close enough. However, you can't have too many individuals in the PCA, because this method requires computing all eigenvalues.

jpfontenelle commented 3 years ago

Hello. Thank you for the reply. I will definitely check the reference out. It might work in my case. Cheers

bcm-uga / pcadapt

Alternative to scree plot and score plot for K selection? #71