How is `n_components` to be selected?

I was looking back through the code for your simulations in the mSystems paper trying to understand how I would use gemelli with my own data. I'm wondering whether you could clarify for me how the rank/n_components argument in deicode/gemelli is supposed to be used.

In your Gemelli tutorials, I noticed that the default value of n_components is being used (i.e., 3) in the IBD tutorial when there are two groups (i.e., control and crohn's) and in the Moving Pictures it is using auto-rpca when there isn't a clear number of groups to use.

However, when I looked through the code for your case studies, it appears that you're using rank=2 for the Sponge and rank=3 for the Sleep Apnea datasets when running deicode_rpca (see here). The rank for the Sponge dataset corresponds with the two levels of health_status, but I'd expect the rank for the Sleep Apnea dataset to be 2 since there are two levels in the exposure_type column of the metadata file. Meanwhile, in the negative and positive control simulations you use the default when running rclr and OptSpace although I'd expect it to be rank=2.

When I compared the output using different n_components values for the same dataset there does appear to be significant variation in the resulting distances, which makes me think that the value chosen is important. Here's an example (I noticed that the default value for --min-feature-count changed between the paper and gemelli, which was 10 and is now zero)...

gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_2 --min-sample-count 1000 --min-feature-count 10 --n-components 2
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_3 --min-sample-count 1000 --min-feature-count 10  --n-components 3
gemelli rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_9 --min-sample-count 1000 --min-feature-count 10  --n-components 9
gemelli auto-rpca --in-biom case_studies/data/Sponges/table.biom --output-dir rank_auto --min-sample-count 1000 --min-feature-count 10

Here's a comparison of the distances with the distances with --n-components 3 on the x-axis. The auto-rpca approach estimates the rank at 3.

comparison

The first axis explains 84, 66, and 38% of the variation when using --n-components values of 2, 3, and 9, respectively.

I think I've proven this to myself as I file this issue, but am I correct that we should be picking a --n-components value that we expect to correspond to the major treatment group if it is a cross-sectional study? I'm just a bit uneasy since something else seemed to be going on in the mSystems paper and for my example, auto-rpca didn't seem to pick the correct rank.

Hi @pschloss,

Thanks for bringing this up and for your patience, for some reason I did not see a notification. Good question!

The short answer is that n-components=3 has worked well as a default in practice (now that the method has been out for a while), if you are unsure then that is a good number to use (and only what most people even look at when exploring ordination results). Going too high (>10) can lead to issues including an explosion in run time. It is not that the rank/n-components correspond to the major treatment group but rather the number of factors that might contribute to the difference between samples (similar to PCA components or PCoA coordinates).

Unfortunately, the auto-rpca method we added based on this paper section "C. Rank estimation algorithm", which tries to choose the number of components based on approximate the explained variance (similar to how many choose n-components in PCoA or PCA), does not seem to work well in all cases (e.g., the issue you put in on Gemelli). So we are deprecating the auto-rpca function in the upcoming versions and I would not suggest it be used. We are working on replacing auto-rpca with new and hopefully more robust estimation methods (it is not a simple thing to solve).

While pre-filtering has been generally beneficial for this kind of analysis, especially filtering by the number of observations across samples and not by the sum across samples, has also helped (data trimming is common for these types of algorithms). There are a lot of varying needs based on studies, opinions, and contradictory information on pre-filtering so we removed it by default in Gemelli so the user can make that decision explicitly.

The proportion explained in RPCA ordinations will always sum to one across however many components are chosen, so I would not read into that too much other than the relative importance of an axis within an ordination. Similarly, the distances are calculated from the Euclidean distance of the sample ordination and itself, so it is not surprising that it changes as the number of components changes.

I hope this helps!

cameronmartino / deicode-benchmarking

How is `n_components` to be selected? #1