Closed pschloss closed 10 months ago
Hi @pschloss,
Thanks for bringing this up and for your patience, for some reason I did not see a notification. Good question!
The short answer is that n-components=3
has worked well as a default in practice (now that the method has been out for a while), if you are unsure then that is a good number to use (and only what most people even look at when exploring ordination results). Going too high (>10) can lead to issues including an explosion in run time. It is not that the rank/n-components correspond to the major treatment group but rather the number of factors that might contribute to the difference between samples (similar to PCA components or PCoA coordinates).
Unfortunately, the auto-rpca
method we added based on this paper section "C. Rank estimation algorithm", which tries to choose the number of components based on approximate the explained variance (similar to how many choose n-components in PCoA or PCA), does not seem to work well in all cases (e.g., the issue you put in on Gemelli). So we are deprecating the auto-rpca
function in the upcoming versions and I would not suggest it be used. We are working on replacing auto-rpca
with new and hopefully more robust estimation methods (it is not a simple thing to solve).
While pre-filtering has been generally beneficial for this kind of analysis, especially filtering by the number of observations across samples and not by the sum across samples, has also helped (data trimming is common for these types of algorithms). There are a lot of varying needs based on studies, opinions, and contradictory information on pre-filtering so we removed it by default in Gemelli so the user can make that decision explicitly.
The proportion explained in RPCA ordinations will always sum to one across however many components are chosen, so I would not read into that too much other than the relative importance of an axis within an ordination. Similarly, the distances are calculated from the Euclidean distance of the sample ordination and itself, so it is not surprising that it changes as the number of components changes.
I hope this helps!
I was looking back through the code for your simulations in the mSystems paper trying to understand how I would use gemelli with my own data. I'm wondering whether you could clarify for me how the
rank
/n_components
argument in deicode/gemelli is supposed to be used.In your Gemelli tutorials, I noticed that the default value of
n_components
is being used (i.e., 3) in the IBD tutorial when there are two groups (i.e., control and crohn's) and in the Moving Pictures it is usingauto-rpca
when there isn't a clear number of groups to use.However, when I looked through the code for your case studies, it appears that you're using
rank=2
for the Sponge andrank=3
for the Sleep Apnea datasets when runningdeicode_rpca
(see here). The rank for the Sponge dataset corresponds with the two levels ofhealth_status
, but I'd expect the rank for the Sleep Apnea dataset to be 2 since there are two levels in theexposure_type
column of the metadata file. Meanwhile, in the negative and positive control simulations you use the default when running rclr and OptSpace although I'd expect it to berank=2
.When I compared the output using different
n_components
values for the same dataset there does appear to be significant variation in the resulting distances, which makes me think that the value chosen is important. Here's an example (I noticed that the default value for--min-feature-count
changed between the paper andgemelli
, which was 10 and is now zero)...Here's a comparison of the distances with the distances with
--n-components 3
on the x-axis. Theauto-rpca
approach estimates the rank at 3.The first axis explains 84, 66, and 38% of the variation when using
--n-components
values of 2, 3, and 9, respectively.I think I've proven this to myself as I file this issue, but am I correct that we should be picking a
--n-components
value that we expect to correspond to the major treatment group if it is a cross-sectional study? I'm just a bit uneasy since something else seemed to be going on in the mSystems paper and for my example,auto-rpca
didn't seem to pick the correct rank.