Choice of base ASV and influence on reproducibility

erikrikarddaniel commented 2 years ago

Hi,

A student of mine and I both feel that DivNet is a truly interesting tool for our ASV analyses. However, we have stumbled on the issue of finding a base ASV. We have read issue #14, but questions remain.

Because we have no ASVs that occur in all samples, the automatic assignment of a base ASV fails, and choosing one by hand strikes us as something that might not be quite reproducible. Your advice in issue #14 is to find "the most variance stabilising base taxon", and that this is probably a taxon present in many samples and near the median abundance. How to get there is not entirely clear to us, and what concerns us most is how this choice influences the reproducibility of our analysis.
There is a pick_base function that is not exported but does have help text. It is apparently called by other function(s), but it fails when no ASV is shared by all samples. For us, it would be great if this function always returned an ASV in a reproducible way, even when no ASV is shared by all samples, and ideally an ASV that is at least close to the most variance stabilising one. Or would that perhaps be prohibitively expensive computationally, considering you might need to run the whole algorithm for many ASVs?
It would be great if pick_base was exported, so one could call the function separately.

I hope I'm not missing something obvious in the documentation or elsewhere, and that it's clear that this is more a question than a bug report or feature request (although it could develop into the latter).

Thanks for your time and effort!

/Daniel

ailurophilia commented 2 years ago

Hi Daniel,

Thanks for reaching out! To address your last point first, we absolutely can export pick_base -- I'll release an update in the next day or so with this change.

The fact that pick_base throws errors when no taxon is shared across all samples definitely creates the need for more documentation of an analysis to ensure reproducibility. With that said, I want to make sure we are using the word 'reproducibility' in the same way – I think of this as referring to sufficient documentation of data, software, and analytical choices to allow another team of researchers to obtain the same numerical results as the group performing an original analysis. Is this the sense you mean as well?

Lastly, I'm not aware of any computationally efficient non-heuristic way of choosing a variance stabilizing taxon for divnet. I'll consult with @adw96 about changing the behavior of pick_base when no taxon is present in all samples, but my suspicion here is that the behavior of this function is intentional. That is, it may not be a great idea in this case for divnet to run without error because choice of base taxon may meaningfully impact results and so deserves some scrutiny.

I hope this helps, and I'll be in touch with updates in the coming days.

Best, David

erikrikarddaniel commented 2 years ago

Thanks for the reply, David,

I agree with your interpretation of "reproducibility". What I was hoping for was a pick_base implementation that could pick a base taxon reproducibly even when no taxon is shared by all samples. Whether that's possible I don't know, but reading issue #14 gave me the impression that it might be. In my opinion, assigning a taxon reproducibly through a function in the package would be preferable even if the result is not optimal in every case, at least as long as it's close enough to the optimum.

Looking forward to hearing from you again.

/Daniel

ailurophilia commented 2 years ago

Hi Daniel,

Thanks for clarifying – unfortunately, there is currently no such pick_base implementation (although pick_base is now exported).

My recommendation for ensuring reproducibility in this case would be to code/document your criteria for choosing a base taxon (e.g., among taxa present in at least 80% of samples, the taxon with highest median observed abundance across samples). In my understanding, we suggest fitting models with a few different base taxa when no taxon is shared across all samples essentially as a sensitivity analysis -- ideally, choice of base taxon, at least among reasonable candidate taxa, should have relatively little influence on fitted quantities, so if it does, that is one indication the model may not be particularly reliable in this instance.
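The example criterion above (among taxa present in at least 80% of samples, take the one with the highest median observed abundance) can be written down as a short, deterministic rule. The following is a minimal sketch in Python, for illustration only: DivNet is an R package, and `rank_base_candidates`, its arguments, and the example counts are hypothetical names and data, not part of DivNet or of pick_base. Ties are broken by taxon ID, which is an added assumption so that the result does not depend on input order.

```python
from statistics import median


def rank_base_candidates(counts, min_prevalence=0.8):
    """Rank taxa as base-taxon candidates (hypothetical helper, not DivNet API).

    counts: dict mapping taxon ID -> list of observed counts, one per sample,
            with the same sample order for every taxon.
    Returns the IDs of taxa detected in at least `min_prevalence` of samples,
    sorted by median count (highest first), with ties broken by taxon ID.
    """
    if not counts:
        raise ValueError("counts is empty")
    n_samples = len(next(iter(counts.values())))
    if n_samples == 0 or any(len(v) != n_samples for v in counts.values()):
        raise ValueError("every taxon needs one count per sample")

    candidates = []
    for taxon, values in counts.items():
        # Prevalence = fraction of samples in which the taxon was detected.
        prevalence = sum(1 for v in values if v > 0) / n_samples
        if prevalence >= min_prevalence:
            # Negate the median so an ascending sort puts the highest first.
            candidates.append((-median(values), taxon))

    candidates.sort()  # by median (descending), then taxon ID (ascending)
    return [taxon for _, taxon in candidates]


# Made-up counts for four ASVs across five samples.
example_counts = {
    "ASV1": [10, 0, 12, 9, 11],
    "ASV2": [50, 0, 0, 40, 60],
    "ASV3": [20, 25, 0, 22, 18],
    "ASV4": [3, 4, 5, 2, 6],
}
ranked = rank_base_candidates(example_counts, min_prevalence=0.8)
base_taxon = ranked[0]         # first choice under the documented rule
sensitivity_set = ranked[:3]   # a few candidates to refit with and compare
```

Recording the threshold, the ranking statistic, and the tie-break alongside the analysis is what makes the choice repeatable by another group. The top few entries of the ranking are natural candidates for the sensitivity analysis described above.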

I hope this is helpful!

Best, David

mooreryan commented 2 years ago

Here is a nice paper with a section about choosing a good reference component: https://doi.org/10.3389/fmicb.2021.727398. Check out the section called "Criteria for Selecting the Reference Component of the Additive Logratios".

It supports what was suggested above about picking an OTU, and it may help further guide your analysis!
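For context, the additive log-ratio (ALR) transform named in that section title has the following standard form. The notation is chosen here for illustration and is not taken from the paper or from DivNet: $x_j$ is the abundance of component $j$ in a sample and $x_{\mathrm{ref}}$ is the abundance of the chosen reference component.

$$
\mathrm{alr}(x)_j = \log\frac{x_j}{x_{\mathrm{ref}}}, \qquad j \neq \mathrm{ref}
$$

The ratio is undefined in any sample where $x_{\mathrm{ref}} = 0$, which is one reason a reference detected in all, or nearly all, samples is the natural starting point.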

erikrikarddaniel commented 2 years ago

Thanks Ryan!

ailurophilia commented 2 years ago

Hi Daniel,

I've updated pick_base so that it now allows users to specify a prevalence threshold to use in picking a base taxon -- that is, pick_base can now optionally choose among taxa that, rather than being detected in all samples, are detected in at least some proportion of samples. My hope is that this will make choosing a base taxon easier in situations where divnet does not automatically pick a base. I'm going to close this issue for now at least, but please let me know if there is some other behavior you would find useful in pick_base or if you continue to have concerns about reproducibility.

Thanks!

Best, David

erikrikarddaniel commented 2 years ago

Thanks @ailurophilia!

adw96 / DivNet, issue #119