Diversity estimates not plausible

ARetter90 commented 2 years ago

Hi,

I already posted this in the divnet-rs page as I computed my diversity estimates with divnet-rs (5 replicates).

When computing Shannon and Inverse Simpson diversities for my samples from the divnet-rs output, samples that have a very low observed richness have shannon diversities of 9 and higher - how can this be? This is biologically not logical if i only have say 5 species in that sample and such a high shannon index and inverse simspn numbers of >1000. Do you have any suggestion on how to deal with this?

Alice

ailurophilia commented 2 years ago

Hi Alice,

That does indeed sound counterintuitive. Do you have a small reproducible example you can provide? E.g., count data you are feeding to divnet along with the code you are using (a short summary of output could also be useful)? This would make it easier to give you useful feedback.

Thanks!

Best, David

adw96 commented 2 years ago

Hi @ARetter90 -- thanks for posting. I'll leave this in @ailurophilia 's very capable hands!

@ailurophilia -- this is most likely due to the use of the psuedocount, and I'm not sure how the issue could be totally alleviated without a completely different approach to modeling and estimation. I wonder if trying a smaller psuedocount could partially alleviate the issue (x log x -> 0 as x -> 0). I think the relevant argument is perturbation. Hope that helps you troubleshoot!

mooreryan commented 2 years ago

Linking the issue from the divnet-rs page for reference: https://github.com/mooreryan/divnet-rs/issues/9. I closed the issue there to keep the troubleshooting in one space.

ARetter90 commented 2 years ago

I have attached the matrix and asv table I have used. the configurations were as follows: [model] em_iter = 6 em_burn = 3 mc_iter = 500 mc_burn = 250 stepsize = 0.01 perturbation = 0.05 replicates = 3 base_taxa = 38

[io] count_table = "./matrix_ASV_all.csv" output = "./Gw_divnet_output_all_30_06_22.csv" sample_data= "./matrix_null.csv" [misc] random_seed = 213000 matrix_null.csv matrix_ASV_all.csv

The output matrix is too large to be uploaded - do you need it?

ARetter90 commented 2 years ago

Hi @ARetter90 -- thanks for posting. I'll leave this in @ailurophilia 's very capable hands!

@ailurophilia -- this is most likely due to the use of the psuedocount, and I'm not sure how the issue could be totally alleviated without a completely different approach to modeling and estimation. I wonder if trying a smaller psuedocount could partially alleviate the issue (x log x -> 0 as x -> 0). I think the relevant argument is perturbation. Hope that helps you troubleshoot!

Ok thank you that might be it! Which perturbation would you chose then instead?

mooreryan commented 2 years ago

I can run the latest divnet-rs and try and reproduce your results.

One thing I did notice though...in your matrix_null.csv sample data file, each sample is in its own category using one-hot encoding; did you mean to do that?

mooreryan commented 2 years ago

By the way, I think your issue is a combination of the pseudocount plus the fact that you are not using any sample covariates in your model.

I made a smaller data set from your original for testing. I first sorted the taxa by count and then took 1000 of those evenly spaced along the the most to least abundant (to keep the same-ish abundance distribution), and removed any samples with 0 counts in the remaining 1000 taxa. Ran that with divnet to get diversity estimates.

The samples with the highest diversity estimates had a Shannon-div of about 6.9. For example, sample 87_Autumn_DNA had a Shannon index of ~6.868. That sample has 4 non-zero taxa with counts 2, 1, 1, and 1. So it's mostly zeros, which get replaced with the pseudocount (in your case 0.05). Now, 6.868 may seem kind of random, but if you convert it to the corresponding effective number, you get exp(6.868) => 961, which is right around the number of taxa that had a count of zero in that sample and so were replaced with the pseudocount. Since that sample is basically four taxa with counts near 1, and 996 taxa with "count" of 0.05, that 961 number seems about right.

So basically, since there isn't any sample groupings, each of these tiny samples has to "stand alone" and you end up getting those weird estimates. Is there a particular reason for you not including sample metadata/groupings in the model?

ARetter90 commented 2 years ago

Hej,

thank you for your long answer - I just wanted to recreate what divnet would create as output if you put the model matrix to NULL so I do not have to compute estimates for every single category I want to test since it is so computationally intensive I just wanted to have one matrix where I later test groups against each other with betta() and betta_random() - but you are saying it would be better if I compute my estimates directly from my covariates? My problem is really just the time that it takes and the many covariates that I want to test. But if you say that's the way to go especially to get the things published then I will do that? And you are saying the pertubation is fine with 0.05?

ailurophilia commented 2 years ago

Hi Alice,

Thanks for the additional information! I think I could provide more helpful advice if I had a little more clarity regarding the structure of your data and your analytical goals, since how best to use divnet will typically depend on these things (default settings may or may not be useful depending on what your aims are). Specifically, it would be great to know a bit more about the following:

What kinds of samples do you have sequencing data for? (I.e., what environments did you sample from?)
Do you have any technical replicate measurements (i.e., repeated measurements of the same sample)?
What covariates have you collected, and what comparisons would you like to be able to make across covariates?

Thanks!

Best, David

ARetter90 commented 2 years ago

Hi David,

-> I have sequencing data on groundwater samples, as well as affiliated surface waters (90 groundwater, 28 river) from two seasons spring and fall of the same year. And I have DNA and RNA data on all of them which I also want to compare. ->I have repeated measurments since we sampled the same wells from both seasons. ->I want to compare nucleic acid fractions between DNA and RNA and then only for RNA: comparing between seasons, between aquifer regions (we have 8 of them along an alpine river transect), and Landuse (corine land cover) ->I also want to test some ecological hypothesis where I need effective species against disturbances such as nitrate concentrations or organic matter content

Thanks a lot! Alice

ARetter90 commented 2 years ago

Hej,

I have done everything with covariate estimations now on our university cluster and it runs ok fast. So is the approach correct that if I want to test lets say diversity estimates between seasons I run the ASV table with a design matrix of autumn and spring and if i want to compare within autumn, i run an asv table only of the autumn data and compare river and groundwater samples? I wont create a design matrix where I test both at the same time, right? I put pertubation down to 0.01 now, what do you think about that - how would I justify it? And when I need overall shannon and simpson estimates, is one-hot encoding matrix still ok? ( To see a change with disturbances such as nitrate concentrations for instance)

Best, Alice

mooreryan commented 2 years ago

is one-hot encoding matrix still ok?

If you want divnet-rs variable encoding to match the way the R package does it, you will need to use dummy encoding rather than one-hot. Here is an example of exporting data from R.

ARetter90 commented 2 years ago

I know how to create these matrices and input for divnet-rs - but how would I create a dummy encoded matrix for the purpose of comparing all samples? Would I just use my sample ID as variable for dummy encoding with no intercept?

mooreryan commented 2 years ago

Ah, I see, when you say overall shannon/simpson estimates you mean with no covariates (ie using the samples themselves as a covariate) is that correct? If that is what you want to do, you can remove one of the columns in the one-hot encoded matrix_null.csv that you sent, and it will be in dummy encoding.

However, I still don't think it is a good idea to run your full data set with no covariates. Even with perturbation of 0.01, you will still have the same problem of the very high diversity estimates. Many of samples have quite low sequencing depth (e.g., 36 of your samples have fewer than 1000 reads or counts) and you have a pretty high number of taxa...when you combine that with not providing covariate info, it will result in a noticeable effect from the zero-replacement on decent number of your samples (ie the really high diversity estimates that you saw for some samples).

The Willis lab members will have a better idea as to whether it is a good idea or not though!

AlexaBennett commented 1 year ago

Hi all! (please let me know if this should be a new issue) Has there been any work into how to identify if an estimate is heavily influenced by sudocounts? and any minimum recommendations for the number of features, total observation, and zeros where one would expect for the sudocounts to begin to significantly influence the estimates?

svteichman commented 1 year ago

Hi @AlexaBennett!

In the paper written on the methods implemented in DivNet, you can check out Figure 2, which investigates the effect of pseudocounts on a few of the diversity metrics implemented in DivNet. This gives an example of investigating in a data analysis how changing the magnitude of pseudocounts changes diversity estimates. Additionally, when you run the function divnet on your data, you have the option to change the perturbation parameter. You can use this parameter to run your own sensitivity analysis by changing the value of the perturbation (which determines the pseudocount) and see how this affects the results that you get.

There are no minimum recommendations for the number of features, total observations, and zeros where we would expect the pseudocounts to start to significantly influence the estimates, it will depend on a combination of those things (which is why running a sensitivity analysis on your data or similar simulated data could be helpful in your setting).

Let me know if you have any questions about this.

Best, Sarah

adw96 / DivNet

Diversity estimates not plausible #126