adw96 / DivNet

diversity estimation under ecological networks

DivNet on rRNA gene counts derived from metagenomes? #128

Open mgabriell1 opened 2 years ago

mgabriell1 commented 2 years ago

Hi, First of all, thanks for developing this tool!

I have a few metagenomic samples for which I've counted the read pairs that map as proper pairs to the SSU rRNA genes present in SILVA, and I was thinking of using DivNet to provide more support for my beta diversity analyses.

I definitely have lower counts per gene because shotgun sequencing is untargeted, which I guess could give the added pseudocount a somewhat larger influence.
There will also be a larger number of singletons, which might be present in one sample but not in another. Once the uncertainty due to the sampling process is taken into account, I expect these would not be considered sufficient evidence of a difference. So I suspect that this would result in a very conservative analysis.
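
To make the pseudocount concern concrete, here is a minimal sketch in Python. It is not DivNet's code: it assumes, purely for illustration, that zeros are replaced by a pseudocount of 0.5, and the two count vectors are made-up numbers. It only shows that the same pseudocount makes up a larger share of the total when counts are low.

```python
def pseudocount_share(counts, pseudocount=0.5):
    """Fraction of the adjusted total that comes from pseudocounts.

    Assumption for illustration: every zero is replaced by `pseudocount`.
    """
    zeros = sum(1 for c in counts if c == 0)
    added = zeros * pseudocount
    return added / (sum(counts) + added)


# Hypothetical counts with the same composition at two sequencing depths.
amplicon_like = [5000, 1200, 300, 0, 0]
shotgun_like = [50, 12, 3, 0, 0]

share_deep = pseudocount_share(amplicon_like)
share_shallow = pseudocount_share(shotgun_like)

print(f"deep:    {share_deep:.5f}")
print(f"shallow: {share_shallow:.5f}")
```

With these made-up numbers the pseudocounts account for roughly 100 times more of the total in the shallow sample than in the deep one.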

Even given the potential conservative nature of this, would it be correct to use it also in my scenario? Thank you again for your time!

Marco

scubalaina commented 1 year ago

Hi there,

I have a similar question and just wanted to boost this! I'm working with RNAP (B and B' subunit genes separately), which are single-copy markers, so I don't have to worry about copy numbers skewing things. But I'm wondering how the diversity calculations are implemented and interpreted with metagenomic data in which the whole single-gene community accounts for only a very small portion of the reads/members of the community. In other words, their relative abundances will not sum to 1?

Thanks, Alaina :)

mooreryan commented 1 year ago

@scubalaina I have used DivNet in a similar way to you. When you run it on the subcommunity (i.e. just the RNA pol seqs), you are passing the data to DivNet as counts, right? If so, it will go through its process treating that as the samples/community in the right way.
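
A minimal Python sketch of the point about the subcommunity: when only the marker-gene counts are passed in, proportions are taken relative to the marker-gene total, so they sum to 1 no matter how small a share of the metagenome those reads are. The counts are hypothetical, and the plug-in Shannon index below is a plain textbook calculation, not DivNet's estimator.

```python
import math


def proportions(counts):
    """Relative abundances within the supplied set of counts."""
    total = sum(counts)
    return [c / total for c in counts]


def shannon(counts):
    """Plug-in Shannon index (natural log); not DivNet's estimator."""
    return -sum(p * math.log(p) for p in proportions(counts) if p > 0)


# Hypothetical marker-gene counts from one metagenome.
marker_counts = [120, 45, 30, 5]
# Hypothetical library size; it never enters the calculation.
total_metagenome_reads = 2_000_000

within = proportions(marker_counts)
print(sum(within))             # sums to 1 up to floating-point error
print(shannon(marker_counts))
```

The total number of metagenome reads only matters here through how many marker-gene reads it yields, which is what drives the uncertainty.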

scubalaina commented 1 year ago

Hi Ryan,

OK, great! I am using reads per kilobase because each gene has a different length and I need to normalize for that. It's great that it has worked for you with a subcommunity of the data, so it should work similarly for mine.

Thanks, Alaina :)


mooreryan commented 1 year ago

@scubalaina something to keep in mind about normalization: you will be changing the read counts, which could have an effect on variance estimation. Check out this tiny example. It's a silly, contrived example where every gene has the same length, but the counts are still normalized by gene length (i.e. the count is reduced equally for all samples/genes in this particular example, which increases the variance). Of course this is just a silly example, but the point is that normalizing could impact variance estimation. In practice, though, I'm not sure how much of an issue it will be. Someone from the Willis lab will have to comment on that.
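
The attached R script is not reproduced here. As a rough stand-in, this Python sketch shows the same kind of effect with a simple multinomial standard error, assuming the scaled values are treated as if they were counts. That is an illustrative assumption, not DivNet's variance model, and the counts and the 4 kb length are made up.

```python
import math


def proportion_se(counts, index):
    """Multinomial standard error of one proportion: sqrt(p * (1 - p) / n).

    The values in `counts` are treated as counts even if they were rescaled.
    """
    n = sum(counts)
    p = counts[index] / n
    return math.sqrt(p * (1 - p) / n)


raw = [400, 200, 200]   # hypothetical raw read counts
gene_length_kb = 4.0    # every gene has the same length in this toy case
rpk = [c / gene_length_kb for c in raw]

se_raw = proportion_se(raw, 0)
se_rpk = proportion_se(rpk, 0)

print(se_raw, se_rpk, se_rpk / se_raw)
```

The proportions are identical before and after scaling, but the apparent depth drops by a factor of 4, so the standard error roughly doubles (the square root of 4).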

One other thing if you're doing some normalization: think of a gene in a sample that has a low count like 2, but it is a 4 kb gene, so its "per kilobase" count would be 0.5. Depending on your choice of pseudocount (for example, 0.5 was chosen in the DivNet manuscript for the analysis), that could be around the same as the normalized count. Another thing to keep in mind.
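
The arithmetic in that example can be checked with a few lines of Python. The helper names are hypothetical, and replacing zeros with the pseudocount is an assumption made for illustration; how DivNet applies its pseudocount internally is not shown in this thread.

```python
PSEUDOCOUNT = 0.5  # the value mentioned above for the DivNet manuscript analysis


def reads_per_kilobase(count, length_bp):
    """Length-normalized count: reads divided by gene length in kb."""
    return count / (length_bp / 1000)


def replace_zeros(values, pseudocount=PSEUDOCOUNT):
    """Assumed zero handling for illustration: zeros become the pseudocount."""
    return [v if v > 0 else pseudocount for v in values]


observed = reads_per_kilobase(2, 4000)  # gene seen twice in sample A
absent = reads_per_kilobase(0, 4000)    # same gene not seen in sample B

adjusted = replace_zeros([observed, absent])
print(adjusted)  # [0.5, 0.5]
```

Under that assumption, a gene observed twice and a gene never observed end up with the same value after normalization and zero replacement.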

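divnet_rpk_variance.R.txt https://github.com/adw96/DivNet/files/11871614/divnet_rpk_variance.R.txt

[image: alpha_div] https://user-images.githubusercontent.com/3172014/248871802-872a19b5-5489-4a7f-b19a-b05c684b083d.png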

(Not relevant to this discussion, but I work in a viral ecology lab, so I know some of your papers! Just a cool coincidence :smile:)

scubalaina commented 1 year ago

Hi Ryan,

Ah, I see! That makes sense! Thank you for taking the time to demonstrate. I really appreciate your help in understanding all this. I clearly needed to take more stats classes in grad school, haha.

I wonder how one could avoid compromising variance calculations without overestimating the abundance of longer genes if gene length isn't accounted for? I did notice when I ran DivNet on my normalized read counts that the differences in Shannon diversity were no longer significant, or at least the DivNet output had overlapping confidence intervals. I attached the example below of DivNet vs. a Wilcoxon test on vegan's Shannon diversity calculation. Should I be interpreting this as no difference between the diversity of these groups?

Sorry to take up more of your time! I really, really appreciate the help! Awesome you're in viral ecology! I think I saw you're in Eric Wommack's lab? Super cool!

Thanks again for your time and help! Alaina :)


mooreryan commented 1 year ago

> I wonder how one could avoid compromising variance calculations without overestimating the abundance of longer genes if gene length isn't accounted for?

^ Yeah, that's a good question...as far as I know that is still an open research question. Someone from the Willis lab will have to weigh in here.

> I attached the example below

^ I think you may have forgotten the attachment...I'm not seeing it.

(Yep in Wommack's lab...small world haha!)

scubalaina commented 1 year ago

Hi Ryan,

Sorry, I was corresponding via email, so the attachment probably didn't come through on GitHub. Here it is!

Screenshot 2023-06-26 at 5 52 35 PM