bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License
391 stars 191 forks source link

normalize the relative abundance for each bin #40

Open itiago opened 6 years ago

itiago commented 6 years ago

Hi I'm sorry, but could you please help me on this, so I have determined the relative abundance for each bin, but I wanted to normalize the values according to the overall metagenome, but I really not seeing the maths involved... Genomic bins SampleMetaGenMerged_R1_val bin.9 10.5060241613 bin.20 17.5038193854 bin.4 3.14227878024 bin.22 3.1807582388 bin.6 5.34662812579 bin.15 16.0117605048 bin.17 20.7550530435 bin.3 5.5673562303 bin.2 8.66532434566

samples reads

SampleMetaGenMerged_R1_val 23887914

I did 10.5060241613 * 100 / 23887914, but i gives me a very low number like 4E-05

Is this correct?

Thanks for any help Best Igor

ursky commented 6 years ago

There are a few ways you can approach this, depending on your applications. The abundances from Quant_bins are the bin depths, which are effectively the average read coverage of the bin in that sample. If your goal is to standardize the abundances to the library size in each sample, you could compute std_abund=big_number*abund/lib_size, where the big_number is a constant of your choice (the average of the library sizes, for example). There are statistical issues with this approach, however, particularly if some samples are more or less complex than others. The alternative approach is to standardize to the sum of the bin abundances in each sample std_abund=big_number*abund/tot_abund. I personally lean to this method (especially if I recover a good number of bins), but it really depends on your downstream applications, which is why I left it open-ended in the module.

ursky commented 6 years ago

I saw that you have only one sample, so maybe your goal is to estimate the bin representation as a function of reads? Then you could estimate the number of reads mapping to each bin: abund=bin_length*abundance/read_length. It really depends on individual applications. Hope that helps.

itiago commented 6 years ago

Hi Gherman Thanks for the fast reply. And yes you are correct, I just have one sample, and yes I wanted to determine the abundance for each bin. I'm sorry, but I think that all this comes from me not understanding what the values mean. So does the value bin.9 10.50602416 means really? The average coverage is 10 reads per contig on the bin? That does not make any sense, so is not that for sure... I have that problem with this values.

And when you gave me the equation: abund=bin_length*abundance/read_length.

And as I have these values from metawrap: bin completeness contamination GC lineage N50 size binner bin.9 78.26 0.784 0.449 Euryarchaeota 5598 1073567 binsA

For bin 9 it should be like: bin_lenght = 1073567 abundance = 10.50602416 read_lenght = 50 (read lenght of the metagenome library?)

I'm sorry about this, This is already out of metawrap purpose, but if you could help me on understanding this it would be great.

Many Thanks Best Igor

On Tue, Sep 11, 2018 at 5:29 PM, Gherman V. Uritskiy < notifications@github.com> wrote:

I saw that you have only one sample, so maybe your goal is to estimate the bin representation as a function of reads? Then you could estimate the number of reads mapping to each bin: abund=bin_length*abundance/ read_length. It really depends on individual applications. Hope that helps.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bxlab/metaWRAP/issues/40#issuecomment-420335700, or mute the thread https://github.com/notifications/unsubscribe-auth/AQAOmVqQQBYZLHlfqKw9x9FlHO60ZQ3Hks5uZ-UBgaJpZM4Wjn2S .

-- Igor Tiago Researcher Universidade de Coimbra

Laboratório Microbiologia Edificio Patronato Rua da Matemática Nº 49 3000-276 Coimbra, Portugal

ursky commented 6 years ago

No problem, Igor.

The bin.9 10.50602416 value means that on average, the contigs in bin.9 have a read coverage of ~10. So in your example, the total estimated number of reads mapping to your bin is 1073567*10.50602416/100 = 113,755 reads (I used 100bp for read length because I assumed it was a paired-end library). So of your total library size of 23,887,914 reads, at least 0.5% came from this organism. However, if you goal is to know the bin's "abundance", using the original read depth as a representation of the abundance (10.50) is probably more accurate. Because you may have only a part of the whole genome, read depth is a better metric. Just dont forget to standardize them if you start comparing samples. Hope that makes sense.

itiago commented 6 years ago

Gherman, sorry once again...

The first part I understood: The bin.9 10.50602416 value means that on average, the contigs in bin.9 have a read coverage of ~10. So in your example, the total estimated number of reads mapping to your bin is 1073567*10.50602416/100 = 113,755 reads (I used 100bp for read length because I assumed it was a paired-end library). So of your total library size of 23,887,914 reads, at least 0.5% came from this organism. So THANK SO MUCH , Finally I understand what is all about, but

you lost me on the last part.... However, if you goal is to know the bin's "abundance", using the original read depth as a representation of the abundance (10.50) is probably more accurate. Because you may have only a part of the whole genome, read depth is a better metric. Just dont forget to standardize them if you start comparing samples. Hope that makes sense

Where can I find the "original read depth" values? And how should I use them...

Once again, thank you for your availability to help

Igor

On Tue, Sep 11, 2018 at 5:53 PM, Gherman V. Uritskiy < notifications@github.com> wrote:

No problem, Igor.

The bin.9 10.50602416 value means that on average, the contigs in bin.9 have a read coverage of ~10. So in your example, the total estimated number of reads mapping to your bin is 1073567*10.50602416/100 = 113,755 reads (I used 100bp for read length because I assumed it was a paired-end library). So of your total library size of 23,887,914 reads, at least 0.5% came from this organism. However, if you goal is to know the bin's "abundance", using the original read depth as a representation of the abundance (10.50) is probably more accurate. Because you may have only a part of the whole genome, read depth is a better metric. Just dont forget to standardize them if you start comparing samples. Hope that makes sense.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bxlab/metaWRAP/issues/40#issuecomment-420343239, or mute the thread https://github.com/notifications/unsubscribe-auth/AQAOmfhjObXdjZAieZ29UF7d3Digx9UTks5uZ-qFgaJpZM4Wjn2S .

-- Igor Tiago Researcher Universidade de Coimbra

Laboratório Microbiologia Edificio Patronato Rua da Matemática Nº 49 3000-276 Coimbra, Portugal

ursky commented 6 years ago

Ok, so here's the issue. You calculated that 0.5% of the reads map back to bin.9. However you dont have the entire genome - just 78%. That means that in theory even more reads map to the genome, you just dont have it. On the other hand, the original bin read coverage (i.e. 10.50602416) should still be roughly the same even if you recovered 50% of the genome or 100% of the genome. So why not just use that as your abundance - this is the estimate read depth of the bin, and corresponds to its abundance in the sample.

itiago commented 6 years ago

Now is my English... So 0.5% is correct even if the genome is incomplete. I agree, it is probability. On the other hand If your are saying that 10.5 is to be used, the 10.5 is total coverage of the bin but related to what? To others bins? If so it is representing 5% of the total populations determined by binning, but not related to the all populations (since this maths does enter with the non binned populations). am I thinking correctly, was this what you meant? I have do dig on the "coverage" definition deeply!!

Thank you very much

A terça, 11/09/2018, 18:22, Gherman V. Uritskiy notifications@github.com escreveu:

Ok, so here's the issue. You calculated that 0.5% of the reads map back to bin.9. However you dont have the entire genome - just 78%. That means that in theory even more reads map to the genome, you just dont have it. On the other hand, the original bin read coverage (i.e. 10.50602416) should still be roughly the same even if you recovered 50% of the genome or 100% of the genome. So why not just use that as your abundance - this is the estimate read depth of the bin, and corresponds to its abundance in the sample.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bxlab/metaWRAP/issues/40#issuecomment-420352300, or mute the thread https://github.com/notifications/unsubscribe-auth/AQAOmTjMTXAoKCCsOiHTBA3lw_20yq5Bks5uZ_E-gaJpZM4Wjn2S .

ursky commented 6 years ago

To put it simply, if your goal is to relate the relative abundance of the bins to each other (which one is more abundant in each sample), then use the 10.5 value (if you have many samples, you will need to standardize this). If you need to determine the relative abundance of the bin in the whole community, use the 0.5% value (roughly what % of the reads came from the organism).

itiago commented 6 years ago

Gherman And you made it simple! Thank you for your patience and teaching! I get it now. All best Igor

On Wed, Sep 12, 2018 at 2:51 PM, Gherman V. Uritskiy < notifications@github.com> wrote:

To put it simply, if your goal is to relate the relative abundance of the bins to each other (which one is more abundant in each sample), then use the 10.5 value (if you have many samples, you will need to standardize this). If you need to determine the relative abundance of the bin in the whole community, use the 0.5% value (roughly what % of the reads came from the organism).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bxlab/metaWRAP/issues/40#issuecomment-420655581, or mute the thread https://github.com/notifications/unsubscribe-auth/AQAOmW5a2ry-Y3LBZcvKNSme6uKYm1bcks5uaRFFgaJpZM4Wjn2S .

-- Igor Tiago Researcher Universidade de Coimbra

Laboratório Microbiologia Edificio Patronato Rua da Matemática Nº 49 3000-276 Coimbra, Portugal

saad272 commented 1 year ago

To put it simply, if your goal is to relate the relative abundance of the bins to each other (which one is more abundant in each sample), then use the 10.5 value (if you have many samples, you will need to standardize this).

Hello @ursky If I have several samples how can I standardize the abundance value? if i use the 10.5 value !

Thank you