bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

the algorithm of MetaWRAP-Quant_bins #84

Open zhouyunyan opened 5 years ago

zhouyunyan commented 5 years ago

Dear developer: I can't understand how the contig and bin abundances are calculated. What does the coverage mean? Is it the read abundance, or the percentage of the whole genome that the contig accounts for? I sincerely hope for your help, thank you!

ursky commented 5 years ago

The abundance value is similar to TPM in RNA-seq. It represents the average read coverage of a bin per million reads. So it's really a standardized read-coverage estimate.

Salmon first produces coverage values for each contig. Then I calculate the likely coverage of the bin overall by taking the average, weighted by contig length. For example, if a bin had these contigs: 1000 bp at cov=10, 5000 bp at cov=5, 10000 bp at cov=11, then Ave_cov = (10*1000 + 5*5000 + 11*10000)/(1000+5000+10000) = 145000/16000 = 9.0625
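
The length-weighted average described above can be sketched in a few lines of Python (a sketch of the calculation, not metaWRAP's actual code; the contig lengths and coverages are the hypothetical values from the example):

```python
def bin_average_coverage(contigs):
    """Length-weighted average coverage of a bin.

    contigs: list of (length_bp, coverage) tuples, one per contig,
    with coverage as reported by Salmon. Weighting by length keeps
    small outlying contigs from skewing the bin-level estimate.
    """
    total_length = sum(length for length, _ in contigs)
    weighted_sum = sum(length * cov for length, cov in contigs)
    return weighted_sum / total_length

# The example from the comment: 1000bp@10x, 5000bp@5x, 10000bp@11x
print(bin_average_coverage([(1000, 10), (5000, 5), (10000, 11)]))  # 9.0625
```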

zhouyunyan commented 5 years ago

Oh, I see. I have three more questions.

  1. As far as I know, the commonly used standardized values in RNA-seq are TPM (Transcripts Per Million) and RPKM (Reads Per Kilobase Million), while in the Quant_bins module the bin abundances are standardized to 100 million reads per sample library before plotting. Is there any basis for choosing this value?

  2. In the Kraken module I have the HTML kronagram for each sample. If I want to compare two groups, what should I do? How can I get an abundance table across all samples, similar to the bin abundance table?

  3. If the host is pig, should the hg38 database be replaced with a pig genome database in the Read-QC module?

These are some problems I have encountered recently. I hope to get your help.

ursky commented 5 years ago
  1. I believe this was a bug in previous versions. MetaWRAP should not be re-standardizing the abundances anymore. If you are using the latest version and find otherwise, please send me the details so I can fix it.

  2. It isn't usually meaningful to compare taxonomy in this way, so metaWRAP does not produce such a file by default. However, you have the .kraken files, which you can use to produce similar information.

  3. You can. You have to cheat a little bit and rename your pig index with the hg38 prefix so metawrap thinks it's the human genome.

zhouyunyan commented 5 years ago

1. I still don't really understand. Do you mean the latest version no longer standardizes before plotting the heatmap? Can the values in abundance_table.tab be used to compare bin abundances between samples? 2. This is the first column in the .krona file. Can it be used as read abundance? The taxonomies differ between samples, so I find it very difficult to integrate them into one table similar to the bin abundance table.

ursky commented 5 years ago
  1. Yes, you can directly compare samples without further standardization. The abundances are already expressed per million reads.

  2. Yes, the first column is the number of reads with that corresponding taxonomy (given the module was run on reads). It is up to you, the user, to perform further downstream analysis. It is difficult to integrate into a table because taxonomy is not meant to be analyzed in that way.

zhouyunyan commented 5 years ago

1. Do you mean that in the latest version, abundance_table.tab is calculated by first taking the length-weighted average of each bin's contig abundances and then standardizing to 1 million reads per sample, i.e., that abundance_table.tab is already standardized? 2. If it isn't meaningful to compare taxonomy from Kraken's results, is that because sequencing depth differs between samples? Then what is the purpose of the Kraken module's taxonomy assignment?

tianchen2019 commented 5 years ago

> The abundance value is similar to TPM in RNA-seq. It represents the average read coverage of a bin per million reads. So it's really a standardized read-coverage estimate.
>
> Salmon first produces coverage values for each contig. Then I calculate the likely coverage of the bin overall by taking the average, weighted by contig length. For example, if a bin had these contigs: 1000 bp at cov=10, 5000 bp at cov=5, 10000 bp at cov=11, then Ave_cov = (10*1000 + 5*5000 + 11*10000)/(1000+5000+10000) = 145000/16000 = 9.0625

So, you first accounted for the lengths of the contigs using the method in your example, and then you accounted for sequencing depth. I got a table with two coverage values (the second coverage value is the first divided by the sum of the first coverage values of all bins in a sample). So the second coverage value accounts for sequencing depth?

tianchen2019 commented 5 years ago

> Dear developer: I can't understand how the contig and bin abundances are calculated. What does the coverage mean? Is it the read abundance, or the percentage of the whole genome that the contig accounts for? I sincerely hope for your help, thank you!

Where did you find this passage? The paper?

ursky commented 5 years ago

The passage comes from the GitHub documentation, which I believe has been updated. The lengths of contigs don't go into the coverage calculation; they are only used to weight the contigs when calculating the average coverage, so that outlying smaller contigs don't skew the results in a major way. And what table are you referring to?

ghost commented 3 years ago

What equation do you use to transform the raw data in bin_abundance_table.tab into the values shown in bin_abundance_heatmap.png? For example, how is a value of 0 in bin_abundance_table.tab transformed into a value of -4 in bin_abundance_heatmap.png?

ursky commented 3 years ago

I believe the script adds 0.0001 to each value so it can be visualized on a log scale. Feel free to plot the actual values however you want, though.
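
Assuming the heatmap values are simply log10 of the table values with a small pseudocount (as described above; the 0.0001 constant is taken from the comment, not verified against the script), the transform can be sketched as:

```python
import math

PSEUDOCOUNT = 0.0001  # assumed pseudocount, per the comment above

def heatmap_value(abundance):
    """Map a raw abundance from bin_abundance_table.tab to the
    log-scale value plotted in bin_abundance_heatmap.png (a sketch)."""
    return math.log10(abundance + PSEUDOCOUNT)

print(heatmap_value(0))  # -4.0, matching the 0 -> -4 example above
```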

ghost commented 3 years ago

Do you have any references you can share to justify the use of this algorithm? I got a reviewer's comment saying: "Expressing abundance as CPM has no meaning, as it depends on the amount of sequencing done. You need relative abundance."

ursky commented 3 years ago

CPM stands for "copies per million reads". In other words, for every 1,000,000 metagenomic reads, this is how many came from this genome. So it should already be normalized to the library size. That being said, if the reviewer is more old school and wants percentage abundances, I would provide those alongside the CPM values. The reason I avoid % abundances is that it is very difficult to decide what the denominator should be... Summing the abundances of all MAGs is shortsighted because it only accounts for the binned fraction. Summing the CPMs of all contigs is also problematic because of how short vs. long contigs are accounted for. There are at least half a dozen ways to do this calculation, and different people (and thus reviewers) swear by their method. It's a tough problem in microbiology in general; unfortunately, it's up to you to decide what statistics make sense for your study. Good luck!
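
To illustrate the denominator problem, here is a sketch converting per-MAG CPM values to percentages using the sum of binned MAG CPMs as the denominator. This is one possible choice among the several mentioned above, and as cautioned, it only describes the binned fraction of the community; the bin names and values are made up:

```python
def cpm_to_percent(cpm_by_mag):
    """Convert CPM values to % relative abundance, using the sum of
    all binned MAG CPMs in the sample as the denominator. Note this
    ignores the unbinned fraction of the metagenome entirely."""
    total = sum(cpm_by_mag.values())
    return {mag: 100.0 * cpm / total for mag, cpm in cpm_by_mag.items()}

sample = {"bin.1": 120.0, "bin.2": 60.0, "bin.3": 20.0}
print(cpm_to_percent(sample))  # bin.1: 60.0, bin.2: 30.0, bin.3: 10.0
```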

ursky commented 3 years ago

I'll also add that the affinity of researchers towards percentages stems from more old-school 16S amplicon sequencing experiments, where you know exactly how many molecules you sequenced and each molecule (ideally) corresponds to a single organism in the sample. Thus the percentage calculation is trivial. In metagenomic WGS this is not at all the case: organisms have different genome lengths, and not everything is "accounted for" in the assembly, let alone the bins.

ghost commented 3 years ago

Thanks! I will be explaining this at a lab meeting and want to make sure I explain it right:

(attached image: summary of the CPM calculation)

Is this accurate?

ursky commented 3 years ago

Looks like a solid summary!

ghost commented 3 years ago

Hi, this was the feedback: Salmon's coverage calculation (TPM) takes transcript length into consideration, so the contig coverage values are standardized by contig length in nucleotides?

ursky commented 3 years ago

I'm not sure what the problem is - yes, TPM is normalized for the gene length, only here the "gene" is the contig. Can you re-state your question?
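
For reference, the standard TPM calculation with contigs in place of genes can be sketched as follows (hypothetical read counts and contig lengths; this is the textbook TPM formula, not Salmon's internal implementation, which also uses effective lengths and handles multi-mapping):

```python
def tpm(read_counts, lengths_bp):
    """Transcripts-per-million with contigs as the 'genes':
    first normalize each count by contig length (reads per kilobase),
    then rescale so the values across the sample sum to 1,000,000."""
    reads_per_kb = [c / (l / 1000.0) for c, l in zip(read_counts, lengths_bp)]
    scale = 1e6 / sum(reads_per_kb)
    return [r * scale for r in reads_per_kb]

values = tpm([100, 500, 1100], [1000, 5000, 10000])
print(values)      # length-normalized; contigs 1 and 2 tie despite different counts
print(sum(values)) # sums to ~1,000,000 by construction
```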

quliping commented 9 months ago

> The abundance value is similar to TPM in RNA-seq. It represents the average read coverage of a bin per million reads. So it's really a standardized read-coverage estimate.
>
> Salmon first produces coverage values for each contig. Then I calculate the likely coverage of the bin overall by taking the average, weighted by contig length. For example, if a bin had these contigs: 1000 bp at cov=10, 5000 bp at cov=5, 10000 bp at cov=11, then Ave_cov = (10*1000 + 5*5000 + 11*10000)/(1000+5000+10000) = 145000/16000 = 9.0625

In the quant_bins module of metawrap, I found some problems... The script split_salmon_out_into_bins.py is used to summarize the TPM results for metawrap's MAGs, right? However, the calculation in the script seems totally different from what you described: it looks like the script just takes the median of a list of TPM values? I wrote my detailed comments after "##" in the file compare.txt.

Besides, I think both TPM and Ave_cov in metawrap are just the relative abundance of a MAG among all MAGs, or of a contig among all contigs, in an assembly, right? What should I do if I want to compare the abundance of one or multiple MAGs across samples, but those MAGs are only part of all the MAGs retrieved from these samples, or were even obtained from other unrelated samples? For example, I have 12 genomes (12 different species) of a genus; some were retrieved from my 80 samples, some are reference genomes. I want to know the abundance of the genus in the 80 samples. TPM seems inappropriate because I would get 80 values of '1,000,000'... I can only compare the relative abundances of the 12 species within the 80 samples, not the abundance of the entire genus across the 80 samples. MetaWRAP's CPM also seems inappropriate because it is very similar to TPM.
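
To see how much the choice of summary statistic matters, here is a quick comparison of the median versus the length-weighted mean on the hypothetical contig coverages from the example earlier in the thread (illustration only; whether the script actually uses a median is the open question above):

```python
import statistics

# (length_bp, coverage) for each contig, from the worked example above
contigs = [(1000, 10.0), (5000, 5.0), (10000, 11.0)]

# Length-weighted mean, as described in the maintainer's comment
weighted_mean = sum(l * c for l, c in contigs) / sum(l for l, _ in contigs)

# Plain median of the per-contig coverages, ignoring contig lengths
median = statistics.median(c for _, c in contigs)

print(weighted_mean)  # 9.0625
print(median)         # 10.0 -- a different answer for the same bin
```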

g-raymo commented 5 days ago

> The abundance value is similar to TPM in RNA-seq. It represents the average read coverage of a bin per million reads. So it's really a standardized read-coverage estimate. Salmon first produces coverage values for each contig. Then I calculate the likely coverage of the bin overall by taking the average, weighted by contig length. For example, if a bin had these contigs: 1000 bp at cov=10, 5000 bp at cov=5, 10000 bp at cov=11, then Ave_cov = (10*1000 + 5*5000 + 11*10000)/(1000+5000+10000) = 145000/16000 = 9.0625
>
> In the quant_bins module of metawrap, I found some problems... The script split_salmon_out_into_bins.py is used to summarize the TPM results for metawrap's MAGs, right? However, the calculation in the script seems totally different from what you described: it looks like the script just takes the median of a list of TPM values? I wrote my detailed comments after "##" in the file compare.txt.
>
> Besides, I think both TPM and Ave_cov in metawrap are just the relative abundance of a MAG among all MAGs, or of a contig among all contigs, in an assembly, right? What should I do if I want to compare the abundance of one or multiple MAGs across samples, but those MAGs are only part of all the MAGs retrieved from these samples, or were even obtained from other unrelated samples? For example, I have 12 genomes (12 different species) of a genus; some were retrieved from my 80 samples, some are reference genomes. I want to know the abundance of the genus in the 80 samples. TPM seems inappropriate because I would get 80 values of '1,000,000'... I can only compare the relative abundances of the 12 species within the 80 samples, not the abundance of the entire genus across the 80 samples. MetaWRAP's CPM also seems inappropriate because it is very similar to TPM.

Are there any updates on this? Is there another approach to calculating bin abundance?