bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License
396 stars 190 forks source link

mapping reads back to specific genes #350

Open DeaconOfBiology opened 3 years ago

DeaconOfBiology commented 3 years ago

I have a question that is likely a little outside of your scope for help. So, if there is another resource I should consult, could you please point me in that direction?

I used metawrap to bin assemblies that I created prior via metaSPAdes. So, I have 22 metagenomes from 11 different hot springs (11 sediment and 11 fluid samples). I assembled them de novo for a total of 22 different assemblies. I then concatenated all of them together into one big assembly file and then used metawrap to construct the bins. This co-binning technique gave me 400+ bins with ≥ 80 % completeness and less than < 5 % contamination. Thats after the binning and bin refining modules. Then I used the quant module to predict their abundance across the sample sites. Next, I annotated them with Prokka. Then identified specific redox genes of interest in these MAGs, copied the sequences of these specific genes, and then mapped the reads from all metagenomes to these specific gene sequences to test whether a gene was present when its MAG was present. The problem is I found that the genes were only present within a metagenome 60% of the time that a MAG was present.

Does this mean that my Co-binning approach created erroneous MAG; that MAGs have these genes in some sites (metagenomes) but not others; that these MAGs do have these genes, but the reads may not have been recovered; or something else altogether that Im missing? Thanks for any help/advice you can give me

ursky commented 3 years ago

A couple of thoughts. First, concatenating contigs from multiple metagenomic assemblies is not a good idea. Any genomes that were present in several sample will be present in multiples in the final assembly, which will present a major problem for binning... I don't have the names off the top of my head, but there are a couple software that will "dereplicate" contig sequences when combining assemblies to prevent duplicates. I usually avoid this problem by co-assembling whenever possible. Another solution is to bin each assembly individually, and then merge the bin sets into one while also de-replicating duplicate MAGs (perhaps keeping the best one of each species). DRep comes to mind for that.

As for you not finding the expected genes in the reads of the samples where their MAG genomes were present, you will have to dig deeper to find out since all your suggested explanations could be possible. My intuition tells me those 40% of samples were likely those with lower abundances of your MAGs of interest. Searching in the reads of the samples is in fact the most sensitive thing you could have done, but it's still possible to miss some sequences. For example, if the coverage of the MAG was ~0.5x in a sample, there is in fact a <50% chance that a given gene will have a read aligning to it. So I would limit my search to samples the MAGs were present in reasonably high numbers. Keep in mind that abundance estimates are prone to being a little over-inflated due to regions of homology with closely related MAGs.

DeaconOfBiology commented 3 years ago

Thanks so much for your incite! That helped a lot. I put the abundance cut off at ≥ 1 and that gave me 98% instead of the previous. You stated, "For example, if the coverage of the MAG was ~1x in a sample, there is in fact a <50% chance that a given gene will have a read aligning to it." Is there a paper or other source I could reference this from? Again, thanks a million!

ursky commented 3 years ago

Sorry, i meant if you have 0.5x coverage you have <50% chance of catching a catching given sequence. If you have 1x then your odds are around ~80% if I remember correctly. If reads were perfectly lined up end to end then 1x coverage would give you a 100% chance of catching it, but remember the chance of overlapping reads increases with more coverage so you actually have quite a bit less breadth of coverage (the formal term). So catching 98% of genes with a minimum coverage of 1x is actually surprisingly good, so perhaps the 1x coverage is a bit of an underestimation.

There is a formal formula to calculate the breadth of coverage of depth of coverage in old-school WGS sequencing manuals but I can't find it at the moment. (I believe its a derivation of 𝑓(𝑥)=1−𝑒−𝑘𝑥 but I can't remember for sure). If you can code a bit you could always make a sequencing simulation to convince yourself of this distribution :)

DeaconOfBiology commented 3 years ago

Thanks! That was a lot of help