AnantharamanLab / vRhyme

Binning Virus Genomes from Metagenomes
GNU General Public License v3.0

No bins created-- did we do something wrong or are there really no bins? #37

Open rikander opened 1 week ago

rikander commented 1 week ago

Hi Kris (and Karthik),

My student is trying to bin contigs identified by VirSorter using vRhyme. He is using as input the contigs identified by VirSorter as viral, as well as sorted bam files that were mapped to the entire set of contigs (including the ones that VirSorter did not flag as viral). (For reference, we had approximately 2-3 million contigs for each sample, but VirSorter identified about 10,000-15,000 contigs as viral per sample. These samples were microbial metagenomes.) He ran vRhyme for 48 samples and vRhyme completed without errors, but it tells us we have no bins for any of the samples. Is this realistic, or did something likely go wrong? If so, do you have any thoughts?

Thanks! -Rika

KrisKieft commented 5 days ago

Hi,

That sounds a bit off. Did you take the set of 10-15k contigs and do one binning run using all 48 samples, or 48 binning runs with 1 sample each? The former (1 run, 48 samples, a set of dereplicated contigs) is the correct usage. Were any parameters changed from their default settings?
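For reference, the intended multi-sample usage looks roughly like this (a sketch; the exact flag names are from my memory of the README, so double-check `vRhyme --help` before running):

```shell
# One binning run: one dereplicated viral contig set, with coverage from all 48 samples.
# -i: viral contigs (fasta)
# -b: one sorted/indexed bam per sample, all mapped against that same fasta
vRhyme -i viral_contigs.derep.fasta \
       -b sample_01.sorted.bam sample_02.sorted.bam sample_03.sorted.bam \
       -o vRhyme_results/ \
       -t 16
```

Each bam contributes one coverage value per contig, which is the coverage signal vRhyme uses alongside sequence features.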

Kris

rikander commented 4 days ago

Hi Kris,

OK, so to clarify: we didn't do a co-assembly (that would break our server), so we have 48 separate sets of contigs. Should we combine those contigs into one dereplicated combined fasta file (which would contain something like 500k contigs) and do a vRhyme run on that? For the bam files then, should we map the reads of each sample against that combined fasta file?

We used default settings for all the vRhyme runs.

Thanks! Rika

KrisKieft commented 4 days ago

It's possible that using 1 sample (1 coverage value) per contig didn't give vRhyme enough information to bin; it uses coverage and sequence features semi-equally. I've gotten 1 sample to work before, but certainly not with the same quality of results. My suggestion is to dereplicate your 500k viral contigs and use the dereplicated set as the contig input. Yes, then map the reads of each sample against it. There are a couple of ways to do that: vRhyme can handle the dereplication itself (it uses a general method similar to what dRep uses), or you can dereplicate beforehand. Then you can either have vRhyme do the mapping by just inputting the fastq files (select either BWA or Bowtie2), or you can map yourself and input the bam files.
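Put together, that workflow might look something like this (a sketch assuming Bowtie2 and samtools for the mapping step; the vRhyme flags are from my memory of the README, so verify against `vRhyme --help`; filenames are placeholders):

```shell
# 1) Combine the per-sample viral contigs into one file.
#    Note: plain cat does NOT dereplicate -- either let vRhyme handle
#    dereplication or run a dRep-like step on this file first.
cat sample_*/viral_contigs.fasta > all_viral_contigs.fasta

# 2) Map each sample's reads against the combined set.
bowtie2-build all_viral_contigs.fasta all_viral_contigs
for s in sample_01 sample_02; do   # ...repeat for all 48 samples
    bowtie2 -x all_viral_contigs -1 ${s}_R1.fastq.gz -2 ${s}_R2.fastq.gz -p 16 \
        | samtools sort -o ${s}.sorted.bam -
    samtools index ${s}.sorted.bam
done

# 3) One vRhyme run with all the bams (alternatively, give vRhyme the
#    fastq files directly and let it run BWA or Bowtie2 itself).
vRhyme -i all_viral_contigs.fasta -b sample_*.sorted.bam -o vRhyme_results/ -t 16
```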

This complicates things if you wanted vMAGs per sample to compare, because at the end of binning you'd have combined vMAGs based on the dereplicated/combined set. For this, vRhyme will generate a coverage file so you can assess coverage per contig per sample. However, as you know, each of your samples individually won't have the whole picture anyway due to variance in metagenome sequencing/assembly.

I hope that answers your question. The main takeaway is that vRhyme and other coverage-based tools often rely on >1 sample to bin accurately even though they tend to let you input 1.

rikander commented 3 days ago

Hi Kris,

OK, thanks! The first time we did it, we did have multiple coverage values for each sample (i.e., bam files for sample 1's contigs with reads mapped from sample 2, sample 3, sample 4, and so on), but we still found no bins in any of the samples. We'll give this a try anyway so we have more contigs to work with in the single binning run: we'll combine all the assembled contigs together and make new bam files. We'll see how it goes.

Thanks, Rika