RasmussenLab / phamb

Downstream processing of VAMB binning for Viral Elucidation
MIT License

High number of bacterial genes in phamb assembled bins #47

Closed ShailNair closed 11 months ago

ShailNair commented 1 year ago

Hi,

I ran phamb using the recommended workflow (not in parallel) with the default settings on my assembled metagenomic contigs (a mix of all microbial contigs). I then ran CheckV (with the prodigal -m option enabled) on the concatenated FASTA file. Strangely, CheckV reported that many of the bins contain a high number of host (bacterial) genes, accounting for more than 50% of the total gene count (more than 70% for many contigs). Surprisingly, CheckV also indicates that many of these bins are complete and free of contamination. However, such a large number of host genes will interfere with downstream analysis. I have attached my CheckV results for your reference: quality_summary.txt
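For anyone wanting to reproduce the check described above, here is a minimal sketch that flags entries whose host genes make up more than half of all genes. It assumes the quality summary is tab-separated and has `contig_id`, `gene_count` and `host_genes` columns, as CheckV's quality summary normally does; the example rows and the 0.5 threshold are made up for illustration.

```python
import csv
import io


def host_gene_fraction(row):
    """Return host_genes / gene_count for one row, or None if there are no genes."""
    gene_count = int(row["gene_count"])
    if gene_count == 0:
        return None
    return int(row["host_genes"]) / gene_count


def flag_host_rich(handle, threshold=0.5):
    """Return (contig_id, fraction) pairs whose host-gene fraction exceeds threshold."""
    flagged = []
    for row in csv.DictReader(handle, delimiter="\t"):
        fraction = host_gene_fraction(row)
        if fraction is not None and fraction > threshold:
            flagged.append((row["contig_id"], fraction))
    return flagged


# Hypothetical rows, not taken from the attached file.
example = (
    "contig_id\tgene_count\thost_genes\n"
    "bin_1\t10\t7\n"
    "bin_2\t20\t2\n"
    "bin_3\t0\t0\n"
)
print(flag_host_rich(io.StringIO(example)))
```

To run it on a real file, pass an open file handle (for example `open("quality_summary.txt", newline="")`) instead of the `io.StringIO` example.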

joacjo commented 1 year ago

Hi Shail

The purpose of running CheckV is to identify reliable viral bins, so we recommend considering only Medium- and High-quality bins (those based on the AAI model) for further analysis. Low-quality bins should always be treated with a lot of scepticism.
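A rough sketch of that filter, assuming the tab-separated CheckV quality summary has `checkv_quality` and `completeness_method` columns and that AAI-based estimates are labelled with a value starting with `AAI-based`. The example rows are invented, and the "Complete" tier is left out only because it is not named above; add it to `KEEP_QUALITY` if you want it.

```python
import csv
import io

# Quality tiers to keep, per the recommendation above (an assumption about
# how the tiers are spelled in the checkv_quality column).
KEEP_QUALITY = {"Medium-quality", "High-quality"}


def reliable_bins(handle):
    """Return contig_ids that are Medium/High-quality with an AAI-based estimate."""
    return [
        row["contig_id"]
        for row in csv.DictReader(handle, delimiter="\t")
        if row["checkv_quality"] in KEEP_QUALITY
        and row["completeness_method"].startswith("AAI-based")
    ]


# Hypothetical rows for illustration only.
example = (
    "contig_id\tcheckv_quality\tcompleteness_method\n"
    "bin_1\tHigh-quality\tAAI-based (high-confidence)\n"
    "bin_2\tMedium-quality\tHMM-based (lower-bound)\n"
    "bin_3\tLow-quality\tAAI-based (medium-confidence)\n"
    "bin_4\tMedium-quality\tAAI-based (medium-confidence)\n"
)
print(reliable_bins(io.StringIO(example)))
```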

As for the "Complete" bins, I checked your quality file, which had 4 examples, and in those I see that CheckV removed a lot of sequence. Perhaps you want to evaluate the cleaned sequences that CheckV produced.

Best, Joachim

ShailNair commented 1 year ago

Thank you for your prompt response. Yes, only medium- and high-quality bins will be used for further analysis. I'll re-run CheckV on the cleaned FASTA files (the CheckV output from the first run) and let you know how it goes.

ShailNair commented 1 year ago

I ran CheckV again and found that the high number of bacterial genes was mostly in the medium- and high-quality bins whose completeness was estimated by CheckV's HMM-based (lower-bound) model.
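One way to tabulate this kind of observation is to count host-rich medium- and high-quality bins per completeness method. This is only a sketch: it assumes a tab-separated CheckV quality summary with `gene_count`, `host_genes`, `checkv_quality` and `completeness_method` columns, and the rows and the 0.5 threshold below are invented for illustration.

```python
import csv
import io
from collections import Counter

QUALITY_TIERS = {"Medium-quality", "High-quality"}


def host_rich_by_method(handle, threshold=0.5):
    """Count medium/high-quality entries with a host-gene fraction above
    threshold, grouped by the completeness_method column."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        gene_count = int(row["gene_count"])
        if row["checkv_quality"] not in QUALITY_TIERS or gene_count == 0:
            continue
        if int(row["host_genes"]) / gene_count > threshold:
            counts[row["completeness_method"]] += 1
    return counts


# Hypothetical rows, not real results.
example = (
    "contig_id\tgene_count\thost_genes\tcheckv_quality\tcompleteness_method\n"
    "bin_1\t10\t7\tHigh-quality\tHMM-based (lower-bound)\n"
    "bin_2\t10\t6\tMedium-quality\tHMM-based (lower-bound)\n"
    "bin_3\t10\t8\tMedium-quality\tAAI-based (high-confidence)\n"
    "bin_4\t10\t1\tHigh-quality\tAAI-based (high-confidence)\n"
    "bin_5\t10\t9\tLow-quality\tHMM-based (lower-bound)\n"
)
for method, n in host_rich_by_method(io.StringIO(example)).most_common():
    print(method, n)
```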