Hi Jiulong
Thank you for the kind words! I will address your questions one by one.
If you check out this script: https://github.com/RasmussenLab/phamb/blob/master/workflows/mag_annotation/scripts/run_RF.py you will see that the function that concatenates the contigs of a viral-like bin (write_concat_bins) takes a minimum bin-size argument, which is set to 5000 bp by default. Only bins with a total size of >=5000 bp are therefore written to the .fna files, which explains the discrepancy in numbers. If you are looking for micro viruses, you can lower this argument to e.g. 2000 so that smaller bins are written to the .fna files as well.
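To illustrate the effect of the size cutoff, here is a minimal sketch. It is not the actual code from run_RF.py: the function name `concat_bins_above_size`, its `min_size` parameter and the toy bins are made up for this example. It only shows why a bin whose contigs sum to less than the cutoff never ends up in the output.

```python
from typing import Dict, List, Tuple


def concat_bins_above_size(
    bins: Dict[str, List[Tuple[str, str]]], min_size: int = 5000
) -> Dict[str, str]:
    """Concatenate the contigs of each bin and keep bins of at least min_size bp.

    Hypothetical helper for illustration only; not phamb's write_concat_bins.
    """
    kept = {}
    for bin_name, contigs in bins.items():
        # Join the contig sequences in the order given (no gap characters).
        sequence = "".join(seq for _contig_name, seq in contigs)
        if len(sequence) >= min_size:
            kept[bin_name] = sequence
    return kept


# Toy bins: (contig name, sequence) pairs.
bins = {
    "bin_1": [("c1", "A" * 3000), ("c2", "C" * 2500)],  # 5500 bp in total
    "bin_2": [("c3", "G" * 2200)],                        # 2200 bp in total
}

default_kept = concat_bins_above_size(bins)                 # 5000 bp cutoff
relaxed_kept = concat_bins_above_size(bins, min_size=2000)  # lowered cutoff
```

With the default cutoff only `bin_1` is kept; lowering the cutoff to 2000 keeps `bin_2` as well.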
Like with bacterial MAGs, the sequential order of the contigs in a viral bin/viral MAG is not known, unless you find a reference genome to guide how to put the contig puzzle together, and even then mosaicism in viruses may hinder this effort. The contigs are, by choice, not joined by gap characters like "XXX" or some other accepted DNA character in FASTA files, as that might mess up the viral evaluation machinery of CheckV.
I am glad you also were surprised to find some viral bins annotated as "High-quality" even though they contain numerous host genes. So were we when we evaluated viral MAGs with CheckV. If you look closely at your table, those viral bins were evaluated by CheckV's HMM model, which is only benchmarked and evaluated on single-contig viruses and is not tailored for viral MAGs. In the manuscript we have addressed these predictions and recommend that they not be taken at face value and be discarded; instead, researchers should focus on the AAI-based predictions, which are closer to a whole-genome alignment. That does not mean that all HMM-based predictions are wrong, though; they are just based on viral markers that work best with single-contig viruses.
The last two rows in your table look very much like giant viruses and were predicted by the AAI model.
I hope you find this information helpful.
Best, Joachim
Hi @joacjo, I really appreciate your kind and patient reply!
Thanks for your reply! Best, Jiulong
Hi Jiulong
Gene annotation could potentially be affected by the random order, even though I think it's quite rare and I have not seen a benchmark paper on this. For example, if the last gene on Contig 1 is partial and has no stop codon, it can get connected to the first gene on Contig 2 if that one only has a stop codon, thus creating an artificial complete gene. So in the worst-case scenario, you might make one artificial gene for every 2 contigs in your bin, if you're very unlucky.
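A toy sketch of that artifact, assuming a deliberately naive forward-strand ORF scan (ATG to the first in-frame stop codon). The sequences and the `complete_orfs` helper are made up for illustration; real gene callers are far more sophisticated.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def complete_orfs(seq: str) -> list:
    """Return forward-strand ORFs that have both an ATG start and an in-frame stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate gene
            elif start is not None and codon in STOP_CODONS:
                orfs.append(seq[start:i + 3])  # close it at the first stop
                start = None
    return orfs


contig_1 = "ATGAAACCC"  # ends with a partial gene: start codon, no stop
contig_2 = "GGGTTTTAA"  # begins with a partial gene: stop codon, no start

separate = complete_orfs(contig_1) + complete_orfs(contig_2)
concatenated = complete_orfs(contig_1 + contig_2)
```

Neither contig contains a complete ORF on its own, but the concatenation yields one spanning the junction. Note that this only happens here because the partial gene on `contig_1` happens to end in frame with the one on `contig_2`, which is part of why the artifact should be rare.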
I would include both high-confidence and medium-confidence predictions for High-quality AAI-predicted bins. Regarding Medium-quality viral bins, I would probably get a third-party prediction on those to increase the confidence in their "viralness", for instance by running VirFinder or VIBRANT as well.
That is a good question. I am sure the VAMB binner also bins NCLDV MAGs, since we have found several huge viruses in our benchmarks. The more relevant question is how those can be confidently identified.
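As a rough sketch of that selection: the column names (`contig_id`, `checkv_quality`, `completeness_method`) and the label strings below are assumptions about what the CheckV quality summary table looks like, and the rows are made up, so please check them against your own file before relying on this.

```python
import csv
import io


def select_confident_bins(
    tsv_text: str,
    qualities=("High-quality",),
    confidences=("high-confidence", "medium-confidence"),
) -> list:
    """Return IDs of bins with an accepted quality tier and an AAI-based estimate.

    Column names and label strings are assumed, not taken from CheckV's docs.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        method = row["completeness_method"]
        if (
            row["checkv_quality"] in qualities
            and method.startswith("AAI-based")
            and any(level in method for level in confidences)
        ):
            selected.append(row["contig_id"])
    return selected


# Made-up example rows, not real results.
example_tsv = (
    "contig_id\tcheckv_quality\tcompleteness_method\n"
    "bin_1\tHigh-quality\tAAI-based (high-confidence)\n"
    "bin_2\tHigh-quality\tHMM-based (lower-bound)\n"
    "bin_3\tHigh-quality\tAAI-based (medium-confidence)\n"
    "bin_4\tMedium-quality\tAAI-based (high-confidence)\n"
)

selected = select_confident_bins(example_tsv)
```

Here `bin_2` is dropped because its estimate is HMM-based, and `bin_4` is dropped because it is only Medium-quality and would need the extra third-party check.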
Best, Joachim
Hi Joachim,
Best, Jiulong
Hi Jiulong
Regarding (1) - Yes, writing the contigs of your viral bins of interest into separate fasta files is the most straightforward and safest way for you to do gene annotations and, as you say, you then have your viral bins in a format suitable for dereplication with tools like dRep 👍
Hope the information was helpful to you. I will close the issue; feel free to open another if other questions or suggestions arise.
Best, Joachim
- If you obtained your NCLDV MAGs with either MetaBAT2 or VAMB, there is no order to the contigs; they were simply grouped together. If you want to dereplicate them together with your new viral bins, you could dereplicate them on a MAG level with something like https://github.com/MrOlm/drep. I believe it takes fasta files as input, that is, one fasta file for each bin with the contigs as separate fasta entries.
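A minimal sketch of writing bins in that layout, one fasta file per bin with each contig as its own entry. The helper name `write_bin_fastas`, the `.fna` suffix and the 60-character line width are choices made for this example, not something phamb or dRep prescribes.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def write_bin_fastas(
    bins: Dict[str, List[Tuple[str, str]]], out_dir: str, width: int = 60
) -> List[Path]:
    """Write one FASTA file per bin; contigs stay separate entries (no concatenation)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for bin_name, contigs in bins.items():
        fasta = out_path / f"{bin_name}.fna"
        with fasta.open("w") as handle:
            for contig_name, seq in contigs:
                handle.write(f">{contig_name}\n")
                # Wrap the sequence at a fixed line width.
                for i in range(0, len(seq), width):
                    handle.write(seq[i:i + width] + "\n")
        written.append(fasta)
    return written
```

For example, `write_bin_fastas({"bin_1": [("c1", "ACGT" * 20), ("c2", "GG")]}, "bins_out")` would create `bins_out/bin_1.fna` with two entries, `>c1` and `>c2`.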
Hi,
I have some questions and would like to consult with you. I apologize for my lack of experience in metagenomic analysis, which leads me to question the practice of directly concatenating sequences within the same "bin" as the output file. I would like to understand the purpose behind this approach and how to interpret and use the output results from phamb. What is the typical purpose of using phamb?
I thought that the "binning" step clusters similar contigs (perhaps to select a representative sequence for redundancy reduction, or simply to understand the clustering patterns of sequences) and potentially further assembles them to obtain more complete sequences. However, if all contigs from the same bin are randomly concatenated, it seems that neither of the aforementioned purposes of binning is achieved.
So, if my goal is to assemble and acquire viral sequences from a metagenomic sequencing dataset as comprehensively as possible, should I not use the fasta file output from phamb? Instead, should I run virus prediction or completeness assessment software (e.g., CheckV) on all sequences within each viral bin predicted by phamb, and then extract each viral contig sequence separately? In this case, I'm not sure when and how to make use of the results generated by phamb/vamb. To be more specific, even though the results from vamb indicate clustering, it seems that my downstream analysis still needs to be performed on individual sequences rather than clusters. Does this mean that the concatenated sequences output by phamb also cannot be considered a complete or fragmented genome sequence for use in downstream analysis?
If I could receive your answer, I would greatly appreciate it.
Hi actledge
The binning step with VAMB clusters contigs likely originating from the same genome (not similar contigs).
Therefore, the Phamb workflow helps you in some cases recover more complete virus genomes as vMAGs, compared to evaluating single contigs only. The suggested workflow is to run the binning step to find putative vMAGs, run CheckV and select bona fide viruses (i.e. MQ, HQ).
Hope this helps!
Hi joacjo,
Thank you very much for your response, it has been really helpful to me. But I still have a slight confusion because I am relatively new to metagenomic analysis. The contig order in vMAGs is indeed shuffled, although it seems that the shuffled order may not significantly affect gene annotation and completeness assessment. But does this also mean that the vMAGs obtained through such a binning approach cannot be considered a "genome" sequence and cannot be uploaded to public databases (such as NCBI) as a draft or complete genome? Or is it generally accepted to have shuffled order for vMAGs obtained from metagenomic sequencing?
Or is it because this is a strategy for virus research? Because as far as I understand, most of the current analysis software for phages or viruses defaults to treating a single sequence as a virus. Is it because of this reason that all sequences are concatenated together to form a single vMAG, instead of treating them as different "contigs" from the same genome like other species, and putting them in the same fasta file to indicate that they originate from the same genome?
Thanks!
Hi actledge
Ah! If the purpose of your research is to identify and upload new and complete virus genomes to NCBI, then I would not recommend phamb, which is more oriented towards virus research. For this purpose, I would advise you to run geNomad followed by CheckV on the individual contigs.
There are some recommended and strict guidelines for submitting new and complete viruses to NCBI, and they do not currently cover vMAGs. Check out this paper: https://www.nature.com/articles/s41587-023-01844-2
Hi developers! Thanks for your contribution to the study field of viral ecology! Recently, I used the PHAMB tool to identify viral bins from my bulk metagenomes, and I had some questions about the output results.
Thanks for your attention and patience! Looking forward to your reply! Jiulong