Result assessment! - Githubissues

Jiulong-Zhao commented 2 years ago

Hi developers! Thanks for your contribution to the field of viral ecology! I recently used the PHAMB tool to identify viral bins from my bulk metagenomes, and I have some questions about the output.

1. The result files include vambbins_RF_predictions.txt and vamb_bins/vamb_bins.1.fna. In vambbins_RF_predictions.txt, hundreds of thousands of bins are labeled as "viral", whereas vamb_bins/vamb_bins.1.fna contains only tens of thousands of viral bin sequences. So, which one is the real result?
2. The input files were the assembled contig sequences and the cluster information from VAMB. So why are there no gaps in the bin sequences in vamb_bins/vamb_bins.1.fna? For each viral bin, were multiple contig sequences joined without gaps? How was their sequential order determined?
3. After obtaining the viral bins, I ran CheckV on vamb_bins/vamb_bins.1.fna to evaluate viral genome quality. More than 500 viral bins were classified as high-quality viral genomes. That's great! However, more than ten viral bins have a genome length greater than 400 kbp, and the longest is more than 600 kbp. Could these viral bins belong to giant viruses? And why did CheckV consider them high-quality viral genomes when they contain a high proportion of host genes and a low proportion of viral genes (see figure below)?

Thanks for your attention and patience! Looking forward to your reply! Jiulong

joacjo commented 2 years ago

Hi Jiulong

Thank you for the kind words! I will address your questions one by one.

1. If you check out this script: https://github.com/RasmussenLab/phamb/blob/master/workflows/mag_annotation/scripts/run_RF.py you will see a minimum bin-size argument to the function (write_concat_bins) that concatenates the contigs of a viral-like bin. By default it is set to 5000 bp, so only bins with a size of >= 5000 bp are written to the .fna files. That explains the discrepancy in numbers. You can change this argument to e.g. 2000 to have smaller bins written to the .fna files, if you are looking for micro viruses.
2. As with bacterial MAGs, the sequential order of the contigs in a viral bin/viral MAG is not known, unless you find a reference genome to guide how to put the genome/contig puzzle together, and even then mosaicism in viruses may hinder the effort. The contigs are, by choice, not connected by spacers like "XXX" or some other accepted DNA character in fasta files, as that might interfere with CheckV's viral evaluation machinery.
3. I am glad you were also surprised to find some viral bins annotated as "High-quality" even though they contain numerous host genes. So were we when we evaluated viral MAGs with CheckV. If you look closely at your table, those viral bins were evaluated by CheckV's HMM-based model, which is only benchmarked and evaluated on single-contig viruses and is not tailored for viral MAGs. In the manuscript we address these predictions and recommend that they not be taken at face value and be discarded. Instead, researchers should focus on AAI-based predictions, which are closer to whole-genome alignment. That does not mean that all HMM-based predictions are wrong; they are just based on viral markers that work best with single-contig viruses.

The last two rows in your table look very much like giant viruses and were predicted by the AAI-based model.
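The size filter and gap-free joining described in the first two answers can be sketched roughly as follows. This is a minimal illustration, not the actual write_concat_bins code from run_RF.py: the function name concat_bins, its arguments, and the toy sequences are all hypothetical, and only the 5000 bp default and the "no spacer" behaviour are taken from the answer above.

```python
from collections import defaultdict


def concat_bins(contigs, clusters, min_bin_size=5000):
    """Join the contigs of each bin end to end and drop bins below min_bin_size.

    contigs:  dict mapping contig name -> sequence
    clusters: dict mapping contig name -> bin name
    (Hypothetical helper; the real PHAMB function may differ.)
    """
    bins = defaultdict(list)
    for contig_name, bin_name in clusters.items():
        bins[bin_name].append(contigs[contig_name])

    concatenated = {}
    for bin_name, seqs in bins.items():
        # Contigs are joined in input order with no spacer characters;
        # that order carries no biological meaning.
        joined = "".join(seqs)
        if len(joined) >= min_bin_size:
            concatenated[bin_name] = joined
    return concatenated


# Toy data: bin1 totals 5500 bp, bin2 totals 2000 bp.
contigs = {"c1": "A" * 3000, "c2": "C" * 2500, "c3": "G" * 2000}
clusters = {"c1": "bin1", "c2": "bin1", "c3": "bin2"}

default_bins = concat_bins(contigs, clusters)
small_bins = concat_bins(contigs, clusters, min_bin_size=2000)
print(sorted(default_bins))
print(sorted(small_bins))
```

With the default threshold only the larger toy bin survives, which mirrors why the prediction table can list many more "viral" bins than the .fna file contains.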

I hope you find this information helpful.

Best, Joachim

Jiulong-Zhao commented 2 years ago

Hi @joacjo, I really appreciate your kind and patient reply!

1. As you said, I checked some viral bin sequences and found that all contigs were joined without any gaps. So I wonder whether these contigs were joined in a random order to form the bin sequences. I have some other viral bins (actually NCLDV MAGs) obtained through other methods, and I want to merge all these viral bins and then cluster them at the species level. Can I join the contigs of my NCLDV MAGs in a random order to generate the viral bins? Additionally, could joining contigs in a random order affect downstream analyses of these viral bins, such as gene annotation?
2. Thanks for pointing out my mistake in selecting the HQ viral bins. Should I select only the HQ viral MAGs evaluated by CheckV's AAI-based high-confidence model, or both the high-confidence and medium-confidence models? Do you think Medium-quality viral MAGs should also be selected for downstream analyses?
3. Is this tool suitable for binning NCLDV MAGs, or only for binning phage MAGs?

Thanks for your reply! Best, Jiulong

joacjo commented 2 years ago

Hi Jiulong

1. If you obtained your NCLDV MAGs with either Metabat2 or VAMB, there is no order to the contigs; they were simply grouped together. If you want to dereplicate them together with your new viral bins, you could do so at the MAG level with something like https://github.com/MrOlm/drep. I believe it takes fasta files as input, that is, one fasta file for each bin with the contigs as separate fasta entries.

   Gene annotation could potentially be affected by the random order, although I think it is quite rare and I have not seen a benchmark paper on this. For example, if the last gene on Contig 1 is partial and has no stop codon, it may get connected to the first gene on Contig 2 that only has a stop codon, creating an artificial complete gene. So in the worst case, you might make one artificial gene for every 2 contigs in your bin, if you are very unlucky.

2. I would include both high-confidence and medium-confidence for High-quality AAI-predicted bins. Regarding Medium-quality viral bins, I would probably get a third-party prediction on those to increase the confidence in their "viralness", for example by also running Virfinder or VIBRANT.
3. That is a good question. I am sure the VAMB binner also bins NCLDV MAGs, since we have found several huge viruses in our benchmarks. The more relevant question is how those can be confidently identified.
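The junction artefact described in the first answer above can be reproduced with a toy example. This is only a sketch: complete_orfs is a hypothetical, deliberately naive forward-strand ORF finder (real gene callers behave differently and can report partial genes at contig edges), and the two short "contigs" are made-up sequences.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def complete_orfs(seq):
    """Return forward-strand ORFs (ATG ... stop codon) in all three frames.

    Naive illustration only: no reverse strand, no minimum length.
    """
    orfs = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for idx, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = idx
            elif start is not None and codon in STOP_CODONS:
                orfs.append("".join(codons[start:idx + 1]))
                start = None
    return orfs


# Contig 1 ends inside a gene: it has a start codon but no stop codon.
contig_1 = "ATGGCTGCAGGT"
# Contig 2 begins with an unrelated gene tail: a stop codon but no start.
contig_2 = "GAAACCTAA"

print(complete_orfs(contig_1))             # no complete ORF on its own
print(complete_orfs(contig_2))             # no complete ORF on its own
print(complete_orfs(contig_1 + contig_2))  # one artificial "complete" ORF
```

Neither contig yields a complete ORF by itself, but the gap-free concatenation does, because the open reading frame runs across the junction.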

Best, Joachim

Jiulong-Zhao commented 2 years ago

Hi Joachim,

1. Regarding the worst-case scenario you mentioned, I totally agree with you! So, how about splitting the viral bins back into their original contigs? That would give one fasta file for each bin with the contigs as separate fasta entries, which also suits dereplication with dRep.
2. You are right that it is difficult to confidently identify NCLDV or phage MAGs with this tool. Anyway, developing such a powerful tool is already a great contribution!

Best, Jiulong

joacjo commented 2 years ago

Hi Jiulong

Regarding (1): Yes, writing the contigs of your viral bins of interest into separate fasta files is the most straightforward and safest way to do gene annotation. Plus, as you say, it puts your viral bins in a format suitable for dereplication with tools like dRep 👍
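One way to produce such per-bin files is sketched below. Several things here are assumptions rather than part of PHAMB or VAMB: the helper names read_fasta and write_bins are hypothetical, and the cluster table is assumed to be a headerless, tab-separated file with the bin name in the first column and the contig name in the second, so check the layout of your own cluster file first.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def read_fasta(path):
    """Parse a FASTA file into a dict of record name (first word) -> sequence."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def write_bins(contigs_fasta, clusters_tsv, out_dir, wanted_bins):
    """Write one FASTA per wanted bin, keeping every contig as its own entry."""
    contigs = read_fasta(contigs_fasta)
    members = defaultdict(list)
    with open(clusters_tsv) as handle:
        for line in handle:
            if not line.strip():
                continue
            # Assumed column order: bin name, then contig name.
            bin_name, contig_name = line.rstrip("\n").split("\t")[:2]
            members[bin_name].append(contig_name)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for bin_name in wanted_bins:
        if bin_name not in members:
            continue  # skip bins that are absent from the cluster table
        path = out_dir / f"{bin_name}.fna"
        with open(path, "w") as out:
            for contig_name in members[bin_name]:
                out.write(f">{contig_name}\n{contigs[contig_name]}\n")
        written.append(path)
    return written


# Toy demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "contigs.fna").write_text(">c1\nACGT\n>c2\nGGCC\n>c3\nTTAA\n")
    (tmp / "clusters.tsv").write_text("bin1\tc1\nbin1\tc2\nbin2\tc3\n")
    paths = write_bins(tmp / "contigs.fna", tmp / "clusters.tsv",
                       tmp / "bins", ["bin1"])
    bin1_text = paths[0].read_text()
print(bin1_text)
```

Because each contig keeps its own header, no reading frame can run across a contig boundary, and the resulting directory of per-bin files matches the input layout described above for dereplication.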

I hope this information was helpful to you. I will close the issue; feel free to open another if other questions or suggestions arise.

Best, Joachim

actledge commented 10 months ago

Hi,

I have some questions I would like to ask. I apologize for my lack of experience in metagenomic analysis, which leads me to question the practice of directly concatenating the sequences within the same "bin" in the output file. I would like to understand the purpose behind this approach and how to interpret and use the output from phamb. What is phamb typically used for?

I thought that the "binning" step clusters similar contigs (perhaps to select a representative sequence for redundancy reduction, or simply to understand the clustering patterns of sequences) and potentially assembles them further to obtain more complete sequences. However, if all contigs from the same bin are concatenated in random order, it seems that neither of these purposes is achieved.

So, if my goal is to assemble and acquire viral sequences from a metagenomic sequencing dataset as comprehensively as possible, should I not use the fasta file output from phamb? Instead, should I run virus prediction or completeness assessment software (e.g., CheckV) on all sequences within each viral bin predicted by phamb, and then extract each viral contig sequence separately? In that case, I'm not sure when and how to make use of the results generated by phamb/vamb. To be more specific, even though the results from vamb indicate clustering, it seems that my downstream analysis still needs to be performed on individual sequences rather than clusters. Does this also mean that the concatenated sequences output by phamb cannot be treated as complete or fragmented genome sequences in downstream analysis?

If I could receive your answer, I would greatly appreciate it.

joacjo commented 10 months ago

Hi actledge

The binning step with VAMB clusters contigs likely originating from the same genome (not similar contigs).

Therefore the Phamb workflow helps you, in some cases, recover more complete virus genomes as vMAGs compared to evaluating single contigs only. The suggested workflow is to run the binning step to find putative vMAGs, then run CheckV and select bona fide viruses (i.e. MQ and HQ).
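The selection step could look roughly like the sketch below, which also applies the earlier advice in this thread to prefer AAI-based estimates over HMM-based ones. The column names (contig_id, checkv_quality, completeness_method) and the label strings are assumptions about CheckV's quality_summary.tsv and should be checked against the output of your CheckV version; the function select_bins and the four-row table are made up for illustration.

```python
import csv
import io


def select_bins(quality_summary,
                qualities=("High-quality", "Medium-quality"),
                method_prefix="AAI-based"):
    """Return IDs whose quality tier and completeness method both pass.

    quality_summary: open file-like object with a tab-separated table.
    Column names and label strings are assumed, not guaranteed.
    """
    reader = csv.DictReader(quality_summary, delimiter="\t")
    return [
        row["contig_id"]
        for row in reader
        if row["checkv_quality"] in qualities
        and row["completeness_method"].startswith(method_prefix)
    ]


# Made-up table standing in for a real CheckV summary file.
toy_table = (
    "contig_id\tcheckv_quality\tcompleteness_method\n"
    "bin_a\tHigh-quality\tAAI-based (high-confidence)\n"
    "bin_b\tHigh-quality\tHMM-based (lower-bound)\n"
    "bin_c\tMedium-quality\tAAI-based (medium-confidence)\n"
    "bin_d\tLow-quality\tAAI-based (high-confidence)\n"
)

mq_and_hq = select_bins(io.StringIO(toy_table))
hq_only = select_bins(io.StringIO(toy_table), qualities=("High-quality",))
print(mq_and_hq)
print(hq_only)
```

For a real run, the same function could be given an open handle to the summary file instead of the in-memory table. Medium-quality bins kept this way would still benefit from the independent check with a second viral predictor that was suggested earlier in the thread.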

Hope this helps!

actledge commented 10 months ago

Hi joacjo,

Thank you very much for your response, it has been really helpful to me. But I am still slightly confused, because I am relatively new to metagenomic analysis. The contig order in vMAGs is indeed arbitrary, although it seems that this may not significantly affect gene annotation or completeness assessment. But does this also mean that vMAGs obtained through such a binning approach cannot be considered a "genome" sequence and cannot be uploaded to public databases (such as NCBI) as a draft or complete genome? Or is an arbitrary contig order generally accepted for vMAGs obtained from metagenomic sequencing?

Or is this a strategy specific to virus research? As far as I understand, most current analysis software for phages or viruses treats a single sequence as one virus by default. Is that why all sequences are concatenated into a single vMAG, instead of being treated as different "contigs" from the same genome, as is done for other organisms, and placed in the same fasta file to indicate that they originate from the same genome?

Thanks!

joacjo commented 10 months ago

Hi actledge

Ah! If the purpose of your research is to identify and upload new and complete virus genomes to NCBI, then I would not recommend phamb, which is more oriented towards virus research. For this purpose, I would advise you to run geNomad followed by CheckV on the individual contigs.

There are recommended, strict guidelines for submitting new and complete viruses to NCBI, and they do not currently cover vMAGs. Check out this paper: https://www.nature.com/articles/s41587-023-01844-2

RasmussenLab / phamb

Result assessment! #30