faylward / viralrecall

Detection of NCLDV signatures in 'omic data

Assigning taxonomy on NCLDVs? The missing link. #15

Closed JSSaini closed 1 year ago

JSSaini commented 2 years ago

I have obtained hundreds of promising NCLDV genomes from the Lake Cadagno metagenomics dataset. However, I am struggling to assign taxonomy to them. Some names that are present in the GVDB database are missing from the "spreadsheet of annotated genomes". https://faylward.github.io/GVDB/

Please suggest some steps or a script for assigning taxonomy to prospective NCLDV genomes. Thank you.

faylward commented 2 years ago

Thanks for your interest! It sounds like you have some very interesting data. Just to make sure we are talking about the same thing: to assign taxonomy confidently, it is necessary to bin contigs and get draft genomes. Assigning taxonomy to contigs individually is very difficult; some are very short and lack marker genes, for example. There are many ways to do binning. Simple tools like MetaBat2 actually do a fairly good job, but there are other alternatives (see https://merenlab.org/2022/01/03/giant-viruses/).

If you already have bins/genomes, then I would recommend making a phylogeny using ncldv_markersearch with your genomes together with references (https://github.com/faylward/ncldv_markersearch). The default options of this tool use 7 marker genes to make the concatenated alignment, and you can then use IQ-TREE to make the final tree.
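The tree-building step mentioned above could be sketched roughly as follows. This is only an illustration: the alignment file name `concatenated_markers.aln` is a hypothetical placeholder for whatever ncldv_markersearch produced, the executable name `iqtree2` assumes IQ-TREE 2 is installed under that name, and the choices of ModelFinder (`-m MFP`) and 1000 ultrafast bootstraps (`-B 1000`) are assumptions, not settings recommended in this thread.

```python
import os
import shutil
import subprocess


def build_iqtree_command(alignment, threads="AUTO", bootstraps=1000,
                         executable="iqtree2"):
    """Build an IQ-TREE 2 command line for a concatenated alignment.

    -s  input alignment
    -m  MFP: let ModelFinder pick the substitution model (an assumption)
    -B  number of ultrafast bootstrap replicates
    -T  number of threads (AUTO lets IQ-TREE decide)
    """
    return [executable, "-s", alignment, "-m", "MFP",
            "-B", str(bootstraps), "-T", str(threads)]


# Hypothetical file name; replace with the real alignment file.
cmd = build_iqtree_command("concatenated_markers.aln")
print(" ".join(cmd))

# Only run when IQ-TREE 2 is installed and the alignment actually exists.
if shutil.which(cmd[0]) and os.path.exists(cmd[2]):
    subprocess.run(cmd, check=True)
```

Keeping the command construction in a function makes it easy to print and inspect the call before spending compute time on the actual tree.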

As for the GVDB, please give me an example of a virus that is missing. Please note that some viruses with aberrant phylogenetic placement ("rogue taxa") were removed.

JSSaini commented 2 years ago

Thanks for your prompt reply. Yes, I followed the customized binning using MetaBat2 as described in (https://www.nature.com/articles/s41586-020-1957-x#Sec2). Then I obtained promising NCLDV genomes through quality assessment with ViralRecall (score > 1). After collecting all the promising (score > 1) NCLDV genomes from the Lake Cadagno water column, I used dRep to keep only representative/unique genomes (n=153). Following your suggestions, it seems I am at the phylogeny step. Thanks a lot for sharing the ncldv_markersearch script. :)

faylward commented 2 years ago

Sounds good. Feel free to email if you have more specific questions: faylward at vt.edu. I am compiling a reduced set of reference genomes that may be easier to use for trees.

JSSaini commented 2 years ago

The ncldv_markersearch script worked perfectly and yielded the alignment file for the tree. Please let me know where I can find reference NCLDV genomes so that I can add them to the tree. A reduced set would be great. Thank you.

Achuan-2 commented 2 years ago

Hello, can I ask how you got the promising (score > 1) NCLDV genomes? Is the score > 1 for each contig or for the whole bin? Did you use the viralrecall "--minscore 1" parameter, or did you use "-c" and then filter for all contigs that score > 1?

JSSaini commented 2 years ago

I used the -c flag first and then took the mean score of all contigs in each bin. Bins with a mean score > 1 were considered potential NCLDV.

Achuan-2 commented 2 years ago

Thanks a lot. Did you filter out contigs that score < 0? I ask because I am using viralrecall to try to recover NCLDV from an animal gut metagenome, but I can't find a bin that satisfies mean score > 1.

faylward commented 2 years ago

I wouldn't be surprised if there are no NCLDV bins in an animal gut metagenome. I've surveyed quite a few gut metagenomes and they typically don't have many NCLDV (most of the viruses I pick up are Caudovirales). You could manually check a few borderline cases just to be sure, but unless you have another reason to think that NCLDV should be present, there may not be any.

JSSaini commented 2 years ago

I calculated the scores for all contigs (-c flag) of each bin. I then imported the ViralRecall output into R and calculated the mean score per bin.
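For readers who prefer Python to R, the per-bin averaging described above might look something like the sketch below. It is not the script used in this thread. The column names `replicon` and `score`, the contig-to-bin mapping, and the toy values are all assumptions made for illustration; check the header of your own ViralRecall output and adjust accordingly.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_score_per_bin(summary_tsv, contig_to_bin,
                       contig_col="replicon", score_col="score"):
    """Return {bin_name: mean contig score} from a per-contig score table.

    summary_tsv   : open file-like object with a tab-separated header row
    contig_to_bin : dict mapping contig ID -> bin name
    Column names are assumptions; adjust them to match the real output.
    """
    scores = defaultdict(list)
    for row in csv.DictReader(summary_tsv, delimiter="\t"):
        bin_name = contig_to_bin.get(row[contig_col])
        if bin_name is None:
            continue  # contig was not assigned to any bin
        scores[bin_name].append(float(row[score_col]))
    return {name: mean(values) for name, values in scores.items()}


def candidate_ncldv_bins(bin_means, threshold=1.0):
    """Return the sorted names of bins whose mean score exceeds the threshold."""
    return sorted(name for name, m in bin_means.items() if m > threshold)


# Toy data standing in for a real per-contig table (values are made up).
toy_tsv = io.StringIO(
    "replicon\tscore\n"
    "c1\t2.5\n"
    "c2\t0.5\n"
    "c3\t-1.0\n"
    "c4\t0.0\n"
)
toy_bins = {"c1": "bin_A", "c2": "bin_A", "c3": "bin_B", "c4": "bin_B"}

bin_means = mean_score_per_bin(toy_tsv, toy_bins)
print(candidate_ncldv_bins(bin_means))
```

Note that this takes an unweighted mean over contigs, as described in the comment above; it does not weight by contig length or drop negative-scoring contigs.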

Achuan-2 commented 2 years ago

I got it, thank you both very much!