faylward / GVDB

Giant Virus Database

Some questions about identification of 5 NCLDV marker genes in bins #1

Closed ZongzhiWu closed 2 years ago

ZongzhiWu commented 2 years ago

Dear faylward, I recently tried to reconstruct NCLDV bins from my metagenomes, mainly following the methods in your article "Dynamic genome evolution and complex virocell metabolism of globally-distributed giant viruses". I see that you identified 5 NCLDV marker genes in bins with HMMER: Major Capsid Protein (MCP), Superfamily II helicase (SFII), Virus-like transcription factor (VLTF3), DNA Polymerase B (PolB), and packaging ATPase (A32). However, I don't know how to set the e-value and score cutoffs for this step. You also recently developed the GVDB (https://github.com/faylward/GVDB). Can the HMMs for the five marker genes in the GVDB be used to screen bins for these 5 markers, and what HMMER cutoff should I use with the GVDB?

ZongzhiWu commented 2 years ago

I also referred to Frederik Schulz's methods in his article "Giant virus diversity and host interactions through global metagenomics". It seems that he combined untargeted binning and NCLDV-targeted binning to reconstruct NCLDV bins. Do you know the difference between untargeted and targeted binning?

faylward commented 2 years ago

For each HMM we have score cutoffs we typically use (bit scores, not e-values). If you want to screen the bins to ensure there are no anomalous contigs, you could use viralrecall, or if you want to make a concatenated protein alignment for a tree, you could use ncldv_markersearch (both on my GitHub, both with HMM score cutoffs used in the results). The main challenge is making sure you are comfortable with your bins, i.e. assessing whether they have non-viral contamination or possibly represent multiple distinct viruses that were binned together. If the bins look good (no multiple copies of PolB, for example), then I would go ahead with phylogenetic placement to see where they fall. Good luck!
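To illustrate the idea of per-HMM bit-score cutoffs and the multiple-PolB check, here is a minimal Python sketch. It is not the viralrecall or ncldv_markersearch code. It assumes the hits come from an `hmmsearch --tblout` table (target name = protein, query name = HMM, full-sequence bit score in the sixth column). The cutoff values, the HMM names used as keys, and the `protein_to_bin` mapping are hypothetical placeholders; real cutoffs should be taken from the tools themselves.

```python
from collections import defaultdict

# Hypothetical bit-score cutoffs, for illustration only. Real values should
# be taken from viralrecall / ncldv_markersearch, not from this sketch.
CUTOFFS = {"PolB": 200.0, "A32": 100.0, "MCP": 100.0}


def parse_tblout(lines):
    """Yield (protein, hmm, bit_score) from hmmsearch --tblout lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        # tblout columns: 0 = target name, 2 = query name,
        # 5 = full-sequence bit score
        yield fields[0], fields[2], float(fields[5])


def markers_per_bin(lines, protein_to_bin, cutoffs=CUTOFFS):
    """Count hits per marker in each bin, keeping hits at or above the cutoff."""
    counts = defaultdict(lambda: defaultdict(int))
    for protein, hmm, score in parse_tblout(lines):
        if hmm in cutoffs and score >= cutoffs[hmm]:
            counts[protein_to_bin[protein]][hmm] += 1
    return {bin_id: dict(markers) for bin_id, markers in counts.items()}


def flag_multicopy(counts, marker="PolB"):
    """Return the sorted bins that carry more than one copy of a marker."""
    return sorted(b for b, m in counts.items() if m.get(marker, 0) > 1)
```

Calling `markers_per_bin(open("hits.tblout"), protein_to_bin)` and passing the result to `flag_multicopy` would list bins worth a closer look before phylogenetic placement; the file name here is also just an example.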

ZongzhiWu commented 2 years ago

A candidate class, 'Mirusviricetes', was recently discovered in Tara Oceans metagenomes and published on bioRxiv: https://doi.org/10.1101/2021.12.27.474232. 'Mirusviricetes' lacks the gene for MCP but has many proteins that other NCLDVs do not encode, for example TATA-binding proteins, histones, proteases, and viral rhodopsins. This is a surprising finding. What do you think of this novel class, which is unlike common NCLDVs? Best wishes!

faylward commented 2 years ago

Yes, this is very interesting! Note that MCP is not needed for using the tools I mentioned: viralrecall and ncldv_markersearch will work with just a few conserved markers (PolB, RNAP, etc.). Other viruses that lack a canonical MCP, such as pandoraviruses and pithoviruses, are also included in the GVDB. If you are concerned about phylogenetic placement, you could look at individual marker gene trees to see if they are consistent.

ZongzhiWu commented 2 years ago

Thanks a lot for your patience! Good luck~

faylward commented 2 years ago

All the best on your work!