AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

About the use of Pfam HMM files #36

Closed by bingxue-98 3 years ago

bingxue-98 commented 3 years ago

Hello, I have a question about the VIBRANT workflow. I've noticed that in the step before the neural network, 75-85% of non-viral scaffolds are removed based on the KEGG and Pfam annotation results. Scaffolds with fewer than 15 total Pfam annotations, or with a Pfam annotation density under 60%, are retained. For Pfam annotation, all Pfam HMM files are used as the database. However, these HMM files also include viral proteins. Is there any chance that a viral scaffold is misidentified as a prokaryotic one in this step because of its number of Pfam annotations? Thank you.

KrisKieft commented 3 years ago

Hi,

The chance is there, but it's very low. Besides the total number of Pfam annotations, the other metric involved is the v-score. You can check out Additional file 19 in the manuscript for further details. The v-score takes into account the virus-like nature of the annotations. So a scaffold may have many Pfam annotations with high v-scores and still be retained through the neural network step. That way, if it is annotated with viral Pfams, it will not be thrown out. I hope that clears things up.

Kris
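
For readers following the thread, the filter described above can be sketched roughly as follows. This is only an illustration, not VIBRANT's actual code. The 15-annotation and 60% density thresholds come from the question. The `Scaffold` class, the `is_retained` function, the `MIN_MEAN_VSCORE` cutoff, and the use of a mean v-score are hypothetical stand-ins for the v-score logic documented in the manuscript.

```python
from dataclasses import dataclass, field
from typing import List

MAX_PFAM_TOTAL = 15      # threshold quoted in the question
MAX_PFAM_DENSITY = 0.60  # threshold quoted in the question
MIN_MEAN_VSCORE = 1.0    # hypothetical cutoff, NOT taken from VIBRANT


@dataclass
class Scaffold:
    """Hypothetical container: one v-score per Pfam-annotated protein."""
    name: str
    n_proteins: int
    pfam_vscores: List[float] = field(default_factory=list)


def pfam_density(scaffold: Scaffold) -> float:
    """Fraction of the scaffold's proteins that carry a Pfam annotation."""
    if scaffold.n_proteins == 0:
        return 0.0
    return len(scaffold.pfam_vscores) / scaffold.n_proteins


def is_retained(scaffold: Scaffold) -> bool:
    """Return True if the scaffold would pass on to the next step."""
    n_pfam = len(scaffold.pfam_vscores)
    # Few Pfam hits, or a low Pfam density: keep the scaffold.
    if n_pfam < MAX_PFAM_TOTAL or pfam_density(scaffold) < MAX_PFAM_DENSITY:
        return True
    # Many Pfam hits: keep only if they look virus-like.
    # Averaging the v-scores is an assumption made for this sketch.
    mean_vscore = sum(scaffold.pfam_vscores) / n_pfam
    return mean_vscore >= MIN_MEAN_VSCORE


if __name__ == "__main__":
    viral_like = Scaffold("scaffold_1", 20, [2.0] * 18)
    host_like = Scaffold("scaffold_2", 20, [0.0] * 18)
    print(is_retained(viral_like), is_retained(host_like))
```

In this sketch a scaffold dense in Pfam annotations is discarded only when those annotations also score low, which is the point of the answer above: viral Pfam hits raise the v-score and keep the scaffold in the pipeline.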