ArifaKhanLab / RVDB

A reference viral database (RVDB)
https://rvdb.dbi.udel.edu/

Incorrect virus names in U-RVDBv12.2-prot.fasta #2

Closed. terrycojones closed this issue 6 years ago

terrycojones commented 6 years ago

Hi. First of all, thanks so much for all this work. It looks like a very promising middle ground between the small selection in the RefSeq database and the unruly, enormous nt database.

I have found a couple of cases where what should be a virus name (in [...] at the end of the sequence ID) is actually the name of a host species.

$ grep -F '[Gorilla gorilla]' U-RVDBv12.2-prot.fasta
>acc|GENBANK|CAE12263.1|GENBANK|AJ577596|FRD envelope protein [Gorilla gorilla]
>acc|GENBANK|AAM68167.1|GENBANK|AY101588|envelope glycoprotein [Gorilla gorilla]
>acc|GENBANK|AAM68168.1|GENBANK|AY101589|envelope glycoprotein [Gorilla gorilla]
>acc|GENBANK|ABB73024.1|GENBANK|DQ256474|syncytin 1 [Gorilla gorilla]
>acc|GENBANK|AGI61266.1|GENBANK|KC010498|envelope protein ENVV1 [Gorilla gorilla]
>acc|GENBANK|AGI61275.1|GENBANK|KC010510|envelope protein ENVV2 [Gorilla gorilla]

and the same occurs for a grep on [Homo sapiens], though with many more hits:

$ grep -F '[Homo sapiens]' U-RVDBv12.2-prot.fasta | head -n 10
>acc|GENBANK|BAB47555.1|GENBANK|AB050996|envelope protein [Homo sapiens]
>acc|GENBANK|BAB47556.1|GENBANK|AB050999|envelope protein [Homo sapiens]
>acc|GENBANK|BAB47557.1|GENBANK|AB051000|envelope protein [Homo sapiens]
>acc|GENBANK|BAB47558.1|GENBANK|AB051004|envelope protein [Homo sapiens]
>acc|GENBANK|BAB47559.1|GENBANK|AB051007|envelope protein [Homo sapiens]
>acc|GENBANK|BAB47560.1|GENBANK|AB051008|envelope protein [Homo sapiens]
>acc|GENBANK|BAB47561.1|GENBANK|AB051009|envelope protein [Homo sapiens]
>acc|GENBANK|BAB47562.1|GENBANK|AB051010|envelope protein [Homo sapiens]
>acc|GENBANK|BAG06168.1|GENBANK|AB266802|putative retroviral envelope protein [Homo sapiens]
>acc|GENBANK|BAN04646.1|GENBANK|AB610407|suppressyn [Homo sapiens]
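
To get a rough idea of how widespread this is, one option might be to tally every bracketed name at the end of the header lines rather than grepping for one species at a time. This is only a minimal sketch, assuming every header ends with a single [...] group as in the output above; the function name `tally_bracketed_names` is made up for illustration and is not part of RVDB.

```shell
# Hypothetical helper: count how often each bracketed name appears
# at the very end of FASTA header lines read from standard input.
# Assumes headers look like the ones quoted in this issue, i.e.
# ">...|description [Name]" with nothing after the closing bracket.
tally_bracketed_names() {
    grep '^>' |                               # keep header lines only
        sed -n 's/.*\[\([^]]*\)]$/\1/p' |     # extract the text inside the final [...]
        sort |
        uniq -c |                             # count occurrences of each name
        sort -rn                              # most frequent names first
}
```

It could be run as `tally_bracketed_names < U-RVDBv12.2-prot.fasta | head -n 20`. Names that are not viruses (such as Homo sapiens) may then stand out in the list, though headers with trailing whitespace or no bracketed name are silently skipped by this sketch.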

Regards, Terry

terrycojones commented 6 years ago

This seems to only be an issue with the protein version of RVDB. I'll ask them. Feel free to close this.

terrycojones commented 6 years ago

I've mailed Marc & Thomas about this. Closing.

ArifaKhanLab commented 6 years ago

They are endogenous (retro)virus (EN(R)V) sequences that exist in their host's genome. We are working on a solution to determine whether the sequences belong to the host or to an EN(R)V in those particular cases.

terrycojones commented 6 years ago

Thanks @ArifaKhanLab !