ArifaKhanLab / RVDB

A reference viral database (RVDB)
https://rvdb.dbi.udel.edu/

RVDB v16.0: most sequences contain a date in the FASTA header instead of the sequence description #5

Closed peterk87 closed 4 years ago

peterk87 commented 5 years ago

Of the 2,820,860 sequences in the v16.0 FASTA file, 2,811,816 have headers with what appears to be a date, such as 25-AUG-2016, in place of the virus name or description. Only the remaining 9,044 sequences have what appear to be regular names.

For example:

>AB504233.1 25-AUG-2016

has the name Sapovirus Tamagawa River/Site2_a/Nov2003/JP gene for capsid protein, partial cds (https://www.ncbi.nlm.nih.gov/nuccore/AB504233.1). I'm not sure where the 25-AUG-2016 comes from.
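
For anyone who wants to reproduce the count, here is a minimal sketch of how the tally could be done. It assumes headers look like the example above (an accession, whitespace, then the description) and that the date always has the form DD-MON-YYYY; the function name `count_date_headers` and the sample records are made up for illustration and are not part of RVDB.

```python
import re
from typing import Iterable, Tuple

# Matches a description that consists solely of a date such as "25-AUG-2016".
# Assumption: dates always use the DD-MON-YYYY form seen in the example header.
DATE_ONLY = re.compile(r"^\d{2}-[A-Z]{3}-\d{4}$")


def count_date_headers(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_headers, date_only_headers) for an iterable of FASTA lines."""
    total = 0
    date_only = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        total += 1
        # Split the header into the accession and everything after it.
        parts = line[1:].strip().split(None, 1)
        description = parts[1] if len(parts) > 1 else ""
        if DATE_ONLY.match(description):
            date_only += 1
    return total, date_only


# Hypothetical sample: one date-only header and one with a regular description.
sample = [
    ">AB504233.1 25-AUG-2016",
    "ACGT",
    ">X00001.1 Some virus, complete genome",
    "ACGT",
]
print(count_date_headers(sample))
```

To run it against the real file, an open file handle can be passed in place of `sample`, since the function only iterates over lines.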

a7032018 commented 4 years ago

The date 25-AUG-2016 is the release date for the record. The original FASTA header downloaded from the GenBank FTP site contains this information.