Closed: ArtPoon closed this issue 5 years ago
First, we need to screen the virus genome records for entries that are not complete genomes.
> which(virus$Number.of.proteins==1)[1:10]
[1] 12 27 48 54 80 81 82 84 85 86
> virus[12,]
Family Genome Source.information Accession
12 Adenoviridae Bovine adenovirus 10 isolate:Ma268 NC_043093
Date.completed Date.updated Genome.length Number.of.proteins
12 06/28/2019 06/28/2019 571 1
Host n.overlaps
12 vertebrates 28
> orfs[orfs$accno=='NC_043093',]
accno product strand coords
334 NC_043093 hexon 1 0:571
aaseq
334 ASEYLSAGLVQFARATDSYFSLGNKFRNPTVAPTHDVTTERSRRLQLRFVPVDKEDTQYTYKTRFQLTVGDNRVLDMGSTYFDIRGVIDRGPSFKPYSGTAYNNLAPRSAPNNCFFKNDNGGHPDVAYAQLPFVGTREQQNLMVLNAEGQRVAADPIYQPEPQYGVDAWPQNRLGDFNAGRALKSDVTHL
This accession number does not actually appear in find_ovrfs.csv. Processing error in R script.
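One way to confirm this kind of mismatch is to compare the reported `n.overlaps` against the rows actually present in the overlaps table. The following is only a sketch on toy data: the `accno` column name for find_ovrfs.csv is an assumption (borrowed from the `orfs` table above), and every accession other than NC_043093 is a hypothetical placeholder.

```r
# Toy stand-ins for the real tables; only NC_043093 comes from this thread.
virus <- data.frame(
  Accession = c("NC_043093", "ACC_0001", "ACC_0002"),
  n.overlaps = c(28, 3, 0),
  stringsAsFactors = FALSE
)

# Assumed layout of find_ovrfs.csv: one row per overlap, keyed by `accno`.
ovrfs <- data.frame(
  accno = c("ACC_0001", "ACC_0001", "ACC_0001"),
  stringsAsFactors = FALSE
)

# Count the rows in the overlaps table for each accession (0 if absent).
observed <- vapply(
  virus$Accession,
  function(a) sum(ovrfs$accno == a),
  integer(1),
  USE.NAMES = FALSE
)

# Records whose reported n.overlaps disagrees with the overlaps table.
mismatch <- virus[which(virus$n.overlaps != observed), ]
print(mismatch)
```

Any record listed in `mismatch` has an `n.overlaps` value that could not have come from the overlaps file as loaded, which would point at the merge step rather than the overlap detection itself.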
We are seeing a large number of virus genomes with a single protein. This appears to be a misannotation in the NCBI database. For example, Bovine mastadenovirus C is listed as a virus genome with accession number NC_043093.1, but this record is labeled "Bovine adenovirus 10 isolate Ma268 hexon gene, partial cds".
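A simple screen along these lines could separate the single-protein records for manual review. This is a sketch on toy data, not the script used in this repository: the column names follow the console output above, and the accessions other than NC_043093 and all values besides its length and protein count are hypothetical.

```r
# Toy stand-in for the `virus` table; only the NC_043093 row reflects the thread.
virus <- data.frame(
  Accession = c("NC_043093", "ACC_0001", "ACC_0002"),
  Genome.length = c(571, 30000, 9000),
  Number.of.proteins = c(1, 35, 1),
  stringsAsFactors = FALSE
)

# Flag records annotated with a single protein as candidates for review.
# which() drops NA comparisons instead of producing NA rows.
suspect <- virus[which(virus$Number.of.proteins == 1), ]

# Keep the remaining records for downstream analysis.
screened <- virus[which(virus$Number.of.proteins > 1), ]

print(suspect$Accession)
print(nrow(screened))
```

The protein count is only a heuristic here: it may also flag genuinely complete genomes that encode one protein, so the flagged records are better checked against their NCBI descriptions (for example, for "partial cds") than dropped outright.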
In addition, this record has far too many entries in find_ovrfs.csv: