Error in computing overlaps

ArtPoon commented 5 years ago

We are seeing a large number of virus genomes with a single protein. This appears to be a misannotation in the NCBI database. For example, Bovine mastadenovirus C is listed as a virus genome with accession number NC_043093.1, but this record is labeled "Bovine adenovirus 10 isolate Ma268 hexon gene, partial cds".

In addition, this record has far too many entries in find_ovrfs.csv:

> overlaps[overlaps$accn=='NC_043093',]
            accn                          prod1   loc1 dir1
106352 NC_043093           hypothetical protein   2722   -1
106353 NC_043093           hypothetical protein   3021    1
106354 NC_043093                 transactivator   9623   -1
106355 NC_043093           hypothetical protein  15602    1
106356 NC_043093           hypothetical protein  18178   -1
106357 NC_043093       antigenic virion protein  19957   -1
106358 NC_043093                   glycoprotein  34850   -1
106359 NC_043093 Polymerase processivity factor  38918   -1
106360 NC_043093        capsid assembly protein  42998    1
106361 NC_043093        putative virion protein  54205   -1
106362 NC_043093                 virion protein  55371    1
106363 NC_043093                 Glycoprotein B  60707   -1
106364 NC_043093      transport/capsid assembly  63153   -1
106365 NC_043093               putative dUTPase  75142   -1
106366 NC_043093                  viron protein  82042    1
106367 NC_043093           hypothetical protein  97527    1
106368 NC_043093             Putative terminase  98576   -1
106369 NC_043093           hypothetical protein  99920    1
106370 NC_043093               tegument protein 100554    1
106371 NC_043093               tegument protein 101839    1
106372 NC_043093           hypothetical protein 103755    1
106373 NC_043093           hypothetical protein 104816    1
106374 NC_043093    Myristylated virion protein 108265    1
106375 NC_043093       helicase/primase complex 111950    1
106376 NC_043093         putative viron protein 114631   -1
106377 NC_043093       Helicase/primase complex 116414    1
106378 NC_043093           hypothetical protein 156043   -1
106379 NC_043093           hypothetical protein 156342    1

ArtPoon commented 5 years ago

First we need to screen the virus genome records and exclude those that are not complete genomes.
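A minimal sketch of one way to do that screening, assuming the record titles (e.g. the GenBank definition line) are available as a named character vector keyed by accession. The `titles` vector, the `flag_partial` helper, and the placeholder accession `NC_000000` are hypothetical and only for illustration; the `virus` column names follow the table printed in this thread.

```r
# Toy stand-in for the `virus` table; NC_000000 is a made-up placeholder row.
virus <- data.frame(
  Genome = c("Bovine adenovirus 10", "Example virus A"),
  Accession = c("NC_043093", "NC_000000"),
  Genome.length = c(571, 35000),
  Number.of.proteins = c(1, 30),
  stringsAsFactors = FALSE
)

# Hypothetical input: record titles keyed by accession, obtained separately.
titles <- c(
  NC_043093 = "Bovine adenovirus 10 isolate Ma268 hexon gene, partial cds",
  NC_000000 = "Example virus A, complete genome"
)

# Flag records whose title marks them as partial sequences.
# Accessions with no title are left unflagged rather than guessed at.
flag_partial <- function(virus, titles) {
  title <- titles[virus$Accession]
  unname(!is.na(title) & grepl("partial", title, ignore.case = TRUE))
}

virus$partial <- flag_partial(virus, titles)
complete <- virus[!virus$partial, ]
```

Matching on the word "partial" is only a heuristic; single-protein records that pass this filter would probably still need a manual look.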

ArtPoon commented 5 years ago

> which(virus$Number.of.proteins==1)[1:10]
 [1] 12 27 48 54 80 81 82 84 85 86
> virus[12,]
         Family               Genome Source.information Accession
12 Adenoviridae Bovine adenovirus 10      isolate:Ma268 NC_043093
   Date.completed Date.updated Genome.length Number.of.proteins
12     06/28/2019   06/28/2019           571                  1
          Host n.overlaps
12 vertebrates         28
> orfs[orfs$accno=='NC_043093',]
        accno product strand coords
334 NC_043093   hexon      1  0:571
                                                                                                                                                                                             aaseq
334 ASEYLSAGLVQFARATDSYFSLGNKFRNPTVAPTHDVTTERSRRLQLRFVPVDKEDTQYTYKTRFQLTVGDNRVLDMGSTYFDIRGVIDRGPSFKPYSGTAYNNLAPRSAPNNCFFKNDNGGHPDVAYAQLPFVGTREQQNLMVLNAEGQRVAADPIYQPEPQYGVDAWPQNRLGDFNAGRALKSDVTHL

ArtPoon commented 5 years ago

This accession number does not actually appear in find_ovrfs.csv, so the rows attributed to it must come from a processing error in the R script.
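The cause of the error is not identified in this thread. As a sanity check, here is a sketch that counts overlaps per accession and attaches them to the genome table by name rather than by row position, so an accession missing from the CSV gets zero. The toy CSV, its `accn` column, and the accessions `NC_000001` and `NC_000002` are assumptions for illustration, not the real contents of find_ovrfs.csv.

```r
# Toy file standing in for find_ovrfs.csv (column layout is assumed).
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(accn = c("NC_000001", "NC_000001", "NC_000002"),
             loc1 = c(10, 50, 7)),
  tmp, row.names = FALSE
)

# Read accessions as plain strings, not factors.
overlaps <- read.csv(tmp, stringsAsFactors = FALSE)

virus <- data.frame(
  Accession = c("NC_000001", "NC_000002", "NC_043093"),
  stringsAsFactors = FALSE
)

# Count rows per accession, then look each genome up by name.
counts <- table(overlaps$accn)
idx <- match(virus$Accession, names(counts))
virus$n.overlaps <- ifelse(is.na(idx), 0L, as.integer(counts)[idx])

# Confirm directly whether the accession is present in the file.
in.file <- "NC_043093" %in% overlaps$accn
```

With this toy input, `in.file` is `FALSE` and `NC_043093` receives zero overlaps, which is the behaviour one would expect for the real record.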

PoonLab / ovrf-viz

Error in computing overlaps #4