Closed ababaian closed 4 years ago
There is an error in the cov3ma.sumzer.tsv file, which comes from the covref3.sumzer.tsv input.
Consider these CoV sequences, where the second column is the "length". It appears the offset was placed there instead.
KY370052.1 221 Rodent coronavirus isolate RtMm-CoV-1/IM2014 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 221 30000
KY370051.1 1 Rodent coronavirus isolate RtBi-CoV/FJ2015 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 1 30000
KY370050.1 38 Rodent coronavirus isolate RtRl-CoV/FJ2015 ORF1ab polyprotein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 38 30000
KY370049.1 1 Rodent coronavirus isolate RtNn-CoV/SAX2015 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 1 30000
KY370048.1 1 Rodent coronavirus isolate RtMm-CoV/GD2015 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 1 30000
KY370047.1 2 Rodent coronavirus isolate RtAp-CoV/Tibet2014 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 2 30000
KY370046.1 8 Rodent coronavirus isolate RtMruf-CoV-2/JL2014 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 8 30000
KY370045.1 52 Rodent coronavirus isolate RtMruf-CoV-1/JL2014 ORF1ab polyprotein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 52 30000
KY370044.1 1 Rodent coronavirus isolate RtAs-CoV/IM2014 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 1 30000
KY370043.1 1 Rodent coronavirus isolate RtRn-CoV/YN2013 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae 1 30000
These lengths are very short (1, 52, etc.), but the FASTA index file, for instance, shows that these are whole genomes:
KY370047.1 31290 27117470 60 61
KY370046.1 31393 27149294 60 61
KY370045.1 29197 27181223 60 61
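A check like the one described above could be scripted. The following is a minimal sketch, not the summarizer's actual code: it assumes the second whitespace-separated column of the summary file holds the reported length and that the FASTA index uses the standard .fai layout (name, then length). The function names are hypothetical.

```python
def parse_fai(lines):
    """Map sequence name -> length from FASTA index (.fai) lines."""
    lengths = {}
    for line in lines:
        fields = line.split()
        # .fai columns: name, length, offset, linebases, linewidth
        if len(fields) >= 2 and fields[1].isdigit():
            lengths[fields[0]] = int(fields[1])
    return lengths


def find_length_mismatches(summary_lines, fai_lengths):
    """Return (accession, reported, expected) for rows whose length disagrees."""
    mismatches = []
    for line in summary_lines:
        # Assumption: accession is column 1 and the reported length is column 2.
        fields = line.split(None, 2)
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # skip headers or malformed rows
        accession, reported = fields[0], int(fields[1])
        expected = fai_lengths.get(accession)
        if expected is not None and expected != reported:
            mismatches.append((accession, reported, expected))
    return mismatches


# Rows copied from this issue (summary rows truncated after the name).
fai = parse_fai([
    "KY370047.1 31290 27117470 60 61",
    "KY370046.1 31393 27149294 60 61",
    "KY370045.1 29197 27181223 60 61",
])
summary = [
    "KY370047.1 2 Rodent coronavirus isolate RtAp-CoV/Tibet2014",
    "KY370046.1 8 Rodent coronavirus isolate RtMruf-CoV-2/JL2014",
    "KY370045.1 52 Rodent coronavirus isolate RtMruf-CoV-1/JL2014",
]
for accession, reported, expected in find_length_mismatches(summary, fai):
    print(f"{accession}: summary says {reported}, index says {expected}")
```

All three rows are flagged here, since none of the reported lengths match the index.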
Fixed in a recently merged PR.
Keeping track here of changes that need to go into the summarizer for the next update, as we learn more from this data set.
aln=38;glb=3;len=221454;cvgpct=12;len=221454;depth=38;
The aln field is the same as the depth field for all records. This should be calculated as depth = aln * readlen / length.
Another bug is toplen=1; in many of the CoVs; this is in the meta-data.
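To illustrate the proposed depth formula, here is a minimal sketch that parses a summary record like the one above and recomputes depth. The read length is not part of the record, so it is passed in as a parameter; the value of 100 used below is purely an assumption for the example, and the function names are hypothetical rather than taken from the summarizer.

```python
def parse_summary(record):
    """Split a 'key=value;key=value;' summary record into a dict of strings."""
    fields = {}
    for item in record.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            # A repeated key (len appears twice above) keeps the last value.
            fields[key.strip()] = value.strip()
    return fields


def corrected_depth(record, read_length):
    """Compute depth = aln * readlen / length for one summary record."""
    fields = parse_summary(record)
    aln = int(fields["aln"])
    length = int(fields["len"])
    if length <= 0:
        raise ValueError("sequence length must be positive")
    return aln * read_length / length


record = "aln=38;glb=3;len=221454;cvgpct=12;len=221454;depth=38;"
# read_length=100 is an assumed value, not taken from the data.
print(corrected_depth(record, read_length=100))
```

With these inputs the recomputed depth is a small fraction rather than 38, which shows how far the current depth field is from the intended value.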