ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Summarizer Updates for next version #134

Closed ababaian closed 4 years ago

ababaian commented 4 years ago

This issue tracks changes that need to go into the summarizer for its next version, as we learn more from this data set.

  1. Example record: aln=38;glb=3;len=221454;cvgpct=12;len=221454;depth=38; The aln field is identical to the depth field in every record. Depth should instead be calculated as depth = aln * readlen / length.

  2. "Another bug is toplen=1; in many of the Covs, this is in the meta-data."
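
The corrected depth calculation from item 1 can be sketched as follows. This is only an illustration, not the summarizer's actual code: the helper names `parse_summary` and `mean_depth` are hypothetical, and the read length of 100 is an assumed value, since the record itself does not carry one.

```python
def parse_summary(record: str) -> dict[str, str]:
    """Split a 'key=value;' summary record into a dict.

    If a key is repeated (as 'len' is in the example record),
    the last occurrence wins.
    """
    fields = {}
    for item in record.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


def mean_depth(aln: int, readlen: int, length: int) -> float:
    """Mean depth: aligned bases (aln * readlen) divided by reference length."""
    if length <= 0:
        raise ValueError("reference length must be positive")
    return aln * readlen / length


record = "aln=38;glb=3;len=221454;cvgpct=12;len=221454;depth=38;"
fields = parse_summary(record)

READLEN = 100  # assumption: hypothetical read length, not taken from the record
depth = mean_depth(int(fields["aln"]), READLEN, int(fields["len"]))
print(f"reported depth={fields['depth']}, recomputed depth={depth:.4f}")
```

With 38 aligned reads on a reference of 221454 bases, the recomputed depth is well below 1, whereas the record reports depth=38, which is simply the alignment count.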

ababaian commented 4 years ago

There is an error in the cov3ma.sumzer.tsv file coming from the covref3.sumzer.tsv input.

Consider these CoV sequences, where the second column is the "length". It appears the offset was placed there instead.

KY370052.1  221 Rodent coronavirus isolate RtMm-CoV-1/IM2014 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds   Coronaviridae   221 30000
KY370051.1  1   Rodent coronavirus isolate RtBi-CoV/FJ2015 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae   1   30000
KY370050.1  38  Rodent coronavirus isolate RtRl-CoV/FJ2015 ORF1ab polyprotein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae   38  30000
KY370049.1  1   Rodent coronavirus isolate RtNn-CoV/SAX2015 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds    Coronaviridae   1   30000
KY370048.1  1   Rodent coronavirus isolate RtMm-CoV/GD2015 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae   1   30000
KY370047.1  2   Rodent coronavirus isolate RtAp-CoV/Tibet2014 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds  Coronaviridae   2   30000
KY370046.1  8   Rodent coronavirus isolate RtMruf-CoV-2/JL2014 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae   8   30000
KY370045.1  52  Rodent coronavirus isolate RtMruf-CoV-1/JL2014 ORF1ab polyprotein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae   52  30000
KY370044.1  1   Rodent coronavirus isolate RtAs-CoV/IM2014 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae   1   30000
KY370043.1  1   Rodent coronavirus isolate RtRn-CoV/YN2013 ORF1ab polyprotein, hemagglutinin-esterase protein, spike glycoprotein, envelope protein, membrane protein, and nucleocapsid protein genes, complete cds Coronaviridae   1   30000

These lengths are far too short (1, 52, etc.), whereas the FASTA index file, for instance, shows that these are whole genomes:

KY370047.1  31290   27117470    60  61
KY370046.1  31393   27149294    60  61
KY370045.1  29197   27181223    60  61
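
A check like the one described above could be automated by comparing the length column of the sumzer file against the FASTA index. The sketch below assumes the sumzer file is tab-separated with the accession in column 1 and the length in column 2, and relies on the standard .fai layout (name, length, offset, line bases, line width). The function names are hypothetical, and the description column in the toy input is shortened.

```python
import csv
import io


def load_fai_lengths(fai_handle) -> dict[str, int]:
    """Map sequence name -> length from a FASTA index (.fai columns 1 and 2)."""
    lengths = {}
    for row in csv.reader(fai_handle, delimiter="\t"):
        if len(row) >= 2:
            lengths[row[0]] = int(row[1])
    return lengths


def find_length_mismatches(sumzer_handle, fai_lengths):
    """Return (accession, sumzer_length, fai_length) for records that disagree."""
    mismatches = []
    for row in csv.reader(sumzer_handle, delimiter="\t"):
        # Skip malformed rows and accessions absent from the index.
        if len(row) < 2 or row[0] not in fai_lengths:
            continue
        reported = int(row[1])
        expected = fai_lengths[row[0]]
        if reported != expected:
            mismatches.append((row[0], reported, expected))
    return mismatches


# Toy in-memory inputs built from rows quoted in this issue.
fai = io.StringIO(
    "KY370047.1\t31290\t27117470\t60\t61\n"
    "KY370045.1\t29197\t27181223\t60\t61\n"
)
sumzer = io.StringIO(
    "KY370047.1\t2\tRodent coronavirus isolate\tCoronaviridae\t2\t30000\n"
    "KY370045.1\t52\tRodent coronavirus isolate\tCoronaviridae\t52\t30000\n"
)

mismatches = find_length_mismatches(sumzer, load_fai_lengths(fai))
for accession, reported, expected in mismatches:
    print(f"{accession}: sumzer length {reported}, fai length {expected}")
```

In real use the two `io.StringIO` objects would be replaced by open file handles for the index and the sumzer file.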

rcedgar commented 4 years ago

Fixed in recent merged PR.