UPHL-BioNGS / Cecret

Reference-based consensus creation
MIT License

Update 3.12.20240221 #298

Closed erinyoung closed 4 months ago

erinyoung commented 4 months ago

Comparing coverage with bbnorm (`cecret/`) and without it (`nonorm/`):

==> cecret/samtools_coverage/3528365.cov.hist <==
MN908947.3 (29.9Kbp)
>  90.00% │▁███████████████████████████████████▇██████████▇█ │ Number of reads: 199978
>  80.00% │██████████████████████████████████████████████████│ 
>  70.00% │██████████████████████████████████████████████████│ Covered bases:   29.8Kbp
>  60.00% │██████████████████████████████████████████████████│ Percent covered: 99.59%
>  50.00% │██████████████████████████████████████████████████│ Mean coverage:   862x
>  40.00% │██████████████████████████████████████████████████│ Mean baseQ:      32.8
>  30.00% │██████████████████████████████████████████████████│ Mean mapQ:       60
>  20.00% │██████████████████████████████████████████████████│ 
>  10.00% │██████████████████████████████████████████████████│ Histo bin width: 598bp
>   0.00% │██████████████████████████████████████████████████│ Histo max bin:   100%
          1        6.0K     12.0K     17.9K     23.9K      29.9K  

==> cecret/samtools_coverage/3540826-UT-A01290-240207.cov.hist <==
MN908947.3 (29.9Kbp)
>  90.00% │▇███████████████████████████████████▇██████████▇█▁│ Number of reads: 71398
>  80.00% │██████████████████████████████████████████████████│ 
>  70.00% │██████████████████████████████████████████████████│ Covered bases:   29.8Kbp
>  60.00% │██████████████████████████████████████████████████│ Percent covered: 99.72%
>  50.00% │██████████████████████████████████████████████████│ Mean coverage:   325x
>  40.00% │██████████████████████████████████████████████████│ Mean baseQ:      36.1
>  30.00% │██████████████████████████████████████████████████│ Mean mapQ:       60
>  20.00% │██████████████████████████████████████████████████│ 
>  10.00% │██████████████████████████████████████████████████│ Histo bin width: 598bp
>   0.00% │██████████████████████████████████████████████████│ Histo max bin:   100%
          1        6.0K     12.0K     17.9K     23.9K      29.9K  

==> nonorm/samtools_coverage/3528365.cov.hist <==
MN908947.3 (29.9Kbp)
>  90.00% │▇███████████████████████████████████▇██████████▇█▁│ Number of reads: 565714
>  80.00% │██████████████████████████████████████████████████│ 
>  70.00% │██████████████████████████████████████████████████│ Covered bases:   29.8Kbp
>  60.00% │██████████████████████████████████████████████████│ Percent covered: 99.73%
>  50.00% │██████████████████████████████████████████████████│ Mean coverage:   2.44e+03x
>  40.00% │██████████████████████████████████████████████████│ Mean baseQ:      32.7
>  30.00% │██████████████████████████████████████████████████│ Mean mapQ:       60
>  20.00% │██████████████████████████████████████████████████│ 
>  10.00% │██████████████████████████████████████████████████│ Histo bin width: 598bp
>   0.00% │██████████████████████████████████████████████████│ Histo max bin:   100%
          1        6.0K     12.0K     17.9K     23.9K      29.9K  

==> nonorm/samtools_coverage/3540826-UT-A01290-240207.cov.hist <==
MN908947.3 (29.9Kbp)
>  90.00% │████████████████████████████████████▇████████████▁│ Number of reads: 22334310
>  80.00% │██████████████████████████████████████████████████│ 
>  70.00% │██████████████████████████████████████████████████│ Covered bases:   29.8Kbp
>  60.00% │██████████████████████████████████████████████████│ Percent covered: 99.79%
>  50.00% │██████████████████████████████████████████████████│ Mean coverage:   1.02e+05x
>  40.00% │██████████████████████████████████████████████████│ Mean baseQ:      36.2
>  30.00% │██████████████████████████████████████████████████│ Mean mapQ:       60
>  20.00% │██████████████████████████████████████████████████│ 
>  10.00% │██████████████████████████████████████████████████│ Histo bin width: 598bp
>   0.00% │██████████████████████████████████████████████████│ Histo max bin:   100%
          1        6.0K     12.0K     17.9K     23.9K      29.9K  

erinyoung commented 4 months ago

And the final summary files:

==> cecret/cecret_results.txt <==
sample_id   sample  pangolin_lineage    nextclade_clade vadr_p/f    fasta_line  fastqc_raw_reads_1  fastqc_raw_reads_2  num_N   num_total   seqyclean_PairsKept seqyclean_Perc_Kept num_pos_100X    insert_size_after_trimming  bcftools_variants_identified    samtools_meandepth_after_trimming   samtools_per_1X_coverage_after_trimming vadr_model  vadr_alerts nextclade_clade_who nextclade_qc_overallscore   nextclade_qc_overallstatus  pangolin_conflict   pangolin_ambiguity_score    pangolin_scorpio_call   pangolin_scorpio_support    pangolin_scorpio_conflict   pangolin_scorpio_notes  pangolin_version    pangolin_pangolin_version   pangolin_scorpio_version    pangolin_constellation_version  pangolin_is_designated  pangolin_qc_status  pangolin_qc_notes   pangolin_note   pangocollapse_lineage   pangocollapse_Lineage_full  pangocollapse_Lineage_expanded  pangocollapse_Lineage_family    freyja_summarized   Cecret version  seqyclean   bwa ivar    ivar consensus
3528365 3528365 XCR recombinant PASS    3528365 325979.0    325979.0    718 29759   109728.0    97.3586 29040   171.0   122 861.881 99.5887 NC_045512   -   recombinant 3.396763    good    0.0     Omicron (XBB.1.5-like)  0.94    0.01    scorpio call: Alt alleles 82; Ref alleles 1; Amb alleles 1; Oth alleles 3   PUSHER-v1.25.1  4.3.1   0.3.19  v0.1.12 False   pass    Ambiguous content: 4%   Usher placements: XCR(1/1); scorpio lineage XBB.1.5 conflicts with inference lineage XCR    XCR XCR XCR Recombinant [('Other'  0.9999999999996719)] v3.12.20240221  seqyclean : Version: 1.10.09 (2018-10-16)   bwa : Version: 0.7.17-r1188 ivar : iVar version 1.4.2   iVar version 1.4.2
3540826-UT-A01290-240207    3540826-UT-A01290-240207    JN.1.1  23I PASS    3540826-UT-A01290-240207    12181621.0  12181621.0  111 29796   48071.0 87.1799 29685   187.6   132 325.301 99.7224 NC_045512   -   Omicron 0.0 good    0.0     Omicron (BA.2-like) 0.92    0.03    scorpio call: Alt alleles 57; Ref alleles 2; Amb alleles 0; Oth alleles 3   PUSHER-v1.25.1  4.3.1   0.3.19  v0.1.12 False   pass    Ambiguous content: 2%   Usher placements: JN.1.1(1/1)   JN.1.1  B.1.1.529.2.86.1.1.1    B.1.1.529:BA.2.86.1:JN.1.1  BA.2    [('BA.2.86* (BA.2.86X)'  0.999999999994306)]    v3.12.20240221  seqyclean : Version: 1.10.09 (2018-10-16)   bwa : Version: 0.7.17-r1188 ivar : iVar version 1.4.2   iVar version 1.4.2
bbnorm_test bbnorm  JN.1.1  23I PASS    bbnorm_test 55140.0 55140.0 111 29796   44283.0 87.4588 29685   188.2   132 304.238 99.7224 NC_045512   -   Omicron 0.0 good    0.0     Omicron (BA.2-like) 0.92    0.03    scorpio call: Alt alleles 57; Ref alleles 2; Amb alleles 0; Oth alleles 3   PUSHER-v1.25.1  4.3.1   0.3.19  v0.1.12 False   pass    Ambiguous content: 2%   Usher placements: JN.1.1(1/1)   JN.1.1  B.1.1.529.2.86.1.1.1    B.1.1.529:BA.2.86.1:JN.1.1  BA.2    [('BA.2.86* (BA.2.86X)'  0.9999999999906976)]   v3.12.20240221  seqyclean : Version: 1.10.09 (2018-10-16)   bwa : Version: 0.7.17-r1188 ivar : iVar version 1.4.2   iVar version 1.4.2

==> nonorm/cecret_results.txt <==
sample_id   sample  pangolin_lineage    nextclade_clade vadr_p/f    fasta_line  fastqc_raw_reads_1  fastqc_raw_reads_2  num_N   num_total   seqyclean_PairsKept seqyclean_Perc_Kept num_pos_100X    insert_size_after_trimming  bcftools_variants_identified    samtools_meandepth_after_trimming   samtools_per_1X_coverage_after_trimming vadr_model  vadr_alerts nextclade_clade_who nextclade_qc_overallscore   nextclade_qc_overallstatus  pangolin_conflict   pangolin_ambiguity_score    pangolin_scorpio_call   pangolin_scorpio_support    pangolin_scorpio_conflict   pangolin_scorpio_notes  pangolin_version    pangolin_pangolin_version   pangolin_scorpio_version    pangolin_constellation_version  pangolin_is_designated  pangolin_qc_status  pangolin_qc_notes   pangolin_note   pangocollapse_lineage   pangocollapse_Lineage_full  pangocollapse_Lineage_expanded  pangocollapse_Lineage_family    freyja_summarized   Cecret version  seqyclean   bwa ivar    ivar consensus
3528365 3528365 XCR recombinant PASS    3528365 325979.0    325979.0    758 29801   316176.0    96.9928 29043   170.8   125 2442.1  99.7325 NC_045512   -   recombinant 3.877421    good    0.0     Omicron (XBB.1.5-like)  0.94    0.01    scorpio call: Alt alleles 82; Ref alleles 1; Amb alleles 1; Oth alleles 3   PUSHER-v1.25.1  4.3.1   0.3.19  v0.1.12 False   pass    Ambiguous content: 4%   Usher placements: XCR(1/1); scorpio lineage XBB.1.5 conflicts with inference lineage XCR    XCR XCR XCR Recombinant [('Other'  0.9999999999965108)] v3.12.20240221  seqyclean : Version: 1.10.09 (2018-10-16)   bwa : Version: 0.7.17-r1188 ivar : iVar version 1.4.2   iVar version 1.4.2
3540826-UT-A01290-240207    3540826-UT-A01290-240207    JN.1.1  23I PASS    3540826-UT-A01290-240207    12181621.0  12181621.0  14  29805   11326208.0  92.9778 29815   186.5   137 101836.0    99.7927 NC_045512   -   Omicron 0.0 good    0.0     Omicron (BA.2-like) 0.92    0.03    scorpio call: Alt alleles 57; Ref alleles 2; Amb alleles 0; Oth alleles 3   PUSHER-v1.25.1  4.3.1   0.3.19  v0.1.12 False   pass    Ambiguous content: 2%   Usher placements: JN.1.1(1/1)   JN.1.1  B.1.1.529.2.86.1.1.1    B.1.1.529:BA.2.86.1:JN.1.1  BA.2    [('BA.2.86* (BA.2.86X)'  0.9977042473353005)]   v3.12.20240221  seqyclean : Version: 1.10.09 (2018-10-16)   bwa : Version: 0.7.17-r1188 ivar : iVar version 1.4.2   iVar version 1.4.2
bbnorm_test bbnorm  JN.1.1  23I PASS    bbnorm_test 55140.0 55140.0 111 29796   48071.0 87.1799 29685   187.6   132 325.301 99.7224 NC_045512   -   Omicron 0.0 good    0.0     Omicron (BA.2-like) 0.92    0.03    scorpio call: Alt alleles 57; Ref alleles 2; Amb alleles 0; Oth alleles 3   PUSHER-v1.25.1  4.3.1   0.3.19  v0.1.12 False   pass    Ambiguous content: 2%   Usher placements: JN.1.1(1/1)   JN.1.1  B.1.1.529.2.86.1.1.1    B.1.1.529:BA.2.86.1:JN.1.1  BA.2    [('BA.2.86* (BA.2.86X)'  0.999999999994306)]    v3.12.20240221  seqyclean : Version: 1.10.09 (2018-10-16)   bwa : Version: 0.7.17-r1188 ivar : iVar version 1.4.2   iVar version 1.4.2

erinyoung commented 4 months ago

Notably, normalization should not be used on wastewater or mixed samples.

In general, bbnorm appears to slightly increase the number of "N"s in the consensus sequence (14 -> 111 for 3540826), which reduces the number of variants observed (137 -> 132 for 3540826, per the `bcftools_variants_identified` column above). It does not seem to change the overall Freyja or Pangolin results, but key variants may end up missing.
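
For anyone who wants to repeat this comparison on their own runs, here is a minimal sketch. It assumes `cecret_results.txt` is tab-delimited with the column names shown in the header above. The helper names `load_results` and `compare_results` are made up for illustration and are not part of Cecret. The inline rows only carry the three columns of interest for 3540826, copied from the tables above.

```python
import csv
import io

# Columns of interest, named as in the cecret_results.txt header above.
COLUMNS = [
    "num_N",
    "bcftools_variants_identified",
    "samtools_meandepth_after_trimming",
]

def load_results(handle):
    """Read a tab-delimited results table into {sample_id: row_dict}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample_id"]: row for row in reader}

def compare_results(nonorm, norm, columns=COLUMNS):
    """Return {sample_id: {column: (without_norm, with_norm)}} for shared samples."""
    diffs = {}
    for sample_id in sorted(nonorm.keys() & norm.keys()):
        diffs[sample_id] = {
            col: (float(nonorm[sample_id][col]), float(norm[sample_id][col]))
            for col in columns
        }
    return diffs

# Trimmed-down stand-ins for nonorm/cecret_results.txt and cecret/cecret_results.txt.
HEADER = "sample_id\t" + "\t".join(COLUMNS) + "\n"
NONORM = HEADER + "3540826-UT-A01290-240207\t14\t137\t101836.0\n"
NORM = HEADER + "3540826-UT-A01290-240207\t111\t132\t325.301\n"

diffs = compare_results(
    load_results(io.StringIO(NONORM)),
    load_results(io.StringIO(NORM)),
)
for sample_id, cols in diffs.items():
    for col, (before, after) in cols.items():
        print(f"{sample_id}\t{col}\t{before:g} -> {after:g}")
```

To run it on real output, pass `open("nonorm/cecret_results.txt")` and `open("cecret/cecret_results.txt")` to `load_results` in place of the `io.StringIO` objects.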

It DOES speed up runtime, by a lot for samples with many reads.

These three samples without normalization: 1 h 44 m 5 s
These three samples with normalization: 21 m 7 s
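
To put a number on the speed-up, here is a quick back-of-the-envelope calculation from the two wall-clock times reported above. It covers only this one three-sample run, so treat the ratio as a rough indication rather than a benchmark.

```python
from datetime import timedelta

# Wall-clock times reported above for the same three samples.
without_norm = timedelta(hours=1, minutes=44, seconds=5)
with_norm = timedelta(minutes=21, seconds=7)

speedup = without_norm / with_norm  # timedelta / timedelta gives a float
saved = without_norm - with_norm    # timedelta

print(f"speed-up: {speedup:.1f}x, time saved: {saved}")
# prints: speed-up: 4.9x, time saved: 1:22:58
```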