artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License
110 stars 69 forks source link

SNP filtering/masking/soft clipping with medaka #92

Open shimbalama opened 2 years ago

shimbalama commented 2 years ago

Hi - I have 2 SNP sites in a Sars Cov2 BAM that have become N in the resulting consensus sequence and I was hoping you could help me understand why. I am using V3 primer set. After reading Quick J et al. 2017 and these docs https://artic.readthedocs.io/en/latest/minion/ . My understanding of this paragraph - 'By softmasking, we refer to the process of adjusting the CIGAR of each alignment segment such that soft clips replace any reference or query consuming operations in regions of an alignment that fall outside of primer boundaries. The leftmost mapping position of alignments are also updated during softmasking.' - is that if the sequence in a read is meant to be excluded from contributing it will be soft clipped. This fits with what I see in BAMs in IGV. The following two SNP sites do not fit with my understanding of how this is meant to work:

Position 28111 which is found between - but not contained within - V3 primers sites 93_LEFT and 92_RIGHT. 93_LEFT ends at 28105 so it is 6 bp away.

VCFs:

grep 28111 *vcf
2270.fail.vcf:MN908947.3    28111   .   A   G   30.82   PASS    DP=4;AC=0,4;AM=0;MC=0;MF=0.0;MB=0.0;AQ=16.84;GM=1;PH=6.02,6.02,6.02,6.02;SC=None;   GT:GQ:PS:UG:UQ  1/1:9.19:.:1/1:9.19
2270.merged.vcf:MN908947.3  28111   .   A   G   30.82   PASS    DP=4;AC=0,4;AM=0;MC=0;MF=0.000;MB=0.000;AQ=16.84;GM=1;PH=6.02,6.02,6.02,6.02;SC=None;   GT:GQ:PS:UG:UQ  1/1:9.19:.:1/1:9.19
2270.nCoV-2019_1.vcf:MN908947.3 28111   .   A   G   43.555  PASS        GT:GQ   1:44
2270.nCoV-2019_2.vcf:MN908947.3 28111   .   A   G   17.292  PASS        GT:GQ   1:17

depths

grep 28111 *depths
2270.coverage_mask.txt.nCoV-2019_1.depths:MN908947.3    nCoV-2019_1 28111   84
2270.coverage_mask.txt.nCoV-2019_2.depths:MN908947.3    nCoV-2019_2 28111   4

igv_snapshot_28111_2

RG1 is pink (ish) and all 84 reads look to have been filtered out. Not sure why. RG2 (blue) has 4 reads (only 2 show above as only 2 are from that primer site) and SNP A>G at 28111 is in fail VCF due to having DP 4.

The second SNP site i'm interested in is 1596

VCFs:

grep 1596 *vcf
2270.fail.vcf:MN908947.3    1596    .   A   G   275.91  PASS    DP=19;AC=0,18;AM=1;MC=0;MF=0.0;MB=0.0;AQ=13.78;GM=1;PH=6.02,6.02,6.02,6.02;SC=None; GT:GQ:PS:UG:UQ  1/1:48.9:.:1/1:48.9
2270.merged.vcf:MN908947.3  1596    .   A   G   275.91  PASS    DP=19;AC=0,18;AM=1;MC=0;MF=0.000;MB=0.000;AQ=13.78;GM=1;PH=6.02,6.02,6.02,6.02;SC=None; GT:GQ:PS:UG:UQ  1/1:48.90:.:1/1:48.90
2270.nCoV-2019_1.vcf:MN908947.3 1596    .   A   G   44.158  PASS        GT:GQ   1:44
2270.nCoV-2019_2.vcf:MN908947.3 1596    .   A   G   36.182  PASS        GT:GQ   1:36

depths

grep 1596 *depths
2270.coverage_mask.txt.nCoV-2019_1.depths:MN908947.3    nCoV-2019_1 1596    19
2270.coverage_mask.txt.nCoV-2019_2.depths:MN908947.3    nCoV-2019_2 1596    31

igv_snapshot_1596

This one is at the edge of the soft clipping. All 31 RG2 (blue) reads are filtered out by something leaving 19 RG1 (pink) and thus and N in the consensus sequence.

Is this behavior intended? Can you please tell me why one RG is being filtered out in this region of overlap between primer sites?

Kind Regards, Liam