Possible bug in filtering of indels

gevro commented 6 months ago

Hi, In recent samples analyzed with v3.5.5, we see very high indel burdens.

I found several examples of indels that overlap the NOISE mask, but were not filtered out - they still show up with FILTER=PASS.

$ tabix NOISE.sorted.GRCh38.bed.gz chr1:43861432-43861436
chr1 43861434 43861453

In results.indel.vcf.gz:
chr1 43861434 . GCTGTCAGGACTTGTATAGA G 225.417 PASS INDEL;IDV=8;IMF=1;DP=8;VDB=1.33331e-05;SGB=-0.651104;MQSBZ=0;BQBZ=0;MQ0F=0;AC=1;AN=1;DP4=0,0,8,0;MQ=60;QPOS=97;RB=chr1,43861337,43861899,GGC,TAT;BBEG=43861337;BEND=43861899;DEPTH_FWD=5;DEPTH_REV=3;DEPTH_NORM_FWD=12;DEPTH_NORM_REV=12;DPLX_ASXS=119;DPLX_CLIP=0;DPLX_NM=19;BULK_ASXS=90;BULK_NM=0;NN=[0:118:0];SEQ=GGTGTGGAGCTGTCAGGA GT:PL:DP:DV:SP:DP4 1:255,0:8:8:0:0,0,8,0
chr1 43861434 . GCTGTCAGGACTTGTATAGA G 225.417 PASS INDEL;IDV=8;IMF=1;DP=8;VDB=0.125998;SGB=-0.651104;MQSBZ=0;BQBZ=0;MQ0F=0;AC=1;AN=1;DP4=0,0,4,4;MQ=60;QPOS=97;RB=chr1,43861337,43861546,AAG,TGA;BBEG=43861337;BEND=43861546;DEPTH_FWD=4;DEPTH_REV=4;DEPTH_NORM_FWD=12;DEPTH_NORM_REV=12;DPLX_ASXS=109;DPLX_CLIP=0;DPLX_NM=19;BULK_ASXS=90;BULK_NM=0;NN=[0:118:0];SEQ=GGTGTGGAGCTGTCAGGA GT:PL:DP:DV:SP:DP4 1:255,0:8:8:0:0,0,4,4

I am confident that I ran the pipeline with NOISE.sorted.GRCh38.bed.gz.

Is there a possible bug in applying the NOISE (and/or SNP) masks to indel results?

Thanks

gevro commented 6 months ago

Also, when you all created the human SNP filters, does it also include indels above a certain population AF threshold, or only SNPs? Per your paper you include population frequent indels that do not have the PASS flag in gnomad, but do you include indels that DO have the PASS flag in the SNP mask?

gevro commented 6 months ago

Just to add further confirmation: within that same sample there is an indel that was MASKED properly by the NOISE mask:

$ tabix NOISE.sorted.GRCh38.bed.gz chr1:4304586-4304592
chr1 4304586 4304598

In results.indel.vcf.gz:
chr1 4304586 . CGTGTGTGTGTGT C 225.417 MASKED INDEL;IDV=11;IMF=1;DP=11;VDB=2.26006e-08;SGB=-0.676189;MQSBZ=0;MQ0F=0;AC=1;AN=1;DP4=0,0,0,11;MQ=60;QPOS=58;RB=chr1,4304440,4304644,GAG,CTG;BBEG=4304440;BEND=4304644;DEPTH_FWD=5;DEPTH_REV=6;DEPTH_NORM_FWD=0;DEPTH_NORM_REV=15;DPLX_ASXS=84;DPLX_CLIP=0;DPLX_NM=14.2;BULK_ASXS=109;BULK_NM=1;NN=[0:209:0];SEQ=GATGTACACGTGTGTGTG GT:PL:DP:DV:SP:DP4 1:255,0:11:11:0:0,0,0,11

So this shows that the NOISE mask was configured correctly for the run. That leaves the question of why the indel above (chr1 43861434) was not masked.

gevro commented 6 months ago

Another strange thing -- another sample analyzed in the same batch has the same indel artifact and it was masked:

chr1 43861434 . GCTGTCAGGACTTGTATAGA G 225.417 MASKED INDEL;IDV=15;IMF=0.9375;DP=16;VDB=0.0148393;SGB=-0.689466;RPBZ=-1;MQBZ=0;MQSBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AC=1;AN=1;DP4=0,0,8,8;MQ=60;QPOS=97;RB=chr1,43861337,43861545,AAG,GTG;BBEG=43861337;BEND=43861545;DEPTH_FWD=10;DEPTH_REV=6;DEPTH_NORM_FWD=13;DEPTH_NORM_REV=14;DPLX_ASXS=107;DPLX_CLIP=0;DPLX_NM=19.8;BULK_ASXS=109;BULK_NM=0;NN=[0:153:0];SEQ=GGTGTGGAGCTGTCAGGA GT:PL:DP:DV:SP:DP4 1:255,0:16:16:0:0,0,8,8

So it isn't clear why in one sample it was correctly masked by the NOISE mask, and not in another sample. The code to run the analyses of these samples was the same.

The only thing I can think of is if there is something in the indel masking code that "rescues" an indel even though it is in a region of the masks.

fa8sanger commented 6 months ago

Hi, that’s a rare but possible thing to happen. It’s because of how the pipeline counts how many of the indel sites are masked. To count them, it relies on information from duplex calls (those passing all filters), which come with the masked/unmasked tag. The same indel may come with less information in one sample, and in that case the masked count may not reach the threshold. It’s a bit intricate and took me some time to understand. The ideal solution would be to load the mask information directly rather than relying on the duplex call information, but this is how it works at the moment, and from what I’ve seen it is a very rare undesired result. I may implement the ideal solution at some point.

Thanks for reporting this


The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.

fa8sanger commented 6 months ago

A third solution is to do some extra filtering a posteriori, intersecting your indels with the masks and determining whether they meet the threshold or not. From what I’ve seen with thousands of indels this is a very rare occurrence though.
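The post-hoc intersection suggested here could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the mask is assumed to be a set of non-overlapping 0-based half-open BED intervals held in memory, and the pipeline's actual masking threshold is not given in this thread, so the sketch only reports the masked fraction and leaves the cutoff to the user.

```python
from typing import Dict, List, Tuple

Mask = Dict[str, List[Tuple[int, int]]]  # chrom -> 0-based half-open BED intervals


def vcf_span_0based(pos: int, ref: str) -> Tuple[int, int]:
    """Interval covered by REF (anchor base included), as 0-based half-open."""
    start = pos - 1              # VCF POS is 1-based
    return start, start + len(ref)


def masked_fraction(chrom: str, pos: int, ref: str, mask: Mask) -> float:
    """Fraction of the REF span that falls inside the mask.

    Assumes the intervals for each chromosome do not overlap one another.
    """
    start, end = vcf_span_0based(pos, ref)
    covered = 0
    for s, e in mask.get(chrom, []):
        covered += max(0, min(end, e) - max(start, s))
    return covered / (end - start)


# Mask interval and indel call quoted earlier in this thread
noise = {"chr1": [(43861434, 43861453)]}
frac = masked_fraction("chr1", 43861434, "GCTGTCAGGACTTGTATAGA", noise)
print(f"masked fraction: {frac:.2f}")
```

For whole-genome masks, an interval index or a tool such as bedtools would be more practical than the linear scan used here.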


gevro commented 6 months ago

Thanks, however, I'm seeing a very large number of these examples to the point that artifacts outnumber real indels. What are the key parameters or lines of code that do this? And what was the motivation to make it more intricate versus filtering out those regions directly?

gevro commented 6 months ago

Also, is this issue only relevant for indel filtering or also snv filtering?

gevro commented 6 months ago

Sorry for all the questions, but also, is there anything in the final VCF results that can allow me to determine which of these indel sites had this issue versus not? Or is this only determinable internally at the time of filtering?

fa8sanger commented 6 months ago

As I suggested you can intersect with the SNP & NOISE masks to evaluate the overlap


fa8sanger commented 6 months ago

Do you mean those artifacts are caused by this rare circumstance? I wouldn’t expect so.

The reasons… no particular one, I may change that in the future.


fa8sanger commented 6 months ago

It doesn’t affect snvs


gevro commented 6 months ago

"Do you mean those artifacts are caused by this rare circumstance? I wouldn’t expect so"

Yes, I'm seeing quite a few of these artifacts. It doesn't seem that rare. In the meantime I will manually intersect the indel calls with the SNP and NOISE masks again post-analysis. I'm not sure I understand why it would be a more prevalent issue in some samples than in others.

gevro commented 6 months ago

Hi, I think I found a possible contributor to the higher than expected indel rates. I think there might be a coordinate bug in your NOISE.sorted.GRCh38.bed.gz file.

I am seeing for example many of this indel artifact:

  CHROM     POS   ID REF ALT    QUAL FILTER
1  chr1 5287895 <NA>   G  GC 92.4151   PASS
2  chr1 5287895 <NA>   G  GC 108.415   PASS
3  chr1 5287895 <NA>   G  GC 130.416   PASS
4  chr1 5287895 <NA>   G  GC 173.416   PASS
                                                                                                                                                                                                                                                                                                                                      INFO
1       INDEL;IDV=4;IMF=1;DP=4;VDB=0.0058656;SGB=-0.556411;MQSBZ=0;BQBZ=0;MQ0F=0;AC=1;AN=1;DP4=0,0,4,0;MQ=60;QPOS=121;RB=chr1,5287774,5288396,ATG,AAT;BBEG=5287774;BEND=5288396;DEPTH_FWD=2;DEPTH_REV=2;DEPTH_NORM_FWD=15;DEPTH_NORM_REV=0;DPLX_ASXS=115;DPLX_CLIP=0;DPLX_NM=1;BULK_ASXS=101;BULK_NM=0;NN=[0:175:0];SEQ=TTGGCAGTGAGCTGCTGT
2           INDEL;IDV=5;IMF=1;DP=5;VDB=0.00187095;SGB=-0.590765;BQBZ=-2;MQ0F=0;AC=1;AN=1;DP4=0,0,5,0;MQ=60;QPOS=122;RB=chr1,5287773,5288395,TAA,CCC;BBEG=5287773;BEND=5288395;DEPTH_FWD=2;DEPTH_REV=3;DEPTH_NORM_FWD=15;DEPTH_NORM_REV=0;DPLX_ASXS=109;DPLX_CLIP=0;DPLX_NM=2.3;BULK_ASXS=101;BULK_NM=0;NN=[0:175:0];SEQ=TTGGCAGTGAGCTGCTGT
3     INDEL;IDV=7;IMF=1;DP=7;VDB=6.71664e-05;SGB=-0.636426;BQBZ=-1.1547;MQ0F=0;AC=1;AN=1;DP4=0,0,7,0;MQ=60;QPOS=122;RB=chr1,5287773,5288537,TAT,CCT;BBEG=5287773;BEND=5288537;DEPTH_FWD=4;DEPTH_REV=3;DEPTH_NORM_FWD=15;DEPTH_NORM_REV=0;DPLX_ASXS=104;DPLX_CLIP=0;DPLX_NM=3.3;BULK_ASXS=101;BULK_NM=0;NN=[0:175:0];SEQ=TTGGCAGTGAGCTGCTGT
4 INDEL;IDV=13;IMF=1;DP=13;VDB=2.26006e-08;SGB=-0.683931;BQBZ=-1.96666;MQ0F=0;AC=1;AN=1;DP4=0,0,13,0;MQ=60;QPOS=121;RB=chr1,5287774,5288395,AAG,TGG;BBEG=5287774;BEND=5288395;DEPTH_FWD=9;DEPTH_REV=4;DEPTH_NORM_FWD=15;DEPTH_NORM_REV=0;DPLX_ASXS=112;DPLX_CLIP=0;DPLX_NM=1.5;BULK_ASXS=101;BULK_NM=0;NN=[0:175:0];SEQ=TTGGCAGTGAGCTGCTGT

This artifact is apparent in gnomAD: https://gnomad.broadinstitute.org/variant/1-5287895-G-GC?dataset=gnomad_r3

Note that the position in 1-based coordinates in gnomAD is: chr1:5287895

However, in NOISE.sorted.GRCh38.bed.gz, the coordinates are: chr1 5287895 5287896

And it is not in SNP.sorted.GRCh38.bed.gz

So there might be a coordinate bug in the code that generated the NOISE mask. Maybe also in the SNP mask for some subset of it?

It also depends how your NanoSeq pipeline interprets these masks -- I'm assuming as 0-based coordinate BED files? If so, then only the NOISE mask needs to be fixed. Are you able to share the code that generated the NOISE mask, and I can try to help fix the bug to generate a new one?

Thanks

gevro commented 6 months ago

Addendum: I'm not 100% sure but I think this coordinate bug might only be happening with NOISE mask sites that correspond to indels in gnomAD. And not FILTERED gnomAD substitution sites.

I wonder if maybe in your code that generated the NOISE mask for gnomAD indels, that you set the coordinate to the base after the REF position. For example for the gnomAD indel site above of: chr1:5287895 G > GC, you set it to chr1:5287896. And then in the NOISE bed file it is set to chr1 5287895 5287896.

However, the NanoSeq pipeline calls this indel with POS = 5287895, and then your NanoSeq pipeline doesn't know which of the NOISE mask sites are indels, so it doesn't filter the call.

I'm assuming the solution would be to instead put the gnomAD indel-related NOISE mask sites using the position corresponding to the REF POS, i.e. in the above situation as chr1 5287894 5287895 in BED coordinates, which corresponds to chr1:5287895 in 1-based coordinates.

I can fix this if you send me the code that generated the GRCh38 NOISE mask.

Thanks
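The coordinate reasoning above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and it assumes the mask is read as a standard 0-based, half-open BED file, which is itself an open question in this thread.

```python
from typing import Tuple


def anchor_to_bed(pos: int) -> Tuple[int, int]:
    """BED interval (0-based, half-open) covering the VCF anchor base at 1-based POS."""
    return pos - 1, pos


def bed_covers_pos(start: int, end: int, pos: int) -> bool:
    """True if the BED interval [start, end) contains the 1-based position pos."""
    return start <= pos - 1 < end


pos = 5287895                          # gnomAD insertion chr1:5287895 G>GC
in_current_mask = (5287895, 5287896)   # interval found in NOISE.sorted.GRCh38.bed.gz
proposed = anchor_to_bed(pos)          # interval proposed above

print(proposed)
print(bed_covers_pos(*in_current_mask, pos))
print(bed_covers_pos(*proposed, pos))
```

Under the half-open BED assumption, the interval currently in the mask covers the base after the anchor, while the proposed interval covers the anchor base that the indel call reports as POS.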

fa8sanger commented 6 months ago

I am attaching two files, one explaining how the masks were built and one with the perl code used:

get_subs_above_AF_v3.pl.txt
GENERATION OF THE SNP MASK FOR BOTSEQ.txt
get_subs_above_AF_v3.GRCh38.pl.txt

fa8sanger commented 6 months ago

Let me know if you find a problem, please. I added "txt" to the perl scripts to be able to attach them here

fa8sanger commented 6 months ago

I think you are right, I may have not handled insertions properly for the masks

gevro commented 6 months ago

I think I see where the bug is. But to fix it, it would help me to understand how the nanoseq pipeline internally represents the insertion and deletion coordinates against which the masks are applied. For example, for deletions in VCF coordinates, the base at POS is not actually deleted. Does the nanoseq pipeline represent coordinates the same way, or does it represent only the deleted bases?

And for insertions in the nanoseq pipeline do you use the base in the reference that is before the insertion to represent the insertion, similar to how VCFs POS value represents insertions?

fa8sanger commented 6 months ago

Unless it's urgent for you, leave this with me. I'll investigate this a bit more to check coordinates are always handled the same way, and then I'll generate those masks again

gevro commented 6 months ago

Ok thanks. Regardless of the bug fix, I'm curious about the answers to these two questions for future reference:

1) In the nanoseq pipeline internally, how are deletion coordinates represented at the moment of NOISE masking: in the same way as in VCF POS format (i.e., the REF base before the deletion), or as the actual deleted bases?
2) In the nanoseq pipeline internally, are insertion coordinates specified at the moment of NOISE masking in the same way as in VCF POS format, i.e. the REF base preceding the inserted bases?

Note also that it looks like the hg38 NOISE mask is much smaller in terms of total # of bases than the hg19 NOISE mask, so I wonder if some other component is missing in hg38.

fa8sanger commented 6 months ago

In the nanoseq pipeline each of the deleted bases would be called separately. Regarding insertion coordinates, I'd like to investigate this more

gevro commented 6 months ago

Ok, sure. Let me know if I can help. Not to bias you as you investigate, but just for my reference, my guess is these lines should be start = pos - 1 and end = pos, instead of the current:

            } elsif(length($alt) > length($ref)) {
                $type  = "ins";
                $start = $pos;
                $end   = $pos+1;

fa8sanger commented 6 months ago

I've uploaded the new masks to the usual repository, within a directory named New_5thJune2024 (link).

I opted for keeping the previously masked base and adding the previous one. Hence the current masks now mask an additional 169,913 bps

gevro commented 6 months ago

Thanks very much. Just a few questions:

1) Confirming this bug only affected the NOISE mask gnomad insertions, specifically they were missing the base before the indel?
2) Why is the # of bases covered by the NOISE mask for hg38 = 10348611 and for hg19 = 22474160?

fa8sanger commented 6 months ago

1) Yes, I confirm that.
2) For GRCh37 we also included information from our available panel of normals: "The noise mask also contains sites with elevated error-rates. For each genomic position, error-rates were estimated for each site using the fraction of mismatched bases across a panel of 448 in-house sequenced samples. Sites with error-rates > 0.01 were incorporated into the noise mask". We didn't have this for GRCh38.
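For anyone wanting to reproduce the base counts compared above, a minimal sketch of counting the bases covered by a BED mask follows. The function name is hypothetical, and it assumes 0-based half-open intervals; overlapping or adjacent intervals are merged so that no base is counted twice.

```python
from collections import defaultdict
from typing import Iterable


def covered_bases(bed_lines: Iterable[str]) -> int:
    """Total number of distinct bases covered by the BED intervals."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        # Skip blank lines and common BED header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for s, e in intervals[1:]:
            if s <= cur_end:                 # overlapping or adjacent: extend
                cur_end = max(cur_end, e)
            else:                            # gap: close the current run
                total += cur_end - cur_start
                cur_start, cur_end = s, e
        total += cur_end - cur_start
    return total


# Toy example; for a real mask, pass an open file handle instead, e.g.
#   with gzip.open("NOISE.sorted.GRCh38.bed.gz", "rt") as fh: covered_bases(fh)
toy = ["chr1\t10\t20", "chr1\t15\t30", "chr2\t0\t5"]
print(covered_bases(toy))
```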

gevro commented 5 months ago

Hi, I think I found another bug for the NOISE mask filtering. It looks like deletions are not being filtered properly.

Take for example this gnomad vcf deletion variant: chr17 76924456 . GGTGGGAAGCAGGTAACCA G https://gnomad.broadinstitute.org/variant/17-76924456-GGTGGGAAGCAGGTAACCA-G?dataset=gnomad_r3

This was not filtered even though the NOISE mask bed file had this: chr17 76924456 76924474

Specifically, the deletion in the NOISE mask bed file includes only the deleted bases, i.e. it spans chr17 76924456 76924474 in 0-based coordinates, which is chr17:76924457-76924474 in 1-based coordinates. Therefore, the NOISE mask does not include the 1-based coordinate chr17:76924456.

This suggests that the Nanoseq pipeline is not using NOISE mask regions properly to filter deletions.

An easy fix is to include in the NOISE mask for deletions also the preceding base. Though technically Nanoseq should have filtered these deletions.

Thanks
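The deletion case above, and the suggested workaround of adding the preceding base, can be sketched as follows. The function names are hypothetical, the BED interval is assumed to be 0-based and half-open, and this says nothing about how the pipeline itself applies the mask.

```python
from typing import Tuple


def pad_upstream(start: int, end: int, pad: int = 1) -> Tuple[int, int]:
    """Extend a 0-based half-open BED interval upstream by pad bases, never below 0."""
    return max(0, start - pad), end


def covers_vcf_pos(start: int, end: int, pos: int) -> bool:
    """True if the BED interval [start, end) contains the 1-based position pos."""
    return start <= pos - 1 < end


vcf_pos = 76924456                       # POS of the gnomAD deletion on chr17
deleted_only = (76924456, 76924474)      # interval currently in the NOISE mask
with_anchor = pad_upstream(*deleted_only)

print(covers_vcf_pos(*deleted_only, vcf_pos))
print(covers_vcf_pos(*with_anchor, vcf_pos))
```

The unpadded interval spans only the 18 deleted bases, so the anchor base reported as POS falls just outside it; padding by one base upstream brings POS inside the interval.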

fa8sanger commented 5 months ago

This must be related to the infrequent problem that we discussed before. As discussed, a temporary solution is to remove those indels yourself by intersecting the final indel calls with the masks. It seems you have contamination in your DNA, and by calling more indels from the contaminant you are picking up more of these unusual cases.

I’d like to change the pipeline to load the SNP/NOISE masks on the perl calling script, not relying on passing information from the dsa tables as currently done.


gevro commented 5 months ago

Thanks. I'll try to rerun with one extra preceding base in the NOISE mask for deletions to confirm.

Regarding the possibility of contamination: we know the vast majority of these indel artifacts are not contamination due to our study design. We have a set of samples from different individuals, with two independent biological somatic samples from each individual and undiluted nanoseq germline from a different tissue than the somatic samples. We either see these indel artifacts only in one somatic sample, or often in both somatic samples of the same individual but not any of the samples of any of the other individuals. If this were contamination, it would be improbable that they tend to show up concordantly in both somatic samples of the same individual. Our FREEMIX values are also very low.

There are two explanations I can think of: 1) these are early mosaic variants that did not show up in the undiluted germline data of that individual. 2) I noticed these tend to happen in regions with lower germline data read depth, near our cutoff limit of 15 reads. Perhaps indels affect PCR efficiency so one allele is not amplified as well and stochastically will not show up in germline data regions with lower read depth. An analysis of the width of the VAF distribution of indels vs snps in germline data might support this, if I find time to do this.

fa8sanger commented 5 months ago

You are seeing a long indel seen in the human population at high AF. That has to be germline. If it’s not contamination perhaps you are picking germline variants from your donor. Since you mention that this happens in regions with lower coverage, it makes sense. Although a coverage of 15 should be enough to pick germline, with indels you may find unexpected behaviour (bwa sometimes prefers to introduce mismatches for indels near read ends, and for long indels it may prefer to do soft clipping… I’d check what’s happening in the matched normal with IGV)


gevro commented 5 months ago

Sorry for not clarifying - some of these indels that show up > 1 time per sample in both somatic samples of an individual are in gnomad, but many also are not. I agree the ones in gnomad must be germline. The ones that are not in gnomad I think could be either germline but not detected in the germline sample, or they are early developmental mosaics or later clonal expansions that are in reality not in the germline sample that is from a different tissue than the somatic samples.

I looked in IGV at some of the ones that are in gnomad, which should be germline, but I don't see any sign of them in the raw reads. The only explanation I can think of is maybe PCR slightly skewed against amplifying the indel-containing allele. If this is the case, the prediction would be that the VAF distribution of true gnomad/germline indels is wider or has a lower VAF tail compared to the SNV VAF distribution, and the fix would be requiring a slightly higher read depth for indels. I can keep investigating and will let you know if I figure anything else out.

gevro commented 5 months ago

Hi, Just an update - I finished the rerun of the nanoseq pipeline with two NOISE masks that differ as follows: 1) gnomad deletions do contain the non-deleted preceding REF base (i.e. the POS column position in the gnomad VCF) and 2) gnomad deletions that do not contain the preceding REF base, i.e. the BED regions span only the actual deleted bases.

NOISE mask (1) removed nanoseq deletion calls that correspond to the gnomad-annotated deletions but NOISE mask (2) did not. This indicates that somewhere in the Nanoseq pipeline, it is filtering out NOISE mask deletions in a way that requires the NOISE mask to also include the preceding reference non-deleted base. Easy work-around is just using NOISE mask (1) either during the nanoseq pipeline or for post-pipeline filtering, but wanted to let you know since in your next iteration of NOISE mask filtering, this would probably be good to adjust.
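To make the difference between the two masks concrete, here is a minimal Python sketch of how a left-anchored VCF deletion maps to a 0-based, half-open BED interval with and without the preceding REF base. The function name `deletion_to_bed` and its `include_anchor` flag are hypothetical and are not part of NanoSeq or any gnomad tooling. It assumes a simple deletion where ALT is the single anchor base, and uses the long deletion reported at the start of this thread as input.

```python
def deletion_to_bed(chrom, pos, ref, alt, include_anchor):
    """Convert a left-anchored VCF deletion to a 0-based, half-open BED interval.

    pos is the 1-based VCF POS, i.e. the position of the anchor base kept in ALT.
    """
    if not (len(alt) == 1 and len(ref) > 1 and ref[0] == alt):
        raise ValueError("expected a simple deletion with a single anchor base")
    # The anchor base at 1-based pos is the 0-based interval [pos - 1, pos).
    # The deleted bases are 1-based pos+1 .. pos+len(ref)-1,
    # i.e. the 0-based interval [pos, pos + len(ref) - 1).
    start = pos - 1 if include_anchor else pos
    end = pos + len(ref) - 1
    return (chrom, start, end)


# Hypothetical variable names; the record is the chr1:43861434 deletion from this thread.
mask1_style = deletion_to_bed("chr1", 43861434, "GCTGTCAGGACTTGTATAGA", "G", include_anchor=True)
mask2_style = deletion_to_bed("chr1", 43861434, "GCTGTCAGGACTTGTATAGA", "G", include_anchor=False)

print(mask1_style)  # ('chr1', 43861433, 43861453)
print(mask2_style)  # ('chr1', 43861434, 43861453)
```

The two intervals differ only in their start coordinate, so any mask built in style (1) is a strict superset of the corresponding style (2) region.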

Thanks!

fa8sanger commented 5 months ago

That’s surprising and I find it hard to understand. I’ve run many analyses and never noticed a problem with deletions (other than the other issue discussed). For the long deletion that you submitted, adding an extra base should have had no influence, as there were many bases for which a masked position would be counted. Did NOISE (1) remove 1-base deletions as well?

fa8sanger commented 5 months ago

Would you be able to send me some of those indels and the NOISE(1) mask to do some tests with bedtools?

gevro commented 5 months ago

I also found an example of a one-base deletion that was removed by NOISE mask (1) but not by NOISE mask (2).

Indel present when using NOISE mask (2) but absent with NOISE mask (1): chr1 173705618 . GT G 67.4148 PASS INDEL;IDV=4;IMF=1;DP=4;VDB=0.0058656;SGB=-0.556411;BQBZ=-1.41421;MQ0F=0;AC=1;AN=1;DP4=0,0,0,4;MQ=60;QPOS=10;RB=chr1,173705035,173705628,GAG,ATT;BBEG=173705035;BEND=173705628;DEPTH_FWD=2;DEPTH_REV=2;DEPTH_NORM_FWD=0;DEPTH_NORM_REV=24;DPLX_ASXS=83;DPLX_CLIP=0;DPLX_NM=2.5;BULK_ASXS=94;BULK_NM=1;NN=[0:175:0];SEQ=TGCGCTGTGTTTGCACCA GT:PL:DP:DV:SP:DP4 1:97,0:4:4:0:0,0,0,4

NOISE mask (2) had this region: chr1 173705618 173705619 -> This should have filtered out the deletion, because the deleted 'T' base is chr1:173705619 in 1-based coordinates and is chr1 173705618 173705619 in 0-based BED coordinates.

NOISE mask (1) had this region: chr1 173705617 173705619 -> This filtered out the deletion, presumably because the nanoseq pipeline was looking at the non-deleted preceding 'G' reference base at position chr1:173705618 in 1-based coordinates, which is chr1 173705617 173705618 in 0-based coordinates.
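The coordinate arithmetic above can be checked with a small Python sketch. It compares the two mask intervals against two candidate query intervals for the GT>G call: the anchor base only, and the deleted base only. Which of these the NanoSeq pipeline actually queries is not established in this thread, so both are shown; the helper `overlaps` and the variable names are illustrative assumptions, not pipeline code.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two 0-based, half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


pos = 173705618                  # 1-based VCF POS of the GT>G call (anchor 'G')
anchor_only = (pos - 1, pos)     # 0-based interval covering the anchor base
deleted_only = (pos, pos + 1)    # 0-based interval covering the deleted 'T'

mask1 = (173705617, 173705619)   # NOISE mask (1): anchor base plus deleted base
mask2 = (173705618, 173705619)   # NOISE mask (2): deleted base only

results = {}
for name, mask in (("mask1", mask1), ("mask2", mask2)):
    results[name] = {
        "anchor_query": overlaps(*anchor_only, *mask),
        "deleted_base_query": overlaps(*deleted_only, *mask),
    }
    print(name, results[name])
```

Under these assumptions, only an anchor-only query misses mask (2) while still hitting mask (1). That matches the behaviour described above, but it does not by itself show that this is what the pipeline does.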

gnomad entry: https://gnomad.broadinstitute.org/variant/1-173705618-GT-G?dataset=gnomad_r3

Note, I'm experimenting with a different NOISE mask for hg38 than the one your code generates, since I wanted to use gnomad v3.1.2 and explicitly filter out some segdups and other noisy regions, and your 'shearwater' reference is not available for hg38. Regardless, the findings above still show some filtering issue.

I will send you the details by direct email.

fa8sanger commented 5 months ago

No need to send anything else, this is what I needed. The intervals are properly handled here, hence there should be no need to add an extra base to the NOISE mask.

Now, why this gets through NanoSeq is something that I don’t understand. NanoSeq relies on bcftools for the calling of indels, and bcftools generates calls in VCF format with the standard coordinates.

What may have happened is that the information about what matches the mask wasn’t received properly (the issue discussed in the past). Or there may be some bug in my perl code, but then I have run it on so many samples without problems...

gevro commented 5 months ago

Ok. Let me know if I can provide any other info.

fa8sanger commented 5 months ago

Thank you. I was thinking that it is very strange that you get so many indel calls without having contamination, with many of them overlapping gnomad indels. You mentioned that there were positions with germline indels that were absent in your matched normal. I wonder if there may be some problem with the matched normal: is it depleted of indels? Does it contain PCR read duplicates providing an inflated coverage? Was it aligned with bwa?

gevro commented 5 months ago

I am wondering too. Matched germline and somatic samples were all aligned in the exact same bwa pipeline, and all were sequenced on the same instrument type (Novaseq X). I see plenty of indels in the matched normal, and it's just a relatively small number (~10) of these that I see per sample. If there were some more radical depletion of indels in the matched normal, there would be many more somatic indel calls.

We actually check post-deduplication coverage on all our matched normals and they are fine: all are at least 30x post-deduplication, and usually much more. Regardless, the required minimum germline read depth in the pipeline should eliminate that as the reason.

It could also be that I'm noticing these because of our study design, in which we have more than one somatic sample from each person, which makes them easier to spot. The gnomad indel filtering bug also brought out more of them. It's possible that without those two things, I would never have noticed.
