PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite

GRIDSS should force DEL-with-INS alignment to prevent inconsistency between asm and direct read calls #622

Open RobinVanSchendel opened 1 year ago

RobinVanSchendel commented 1 year ago

Thanks for creating GRIDSS. I have been using it for a while now, and for me it is one of the better tools. I have occasionally noticed that events are not called correctly. This is a C. elegans sample, by the way:

G4_deletion

This event is called as:

CHROMOSOME_IV:13880132-13880159:

CHROMOSOME_IV 13880132 gridss5fb_128989o C C[CHROMOSOME_IV:13880159[ 1481.59 ASSEMBLY_ONLY ANRP=0;ANRPQ=0.00;ANSR=0;ANSRQ=0.00;AS=1;ASC=1X;ASQ=601.30;ASRP=6;ASSR=60;BA=0;BAQ=0.00;BASRP=0;BASSR=0;BEID=asm5-1336283,asm5-614184;BEIDH=0,0;BEIDL=0,0;BMQ=59.73;BMQN=40.00;BMQX=60.00;BQ=1953.34;BSC=73;BSCQ=1953.34;BUM=0;BUMQ=0.00;BVF=67;CAS=0;CASQ=0.00;CQ=1499.26;EVENT=gridss5fb_128989;IC=0;IHOMPOS=0,0;IQ=0.00;MATEID=gridss5fb_128989h;MQ=60.00;MQN=60.00;MQX=60.00;RAS=1;RASQ=880.29;REF=456;REFPAIR=315;RP=0;RPQ=0.00;SB=0.7266187;SC=1X;SR=0;SRQ=0.00;SVTYPE=BND;VF=45

There is another call, which to me looks like the actual event based on the coverage, but that one is flagged LOW_QUAL and NO_ASSEMBLY:

CHROMOSOME_IV 13880132 gridss5fb_128990o C CAAGCTCAGCAGGCTCCACCAGCCTG[CHROMOSOME_IV:13880211[ 468.52 LOW_QUAL;NO_ASSEMBLY ANRP=0;ANRPQ=0.00;ANSR=20;ANSRQ=468.52;AS=0;ASC=1X;ASQ=0.00;ASRP=0;ASSR=0;BA=0;BAQ=0.00;BASRP=0;BASSR=0;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;BVF=0;CAS=0;CASQ=0.00;CQ=459.98;EVENT=gridss5fb_128990;IC=0;IHOMPOS=0,0;IQ=0.00;MATEID=gridss5fb_128990h;MQ=60.00;MQN=60.00;MQX=60.00;RAS=0;RASQ=0.00;REF=456;REFPAIR=315;RP=0;RPQ=0.00;SB=0.35;SC=94M1X;SR=20;SRQ=468.52;SVTYPE=BND;VF=20

is this a software issue? or can I somehow know which event is the real event?

If you need more information, please let me know.

d-cameron commented 1 year ago

can I somehow know which event is the real event?

They're probably both the same event. The event is a 79bp deletion with 26bp of sequence inserted into the deletion.
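For anyone who wants to check the sizes directly from the two records, here is a minimal Python sketch that reads the `t[p[` breakend form used in both ALT fields. The function name and the returned dictionary are made up for illustration and are not part of GRIDSS. It assumes the VCF convention that the first ALT base is the anchoring reference base, so it counts only the bases strictly between the two breakend positions and only the ALT bases after the anchor. Counted that way the numbers come out one lower than the coordinate difference and the full ALT prefix length.

```python
import re

# Matches the "t[p[" breakend form: anchor base (plus any inserted bases)
# followed by the partner position in square brackets.
BND_FORWARD = re.compile(
    r"^(?P<seq>[ACGTN]+)\[(?P<chrom>[^:\[\]]+):(?P<pos>\d+)\[$"
)

def describe_forward_bnd(chrom, pos, alt):
    """Summarise a same-chromosome 't[p[' breakend as a deletion with insertion.

    Hypothetical helper for illustration only. `pos` is the 1-based POS
    column of the record and `alt` is its ALT column.
    """
    match = BND_FORWARD.match(alt)
    if match is None:
        raise ValueError(f"not a 't[p[' breakend: {alt!r}")
    if match.group("chrom") != chrom:
        raise ValueError("partner breakend is on a different contig")
    partner_pos = int(match.group("pos"))
    inserted = match.group("seq")[1:]      # drop the anchoring reference base
    deleted_bp = partner_pos - pos - 1     # bases strictly between the breakends
    return {
        "deleted_bp": deleted_bp,
        "inserted_seq": inserted,
        "inserted_bp": len(inserted),
    }

# The two calls from this thread.
asm_call = describe_forward_bnd(
    "CHROMOSOME_IV", 13880132, "C[CHROMOSOME_IV:13880159["
)
read_call = describe_forward_bnd(
    "CHROMOSOME_IV", 13880132,
    "CAAGCTCAGCAGGCTCCACCAGCCTG[CHROMOSOME_IV:13880211[",
)
print(asm_call)
print(read_call)
```

Both calls share the same left breakend; they differ only in where the right breakend lands and in whether any sequence is reported as inserted.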

Smith-Waterman alignment doesn't like reporting DEL-with-INS since DEL and INS have two independent gap opening penalties and there is almost always going to be some chance homology between the inserted sequence and the deleted sequence. SW alignment will preferentially break up a DEL-with-INS into a set of multiple smaller SNVs and indels.

can I somehow know which event is the real event?

Have a look at the actual alignment of asm5-1336283 and asm5-614184 in the assembly.bam.gridss.working/assembly.bam.sv.bam file. I expect what you'll see is that a) in addition to the deletion ending at position 13880159, you're going to have a bunch of additional SNVs and indels <32bp in the assembly alignments; and that b) the haplotype sequences for both the asm-based call (gridss5fb_128989) and the direct read alignment based call (gridss5fb_128990) will be the same.

is this a software issue?

Neither GRIDSS nor any other caller that I'm aware of performs haplotype-based deduplication of variant calls. It just so happens that this event is just the right size for the read alignments to get soft clipped instead of aligning across the deletion, but the longer flanking lengths of the assembly result in a spanning alignment, hence the two different calling conventions.
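To make the idea of haplotype-level comparison concrete, here is a toy Python sketch. It is not GRIDSS functionality; the reference string, the edit lists, and the helper names are all invented for illustration. It applies two differently represented sets of edits to the same short reference and checks whether they spell out the same haplotype.

```python
def apply_edits(reference, edits):
    """Apply non-overlapping edits to a reference string.

    Each edit is (start, end, inserted): the 0-based half-open interval
    reference[start:end] is replaced by `inserted`.
    """
    pieces = []
    cursor = 0
    for start, end, inserted in sorted(edits):
        if start < cursor:
            raise ValueError("edits overlap")
        pieces.append(reference[cursor:start])  # untouched reference bases
        pieces.append(inserted)                 # replacement sequence
        cursor = end
    pieces.append(reference[cursor:])
    return "".join(pieces)

def same_haplotype(reference, edits_a, edits_b):
    """True when both edit sets produce the same haplotype sequence."""
    return apply_edits(reference, edits_a) == apply_edits(reference, edits_b)

# Toy data (assumed, not taken from the sample in this thread).
TOY_REFERENCE = "GGGGACGTTACGCCCC"
# One 8 bp deletion with a 2 bp insertion ...
DEL_WITH_INS = [(4, 12, "TT")]
# ... versus two 3 bp deletions that keep the reference "TT" between them.
FRAGMENTED = [(4, 7, ""), (9, 12, "")]

print(apply_edits(TOY_REFERENCE, DEL_WITH_INS))
print(apply_edits(TOY_REFERENCE, FRAGMENTED))
print(same_haplotype(TOY_REFERENCE, DEL_WITH_INS, FRAGMENTED))
```

The inserted `TT` happens to match two bases inside the deleted stretch, which is the kind of chance homology that lets an aligner report the same haplotype as smaller separate deletions.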

TLDR: Smith-Waterman doesn't handle DEL-with-INS events and they get broken into smaller SNVs+indels when the flanking sequence is long enough.

RobinVanSchendel commented 1 year ago

Your reasoning seems to be entirely correct. The CIGAR strings for asm5-1336283 and asm5-614184 are:

171M26D2M3D5M14D3M5D3M5D5M3D3M3I300M

and

299M26D2M3D5M14D3M5D3M5D5M3D3M3I174M
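As a rough illustration of why an affine-gap aligner would prefer these fragmented CIGARs, the Python sketch below scores each one against a collapsed version in which everything between the flanking matches is merged into a single deletion plus a single insertion. The scoring values (match 1, gap open 6, gap extend 1, so a gap of length k costs 6 + k) are an assumption borrowed from bwa mem's documented defaults; the thread does not say which settings were used here. Every `M` base is treated as a match, so the fragmented score is a best case, and the helper names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MID])")

def parse_cigar(cigar):
    """Split a CIGAR made of M/I/D operations into (length, op) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR: {cigar!r}")
    return ops

def collapse(ops):
    """Merge everything between the two flanking M blocks into one D and one I."""
    first, last = ops[0], ops[-1]
    middle = ops[1:-1]
    ref_span = sum(n for n, op in middle if op in "MD")    # reference bases consumed
    query_span = sum(n for n, op in middle if op in "MI")  # assembly bases consumed
    return [first, (ref_span, "D"), (query_span, "I"), last]

def best_case_score(ops, match=1, gap_open=6, gap_extend=1):
    """Affine-gap score assuming every M base is a match (assumed parameters)."""
    score = 0
    for n, op in ops:
        if op == "M":
            score += n * match
        else:
            score -= gap_open + n * gap_extend
    return score

for cigar in (
    "171M26D2M3D5M14D3M5D3M5D5M3D3M3I300M",
    "299M26D2M3D5M14D3M5D3M5D5M3D3M3I174M",
):
    fragmented = parse_cigar(cigar)
    collapsed = collapse(fragmented)
    print(cigar, best_case_score(fragmented), best_case_score(collapsed), collapsed)
```

Under these assumptions the fragmented form scores higher than the collapsed one for both assemblies, because the short matched stretches earn points and each small gap is cheap to extend. Both forms remove the same net number of bases (53), so they describe the same change in length.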

The interesting part for us is that we know that certain DNA repair pathways specifically leave behind DEL + INS scars. We study the underlying biology, and we know that during repair of a DNA break those INS sequences are created and can be traced back to the junctions around the DEL. Unfortunately, such events are not always part of the output of CNV callers. For example, Delly and Smoove do not even bother with the INS part; they only output DEL.

Thanks for putting this on the enhancement list!