fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
344 stars 46 forks source link

Intrachromosomal BND to INV type? #120

Open alsanju opened 4 years ago

alsanju commented 4 years ago

Hi Fritz,

Thanks for the tool.

Is there an option for SURVIVOR not to change the SV type from BND to INV where both breakpoints are on the same chromosome?

Because some variant callers do not call SV types but breakpoints instead, and many of those are not inversions, I would prefer to assign the SV type myself depending on multiple calling evidence.

If not, I could do this by parsing the type from the original VCF, but the IDs of all the merged SVs are not kept in the sample column, only the coordinates (and the ID of the representative one). Having only coordinates is problematic when the SV was called at the exact same coordinates by multiple variant callers. Any ideas on that?

Thank you! Alba

fritzsedlazeck commented 4 years ago

Hi Alba, Right now not really.

Yeah, I see the problem with that. The reason I did that was that I used Lumpy and others which reports BND for either INV or TRA.. to enable a comparison to other SV callers I need to guess the type behind BND. Thus, the parser already makes this guess.. Thus, I dont quite know how to resolve this by still allowing to have a type sensitive merge...

I would be open to suggestions Thanks Fritz

alsanju commented 4 years ago

Hi Fritz,

Thanks for the quick reply!

I see your point. Maybe intrachromosomal BND could be defined as INV if no other merged SVs support a different type?

Maybe an example would also help. These are the calls before the intra-sample merge of a specific duplication (data from short and long read sequencing):

X 37362373 X_37362373_X_37675672_INV_sniffles N <INV> . PASS CHR2=X;END=37675672;RE=0;IMPRECISE;SVLEN=313299;REF_strand=0,0;SVMETHOD=Snifflesv1.0.11;SVTYPE=INV;SEQ;SUPTYPE=input;STRANDS=-+;AF=0 GT:DR:DV ./.:0:0 X 37362374 X_37362374_X_37675681_DUP_Manta C <DUP:TANDEM> 185 MGE10kb SVTYPE=DUP;SVLEN=313307;END=37675681;SVINSLEN=6;SVINSSEQ=AAGTCA;PAIR_COUNT=11;UPSTREAM_PAIR_COUNT=14;DOWNSTREAM_PAIR_COUNT=12;ensembl_gene_id=ENSG00000130962,ENSG00000250349,ENSG00000241607,ENSG00000219186,ENSG00000234355,ENSG00000174678,ENSG00000147036;ColocalizedCanvas GT:GQ:PR:SR 0/1:185:48,11:43,8 X 37362374 X_37362374_X_37675671_DUP_nanosv C <DUP> 1027 MapQual IMPRECISE;SVTYPE=DUP;SVMETHOD=NanoSV_v1.2.4;END=37675671;CIPOS=0,30;CIEND=-14,0;SVLEN=313297;RT=0;GAP=24;MAPQ=60,60;PID=0.936,0.939;PLENGTH=0.518,0.323;RLENGTH=12966;ALT_READ_IDS=a788c243-b362-43a6-b4d8-b3d61e37d5a7....;DEPTHPVAL=0 GT:DR:DV:GQ:HR:PL 0/1:42,47:24,24:499:0,0:1027,0,499 X 37362379 X_37362379_X_37675668_BND_svim N ]X:37675668]N 29 PASS SVTYPE=BND;SUPPORT=24;STD_POS1=9;STD_POS2=4 GT:DP:AD ./.:.:.,. X 37362414 X_37362414_X_37675252_DUP_Canvas N <CNV> 61 PASS SVTYPE=DUP;END=37675252;ensembl_gene_id=ENSG00000130962,ENSG00000250349,ENSG00000241607,ENSG00000219186,ENSG00000234355,ENSG00000174678,ENSG00000147036 GT:RC:BC:CN 0/1:146:251:3

I am merging with SURVIVOR Version: 1.0.7, command SURVIVOR merge samples.fofn 500 1 -1 -1 -1 -1 output.vcf. My output is:

X 37362373 X_37362414_X_37675252_DUP_Canvas N ]X:37675668]N 1027 PASS SUPP=1;SUPP_VEC=1;SVLEN=313307;SVTYPE=INV;SVMETHOD=SURVIVOR1.0.7;CHR2=X;END=37675672;CIPOS=0,41;CIEND=-420,9;STRANDS=-+ GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:NA:313307:42,24:-+:.,185,1027,29,61:INV,DUP,DUP,INV,DUP:X_37362414_X_37675252_DUP_Canvas:NA:NA:X_37362373-X_37675672,X_37362374-X_37675681,X_37362374-X_37675671,X_37362379-X_37675668,X_37362414-X_37675252

So, the SV type in the output is an INV, when this is actually not right and there are 3 callers supporting the DUP and only 1 the INV. I guess that's because the INV is the type of the representative call?

For that reason, I wanted to manually assign the SV based on specific criteria (considering the number of variant callers supporting each type and prioritizing some callers over others for specific types of SVs).

However, I have trouble when parsing back the SVs because the IDs are not kept in the output, only the coordinates, and the type of the BNDs have been changed to INV. Would it be possible somehow to get that information?

Also, related to the merge example from above, I have two additional questions:

1) It seems that the representative call picked is the INV from sniffles, when that's actually not the median start location. Is that because sniffles calls are prioritised over others? 2) Why are the ID, ALT and QUAL field coming from different SVs and not the representative one?

Thank you very much!! Alba

fritzsedlazeck commented 4 years ago

yes. So what happens is that all the calls getting sorted by start location and the median start location variant are getting picked. This one gets reported as the representative. I thought of a consensus way over breakpoints/types, but then you have the risk of ending up with an event that has not been observed.

The ID from the previous VCF file should be included in the genotype field (ID). Please let me know if not.

Ad1: yeah that is a bit strange. Is it a newer version that you are running from the git?

Ad2: a) For qual, I am reporting the maximum quality since many SV callers don't report quality at all. b) For the ID: if it has a - its ignored (some historical reasons don't remember quite frankly) and we are merging ID's if there are multiple entries. c) ALT: same as ID.

happy to take a look if you could send these VCF files over. Thanks Fritz

alsanju commented 4 years ago

I have just sent you an email with the VCFs.

After the merge, only the ID for the representative call is kept in the sample field, and quality (QV), type (TY) and coordinates (CO) are reported for all the merged calls.

Ad1: I am using SURVIVOR Version: 1.0.7 that I got from conda.

Ad2: a) makes sense b) and c) all my variants have an ID and ALT that is not - but are not being merged.

Thank you! Alba