dantaki / SV2

Support Vector Structural Variation Genotyper
58 stars 11 forks source link

Is combined multiple samples same as joint calling #25

Closed justme66 closed 5 years ago

justme66 commented 5 years ago

Hi Dan,

I used SV2 on a trio. When I genotyped father/mother/child samples one by one. I combined them as the genotype matrix, followed by stringent filter (de novo filter) and parse out de novo deletions and duplications. I knew in advance that there is a ~150kb de novo in the patient. But in the genotype matrix, this SV is genotyped as 0/1(child) ./.(father) ./.(mother). I checked the features txt file for each sample. On such ~150kb region, the child was of course extracted informative features, but the parents were operated smaller variants (<100bp), which make it be ./. I guess. So my questions are:

  1. "sv2 -b all-svs.bed -feats all_features/ -p my.ped -o joint_sv2_genotypes" is only combined samples based on their own feature information, not joint calling, right?
  2. Can I force SV2 to genotype specific regions in parents after I extracted features and genotyped SVs for one sample at a time?
  3. When I use "sv2 -b all-svs.bed -feats all_features/ -p my.ped -o joint_sv2_genotypes" command. There is one error reported as: File "/python/path/to/sv2/Vcf.py", line 128, in load_genotypes if (locus[0]=='chrX' or locus[0]=='chrY') and Ped.males.get(sample_id)!=None and Structural_Variant.par[locus]==False: gt='.' KeyError: ('chrX', 150533471, 150533482, 'DEL') That cannot output .vcf files. Is related with bed file or other inputs?

Thanks!

justme66 commented 5 years ago
  1. -merge option was used to merge breakpoint with n% reciprocal overlap, not SV A region and SV B region?
dantaki commented 5 years ago
  1. Yes you are correct, the command you provided in your question is useful if you want to combine features from the same SV in multiple people without accessing the BAM files

  2. Yes, this is how I typically run SV2. For a given family, I combine all the SV calls, I then pick the highest scoring SV for overlapping calls

  3. I think I know what is causing that error, but I want to make sure. Is the position for that SV correct? It looks like it's 11bp not 150kb. Can you please tell me the exact position of your de novo and the reference build so I can debug this?

  4. merging on reciprocal overlap of SVs uses the regions, so chr1:100-200 and chr1:80-200 would be merged into the SV region with the highest median ALT genotype likelihood. To be honest, merging SVs is an area of debate, I think it makes more sense to use CIPOS and CIEND to merge SVs.

I have a feeling that if you run feature extraction (pass only -pre sv2_preprocessing/) on the parents, this error might resolve.

justme66 commented 5 years ago

Thanks for your reply.

  1. This error happened in father not children. The SV position is correct and it's not on the ~150 de novo region. When I use the command: sv2 -i father.cram -v lumpy.vcf -snv father.vcf.gz -p trios.ped -o father -genome hg38 , it give the error: File "./sv2/Vcf.py", line 128, in load_genotypes if (locus[0]=='chrX' or locus[0]=='chrY') and Ped.males.get(sample_id)!=None and Structural_Variant.par[locus]==False: gt='.' KeyError: ('chrX', 137012104, 137012137, 'DEL'). If I change the sv calls from manta, it give the error: KeyError: ('chrX', 145935451, 145935454, 'DEL'). Both situation cannot get the vcf output, but have bed output. the reference build is hg38. I didn't filter the SV calls (from manta and lumpy) before I use SV2. Is that the reason?
  2. Thanks for you detailed example. So -merge option are similar as bedtools intersect -f 0.8 -r? Is that should be "regions with reciprocal overlap or greater to each other are merged" not "breakpoints with reciprocal overlap"?

Thanks!

dantaki commented 5 years ago

What version of python are you running? SV2 requires python 2.7

I ask because I cannot replicate your error. You can try changing that line to

 if (locus[0]=='chrX' or locus[0]=='chrY') and Ped.males.get(sample_id)!=None and Structural_Variant.par.get(locus)==False: gt='.'

and I think that should fix the Key Error.

However, SV2 first parses all SV input you provide, and for variants on the chrX and chrY chromosomes, it loads that Structural_Variant.par dict with either True or False entries (par is the pseudoautosomal regions)

So that means, any chrX or chrY variants you provide as input are in that dict.

  1. Yes it's similar but uses the highest median ALT genotype likelihood. A SV region is nothing without breakpoints; you cannot have variable reciprocal overlap with only breakpoints too. It's really semantics at this point, I chose to say "breakpoints" since regions can encompass non-SVs technically. So in the example above, the breakpoints are 100/200 and 80/200 (obviously in BED format you'll have 99-100: 199-200 :: 79-80:199-200, so how can you merge those intervals according to reciprocal overlap?)

Here's my test on a male individual


test.bed

chrX    137012104       137012137       DEL
chrX    145935451       145935454       DEL

command:

sv2 -i NA19239.alt_bwamem_GRCh38DH.20150715.YRI.high_coverage_markedDupsRemoved.bam -b test.bed -p test.ped -dels-only -g hg38 -pre prepro/

STDOUT:

sv2 version:1.4.4    report bugs to <dantaki at ucsd dot edu>       error messages located in STDOUT
************************************************
 PREPROCESSING COMPLETE <time elapsed: 0:00:01>
************************************************
*****************************************************
 FEATURE EXTRACTION COMPLETE <time elapsed: 0:00:04>
*****************************************************
*********************************************
 GENOTYPING COMPLETE <time elapsed: 0:00:35>
*********************************************

VCF output

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA19239
chrX    137012105       .       .       <DEL>   19      STR;FAIL        END=137012137;SVTYPE=DEL;SVLEN=33;DENOVO_FILTER=FAIL;REF_GTL=NA;AF=1.0;CYTOBAND=Xq26.3;REPEATMASKER=NA,0;1000G_ID=NA;1000G_OVERLAP=NA;DESCRIPTION=chrX:137012105-137012137_deletion;GENES=intergenic;ABPARTS=0;CENTROMERE=0;GAP=0;SEGDUP=0;STR=0.97;UNMAPPABLE=0        GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL   1:0.00:0.00:0.00:nan:0:nan:0:19:0.01,0.99
chrX    145935452       .       .       <DEL>   19      FAIL    END=145935454;SVTYPE=DEL;SVLEN=3;DENOVO_FILTER=FAIL;REF_GTL=NA;AF=1.0;CYTOBAND=Xq27.3;REPEATMASKER=NA,0;1000G_ID=NA;1000G_OVERLAP=NA;DESCRIPTION=chrX:145935452-145935454_deletion;GENES=intergenic;ABPARTS=0;CENTROMERE=0;GAP=0;SEGDUP=0;STR=0;UNMAPPABLE=0    GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL   1:0.00:0.00:0.00:nan:0:nan:0:19:0.01,0.99
dantaki commented 5 years ago

paste in the entire SV2 command and your version of python and SV2. If your version is not version 1.4.4 I would uninstall sv2 and reinstall it with pip

dantaki commented 5 years ago

Also SV2 is designed for variants >50bp and performs best for variants >150bp. I would remove those small variants since they are really INDELs (2-49bp). Also if a variant completely overlaps a short tandem repeat (STR), segmental duplication, and/or assembly gaps, I would remove it because SV2 won't genotype them anyway.

justme66 commented 5 years ago

Thanks a lot, Dan. The version of python and SV2 are 2.7 and 1.4.3.4.

dantaki commented 5 years ago

Try to update SV2 and see if that resolves the error

WeiCSong commented 4 years ago

Hi, may i ask how to "force SV2 to genotype specific regions in parents after I extracted features and genotyped SVs for one sample at a time"? Thanks for your help.

dantaki commented 4 years ago

You can provide the regions in BED format but you will need to extract features and genotype those new regions.