dantaki / SV2

Support Vector Structural Variation Genotyper
58 stars 11 forks source link

sv2 python IndexError #7

Closed mesnger closed 6 years ago

mesnger commented 6 years ago

hi. Im trying to use this latest SVtool for my WGS sample. my version is sv2 1.4.3.2.

here is my script sv2 -i bwa.aligned.gatkbestpractice.bam -v lumpy.svtyper.vcf -snv haplotypecaller.vcf -p data.ped -g hg38 -merge -min-ovr 0.7 -L sv2.log -o sv2

I got the PREPROCESSING COMPLETE message but an error occurs soon after.

Traceback (most recent call last): File "/.../SV2/build/scripts-2.7/sv2", line 3, in <module> if __name__=='__main__': main() File "/.../python2.7/site-packages/sv2/sv2.py", line 204, in main extract_feats(bam,SV.raw,prefh,featfh,gen,pcrfree,legacy_m,Confs,tmp_dir) File "sv2/FeatureExtraction.pyx", line 20, in sv2.FeatureExtraction.extract_feats File "sv2/Snv.pyx", line 34, in sv2.Snv.extract_snv_features File "sv2/Snv.pyx", line 70, in sv2.Snv.c_extract_snv_features File "sv2/Snv.pyx", line 137, in sv2.Snv.tokenize_vcf IndexError: list index out of range

could it be a compatibility issue with other modules? can i get any help with this error?

dantaki commented 6 years ago

Is your -snv file bgzipped and tabixed? That would be the first thing to check

I think the error is coming from parsing out the heterozygous allele depths.

The Index Error indicates that the list of allele depths is shorter than the number of alleles for a variant. I imagine this happened because there is either an empty entry (like GT:AD:DP 0/1:.:.) or some formatting error in your VCF.

I wrote a fix but I would like for you to test it on your VCF. I tested it on the tutorial, and there were no errors. So I hope this will fix your error.

Please download the tarball attached in this comment. Uninstall SV2 and install the test version I supplied

$ pip uninstall sv2 -y
$ pip install sv2-1.4.3.2test.tar.gz

Follow the install instructions in the manual and please try running it again. sv2-1.4.3.2test.tar.gz

mesnger commented 6 years ago

thanks for the quick and helpful reply. i will test the new script, but will take a while as i am not the root.

further information of the -snv files is that i used GATK4 haplotypecaller with -ERC GVCF -G StandardHCAnnotation -G AS_StandardAnnotation -new-qual then used gvcftools to get variants only(i did not test if sv2 supports genotypeGVCF vcfs) and the format column had no AD, so i manually added appropritate AD header, format column, and AD value. so the vcf length might not be what you may have expected.

here is the GATK4.0.2.1 Haplotypecaller header for FORMAT

fileformat=VCFv4.2 ...

ALT=<ID=NON_REF,Description="Represe ...

FILTER=<ID=LowQual,Description="Low ...

FORMAT=<ID=DP,Number=1,Type=Integer, ...

FORMAT=<ID=GQ,Number=1,Type=Integer, ...

FORMAT=<ID=GT,Number=1,Type=String,D ...

FORMAT=<ID=MIN_DP,Number=1,Type=Inte ...

FORMAT=<ID=PGT,Number=1,Type=String, ...

FORMAT=<ID=PID,Number=1,Type=String, ...

FORMAT=<ID=PL,Number=G,Type=Integer, ...

FORMAT=<ID=SB,Number=4,Type=Integer, ...

GATKCommandLine=<ID=HaplotypeCaller, ...

i will try to change the options so if the AD column shows when omitting some of the HC options. and if not, i will also try GATK3.8 HC which definitely has the AD column for trying out while waiting for the root to install the test version of sv2.

i hope to reply the results soon.

dantaki commented 6 years ago

Thanks for the response.

SV2 requires VCFs to be bgzipped and tabixed. To genotype using the heterozygous allele ratio classifier for duplications, the FORMAT entry must have AD with each sample containing an AD value delimited by commas

chr1  100 ...   GT:AD:GQ  0/1:5,5:99   0/1:5,10:99    0/0:10,0:99

The allele depth entry for the samples must be in the same position as the FORMAT entry (index number 1 above)

If you edited the VCFs appropriately, then it should work. Unfortunately I'm not familiar with GATK4+ output and GVCFs. SV2 was not designed with GVCFs in mind, and uses the pysam library to parse VCFs. So if your version of pysam can handle GVCFs, then it's possible it might work (assuming the AD entries are formatted correctly)

Thanks for the response and keep me updated.

mesnger commented 6 years ago

yes. thankfully, my vcf is bgzipped, and tabix indexed.

i got a clue from your reply that the error might come from multi-allele sites, so i removed multi-allelic sites with bcftools. and rerun sv2. it seems to get pass the initial checking step. and is running for about 10mins currently.

i am quite new to GATK4 myself, so i will tweak more to find what option of GATK4 HC is supported by sv2.

dantaki commented 6 years ago

It does account for the multiallelic sites but it assumes the order is

REF,ALT,(ALT2),(ALT3)...

I've noticed that if you merged mutliple VCFs and split multiallelic sites with bcftools it sometimes fails to account for the split (depending on the header information) https://github.com/samtools/bcftools/issues/771

Additionally splitting multiallelics can have weird results

chr1  100 .  A  C,T   0/0 0/1 2/2

split by bcftools

chr1  100 . A  C   0/0 0/1 0/0
chr1 100 . A  T    0/0 0/0  1/1

Note that sample 3 for A>C is now A/A instead of T/T for the first split entry

multiallelics can be a headache

mesnger commented 6 years ago

there seems to be a bug in GATK4.0.2.1 haplotypecaller where if you pass group annotation -G StandardHCAnnotation -G AS_StandardAnnotation (one or both), that the AD in FORMAT column is some how missing. the subsequent step of genotypeGVCFs failed due to the missing AD column.

sv2 succeed when i manually input the AD column, value, and header correctly. but the effort for changing all output is quite a burden. So my conclusion is to go back to GATK 3.8 which does normally output AD column in FORMAT.

In the end the problem was gatk4 HC and not sv2 afterall.

I thank you for making a great tool, and also helping me find the error.

dantaki commented 6 years ago

Awesome, yeah I hear GATK 4+ is not as stable as people would like. Glad to hear your problem was solved. Let me know if there are any other questions.