DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
479 stars 119 forks source link

hisat2_extract_snps_haplotypes_VCF.py can't handle GATK VCF file? #76

Open zhaouu opened 8 years ago

zhaouu commented 8 years ago

I use hisat2_extract_snps_haplotypes_VCF.py to extract SNPs and haplotypes from a VCF. When I used the VCF file which extracted from samtools, it can ran successfully. But when I used the VCF file which extracted from GATK, I get an error massge:

Traceback (most recent call last): File "/public/home/hzhao/software/hisat2-2.0.4/hisat2_extract_snps_haplotypes_VCF.py", line 894, in args.verbose) File "/public/home/hzhao/software/hisat2-2.0.4/hisat2_extract_snps_haplotypes_VCF.py", line 732, in main genotypes) File "/public/home/hzhao/software/hisat2-2.0.4/hisat2_extract_snps_haplotypes_VCF.py", line 690, in add_vars tmp_vars = extract_vars(chr_dic, chr, pos, ref_allele, alt_alleles, varID) File "/public/home/hzhao/software/hisat2-2.0.4/hisat2_extract_snps_haplotypes_VCF.py", line 113, in extract_vars assert ref_allele2 != alt_allele AssertionError

And, I GATK latest version, they use an asterisk represents a deleted allele. I think it's not compatible with this script. How can I generate the snp and haplotype file for hisat2? Hope some suggests for me. Thanks very much!

infphilo commented 8 years ago

Would you like to show some example lines of the VCF file from GATK including deletions so that I can modify the HISAT2 scripts?

zhaouu commented 8 years ago

A sample VCF file can be downloaded in https://drive.google.com/open?id=0B8ijxd00y5RZUDJaVVF4aUtNZmM. The link ( http://gatkforums.broadinstitute.org/gatk/discussion/5904/asterisk-in-the-alt-feild-of-vcf) can find the explanation of asterisk "*" in GATK VCF file. Because the VCF file don't have ID column, I modified the hisat2_extract_snps_haplotypes_VCF.py script: chr, pos, varID, ref_allele, alt_alleles, qual, filter, info = line.strip().split('\t') varID = '%s_%s' %(chr,pos) Commend line to run the script: `~/software/hisat2-2.0.4/hisat2_extract_snps_haplotypes_VCF.py --non-rs --verbose ~/data/Brassicanapus/Brassica_napus_v4.1.chromosomes.fa /public/home/hhao/data/rape_seq/RNAseq/test/test.vcf.gz ~/data/rape_seq/hisat2_snp

Thanks for your help.