FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
51 stars 19 forks source link

Error with SNPsplit_genome_preparation #52

Closed jprippe closed 2 years ago

jprippe commented 2 years ago

Hi Felix,

I'm having an issue with the SNPsplit_genome_preparation (v0.5.0) processing step. There seems to be an issue with variable definition, but I'm unfortunately not familiar enough with Perl to troubleshoot myself.

Command:

SNPsplit_genome_preparation --vcf_file snp.noMiss.fix.withFI.vcf --strain lane1-A1-A_S1_L001_1 --strain2 lane1-A3-C_S3_L001_1 --reference_genome $STOCKYARD/stampede2/db/Amil_chr1/

Output log:

Strain defined as 'lane1-A1-A_S1_L001_1' (strain index: 9)
Strain 2 specified, setting option '--dual_hybrid'
Dual Hybrid strain selected
Strain2 defined as 'lane1-A3-C_S3_L001_1' (strain2 index: 11)
Reference genome folder provided is /work/05703/jprippe/stampede2/db/Amil_chr1/ (absolute path is '/work2/05703/jprippe/stampede2/db/Amil_chr1/)'

Summarising SNPsplit Genome Preparation Parameters
==================================================
Processing SNPs from VCF file:      snp.noMiss.fix.withFI.vcf
Reading/filtering VCF file:     Yes (default)
Reference genome:           /work2/05703/jprippe/stampede2/db/Amil_chr1/
N-masking:              Yes
Full SNP genome:            Yes
SNP strain:             lane1-A1-A_S1_L001_1
SNP strain 2:               lane1-A3-C_S3_L001_1
Dual hybrid, new Ref/SNP:       lane1-A1-A_S1_L001_1/lane1-A3-C_S3_L001_1

Using the following chromosomes (detected from VCF file >>snp.noMiss.fix.withFI.vcf<<):
chr1,length=39361238,assembly=null,md5=a4bc6abe85cd962980b8fe1c5a7ddb94

Analysing SNP fields for name >lane1-A1-A_S1_L001_1<
Using FORMAT field index: 8
Using INFO field index: 7
GT index not defined, checking...
Setting GT index to >>0<<
Setting FI index to >>11<<
Can't use an undefined value as a symbol reference at /home1/05703/jprippe/SNPsplit-0.5.0/SNPsplit_genome_preparation line 1145, <IN> line 63.

If it's worth noting, I'm using a custom VCF and I had the program working a few weeks ago after modifying the v0.4.0 code to address a different error (following #47). Since then, I've had to switch computing systems and haven't been able to reproduce the success with either the newest release or previous one.

VCF (in case it helps...):

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##BisSNP-1.0.0 Program Args= -R /work/05703/jprippe/stampede2/db/Amil_chr1/chr1.fasta -T BisulfiteGenotyper -nonDirectional -nt 12 -I lane1-A1-A_S1_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A2-B_S2_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A3-C_S3_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -I lane1-A4-D_S4_L001_1.trim_bismark_bt2_pe.deduplicated_rg.bam -vfn1 cpg.raw.vcf -vfn2 snp.raw.vcf
##FORMAT=<ID=BQ,Number=.,Type=Float,Description="Average base quality for reads supporting alleles. For each allele, in the same order as listed">
##FORMAT=<ID=BRC6,Number=.,Type=Integer,Description="Bisulfite read counts(not filtered by minConv): 1) number of C in cytosine strand, 2) number of T in cytosine strand, 3) number of A/G/N in cytosine strand, 4) number of G in guanine strand, 5) number of A in guanine strand, 6) number of C/T/N in guanine strand.">
##FORMAT=<ID=CM,Number=1,Type=Integer,Description="Number of Unconverted Cytosines in this position(filtered by minConv)">
##FORMAT=<ID=CP,Number=1,Type=String,Description="Best Cytosine Context">
##FORMAT=<ID=CU,Number=1,Type=Integer,Description="Number of Converted Cytosines in this position(filtered by minConv)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth, not filtered by mapping quality and base quality score criteria yet">
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="Reads supporting ALT, only keep good base. Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles">
##FORMAT=<ID=GP,Number=G,Type=Integer,Description="Normalized, Phred-scaled posteriors for genotypes as defined in the VCF specification">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Somatic status of the variant. 1) wildtype; 2) germline, somatic; 3) LOH; 4) post-transcriptional modification; 5) unknown">
##INFO=<ID=CS,Number=1,Type=Character,Description="Strand of cytosine relative to reference genome. Does not apply to non-cytosine positions">
##INFO=<ID=Context,Number=.,Type=String,Description="Cytosine Context. e.g. 'CG' means homozygous CG sites, while heterozygous CG sites will be output as IUPAC code">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth, not filtered by mapping quality and base quality score criteria yet">
##INFO=<ID=HQ,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=REF,Number=.,Type=String,Description="Reference Cytosine Context. if all cytosine context found are in reference genome, REF=0; otherwise, e.g. it will be REF=CG for position not confidently called as CG or real context is CH.">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##assembly=none
##center=none
##contig=<ID=chr1,length=39361238,assembly=null,md5=a4bc6abe85cd962980b8fe1c5a7ddb94,species=null>
##fileDate=20210802
##geneAnno=none
##phasing=none
##reference=<ID=null,Source=file:/work2/05703/jprippe/stampede2/db/Amil_chr1/chr1.fasta>
##tcgaversion=1.1
##vcfProcessLog=<InputVCF=<>,InputVCFSource=<>,InputVCFVer=<1.1>,InputVCFParam=<> InputVCFgeneAnno=<>>
##bcftools_viewVersion=1.13+htslib-1.13
##bcftools_viewCommand=view -e GT[*]="mis" snp.sorted.vcf; Date=Fri Aug  6 10:56:27 2021
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##bcftools_annotateVersion=1.13+htslib-1.13
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A1-A_S1_L001_1 -c CHROM,POS,FORMAT/FI snp.noMiss.fix.vcf; Date=Fri Aug  6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A2-B_S2_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug  6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A3-C_S3_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug  6 16:00:26 2021
##bcftools_annotateCommand=annotate -a annot.noMiss.txt.gz -h hdr.txt -s lane1-A4-D_S4_L001_1 -c CHROM,POS,FORMAT/FI -; Date=Fri Aug  6 16:00:26 2021
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  lane1-A1-A_S1_L001_1    lane1-A2-B_S2_L001_1    lane1-A3-C_S3_L001_1    lane1-A4-D_S4_L001_1
chr1    193 .   T   C,G 31.76   PASS    DP=23;MQ0=2;NS=4    GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI  0/0:37,nan:0,6,1,0,0,0:.:.:.:.:2,4,0,0:0,27,28:27:5:1   0/0:37,nan:0,1,0,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1   1/1:37,nan:0,6,0,0,1,0:.:.:.:.:5,2,0,0:2147483647,4,0:4:5:1 1/1:.:0,6,0,0,2,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1    245 .   T   C,G 31.76   PASS    DP=27;MQ0=3;NS=4    GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI  0/0:37,nan:0,7,0,0,2,0:.:.:.:.:4,5,0,0:0,31,34:31:5:1   0/0:37,nan:0,1,0,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1   1/1:.:0,5,0,0,5,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1  1/1:.:0,5,0,0,2,0:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1    246 .   A   G,C 31.76   PASS    DP=28;MQ0=3;NS=4    GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI  0/0:37,nan:0,0,8,1,0,1:.:.:.:.:4,5,0,0:0,31,34:31:5:1   0/0:37,nan:0,0,1,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1   1/1:.:0,0,4,0,0,6:.:.:.:.:0,0,0,0:0,57,6:6:5:1  1/1:.:0,0,5,0,0,2:.:.:.:.:0,0,0,0:0,57,6:6:5:1
chr1    247 .   A   G,C 36.03   PASS    DP=29;MQ0=3;NS=4    GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS:FI  0/0:34.3,37:0,0,8,0,0,2:.:.:.:.:4,5,1,0:0,36,52:36:5:1  0/0:37,nan:0,0,1,0,0,0:.:.:.:.:0,1,0,0:0,32,35:32:5:1   1/1:.:0,0,3,0,0,7:.:.:.:.:0,0,0,0:0,57,6:6:5:1  1/1:37,37:0,0,6,0,0,2:.:.:.:.:1,6,1,0:2147483647,8,0:8:5:1

Any ideas?

Thanks! JP

FelixKrueger commented 2 years ago

Hi JP,

This is a bit tricky to answer (especially as it is late now and we've got friends round...). Do you think you could provide me your VCF file to take a look myself? I normally don't really support any custom bespoke genome preparation, but maybe it is something fairly trivial? If you could send it via email (just a few (thousand) lines might be enough) that would be great.

Cheers, Felix

FelixKrueger commented 2 years ago

@jprippe found that the chromosome line in the VCF file:

##contig=<ID=chr1,length=39361238,assembly=null,md5=a4bc6abe85cd962980b8fe1c5a7ddb94,species=null>

contains several entries separated by , (unlike the Mouse Genomes Project files), so changing the regex to a non-greedy match did the trick. Since it doesn't hurt with the MGP files, I changed it for the default processing as well. Thanks for reporting and fixing your own issue :) Good luck!