FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
53 stars 20 forks source link

SNPsplit_genome_preparation processing error on custom vcf and genome #47

Closed wwang-chcn closed 3 years ago

wwang-chcn commented 3 years ago

I am experiencing processing error when using SNPsplit_genome_preparation.

SNPsplit Genome Preparation version: 0.4.0

command:

SNPsplit_genome_preparation --vcf_file ~/project/zebrafish_resequencing/deepvariant/merged.vcf.gz --strain TU --dual_hybrid --strain2 TL --reference_genome ref_genome

processing log:

Strain defined as 'TU' (strain index: 9)
Dual Hybrid strain selected
Strain2 defined as 'TL' (strain2 index: 10)
Reference genome folder provided is ref_genome/ (absolute path is '/mnt/Storage/home/wangwen/source/bySpecies/danRer11_2/strains/ref_genome/)'

Summarising SNPsplit Genome Preparation Parameters
==================================================
Processing SNPs from VCF file:      /mnt/Storage/home/wangwen/project/zebrafish_resequencing/deepvariant/merged.vcf.gz
Reading/filtering VCF file:     Yes (default)
Reference genome:           /mnt/Storage/home/wangwen/source/bySpecies/danRer11_2/strains/ref_genome/
N-masking:              Yes
Full SNP genome:            Yes
SNP strain:             TU
SNP strain 2:               TL
Dual hybrid, new Ref/SNP:       TU/TL

Using the following chromosomes (detected from VCF file >>/mnt/Storage/home/wangwen/project/zebrafish_resequencing/deepvariant/merged.vcf.gz<<):
chr10   chr8    chr17   chr21   chrM    chr25   chr9    chr22   chr15   chr5    chr2    chr6    chr12   chr16   chr7    chr11   chr1    chr3    chr14   chr18   chr24   chr23   chr20   chr19   chr4    chr13

Analysing SNP fields for name >TU<
Use of uninitialized value $fi in string ne at /mnt/Storage/home/wangwen/bin/SNPsplit_genome_preparation line 976, <IN> line 44.
Use of uninitialized value $fi in string ne at /mnt/Storage/home/wangwen/bin/SNPsplit_genome_preparation line 976, <IN> line 45.

vcf file

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##DeepVariant_version=1.0.0
##contig=<ID=chr1,length=59578282>
##contig=<ID=chr10,length=45420867>
##contig=<ID=chr11,length=45484837>
...
##bcftools_mergeVersion=1.6-37-g9a38c20+htslib-1.6-38-g8003166
##bcftools_mergeCommand=merge -Oz -o merged.vcf.gz TU.filtered.vcf.gz TL.filtered.vcf.gz; Date=Tue Apr  6 16:26:23 2021
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TU      TL
chr1    119     .       TG      T       0       RefCall .       GT:GQ:DP:AD:VAF:PL:FI   ./.:.:.:.:.:.:. 0/0:29:27:24,3:0.111111:0,28,48:0
chr1    187     .       A       T       0       RefCall .       GT:GQ:DP:AD:VAF:PL:FI   0/0:20:11:9,2:0.181818:0,20,53:0        ./.:.:.:.:.:.:.
chr1    235     .       C       T       0.4     RefCall .       GT:GQ:DP:AD:VAF:PL:FI   ./.:.:.:.:.:.:. ./.:10:38:32,5:0.131579:0,9,37:0

Could help me?

FelixKrueger commented 3 years ago

Hi @wwang-chcn,

It appears the VCF file is somewhat reasonably formatted, so the SNPsplit approach might just work.

You will need to identify the line that splits the FORMAT field according to the information from the field INFO:

https://github.com/FelixKrueger/SNPsplit/blob/388128c7a9e1e7419de020bf34d15d23cf051ae1/SNPsplit_genome_preparation#L951

Since INFO is GT:GQ:DP:AD:VAF:PL:FI, I assume it could be something like this:

my ($gt,$gq,$dp,$ad,$vaf,$ad,$pl,$fi);

# INFO:      GT:GQ:DP:AD:VAF:PL:FI
($gt,$gq,$dp,$ad,$vaf,$ad,$pl,$fi) = split/:/,$strain;

You might then also have to think about which genotypes you wany to tolerate, e.g. only homozygous variants such as 1/1, or whether or not you require the filter value ($fi) to be 1. You should be able to read through the script and make the required adjustments.

Let me know if you encounter additional issues.

Cheers, Felix

wwang-chcn commented 3 years ago

Hi @wwang-chcn,

It appears the VCF file is somewhat reasonably formatted, so the SNPsplit approach might just work.

You will need to identify the line that splits the FORMAT field according to the information from the field INFO:

https://github.com/FelixKrueger/SNPsplit/blob/388128c7a9e1e7419de020bf34d15d23cf051ae1/SNPsplit_genome_preparation#L951

Since INFO is GT:GQ:DP:AD:VAF:PL:FI, I assume it could be something like this:

my ($gt,$gq,$dp,$ad,$vaf,$ad,$pl,$fi);

# INFO:      GT:GQ:DP:AD:VAF:PL:FI
($gt,$gq,$dp,$ad,$vaf,$ad,$pl,$fi) = split/:/,$strain;

You might then also have to think about which genotypes you wany to tolerate, e.g. only homozygous variants such as 1/1, or whether or not you require the filter value ($fi) to be 1. You should be able to read through the script and make the required adjustments.

Let me know if you encounter additional issues.

Cheers, Felix

Thanks Felix,

I just find the key about this question and create a pull requst #48 that can handle variable order of INFO.

Yours, Wen

FelixKrueger commented 3 years ago

I have now tried to write a stable auto-detect version, which works for both the v5 and v7 Mouse Genomes Project files. Could you clone the current dev version and see if it also works with your fishes? Edited here: 5e605d86c28ba32b1b9b3388117a64e7ac397097

FelixKrueger commented 3 years ago

So I assume we can close this issue now?

wwang-chcn commented 3 years ago

Yes :)