FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
52 stars 20 forks source link

How to define high-confidence SNPs? #78

Closed Ulixmanna closed 8 months ago

Ulixmanna commented 8 months ago

Hi, I now have 6 ovaries from cattle breed A and 1 sperm sample from breed B for whole genome resequencing and call snp using gatk (BWA comparison uses the NCBI reference genome of breed A), the process is

1) gatk HardFiltration

gatk VariantFiltration --variant WGS.raw.SNP.vcf.gz --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "snp_filter" --genotype-filter-expression "DP < 2 || DP > 
50" --genotype-filter-name "dp_fail" --output WGS.gatk_filter_zhu.snp.vcf.gz

2)Filter the filtered passes and keep the alleles

gatk SelectVariants --exclude-filtered true --restrict-alleles-to BIALLELIC --reference genomic.fna --variant WGS.gatk_filter_zhu.snp.vcf --output con.hardfiltered.snp.vcf

3) snp.vcf

According to the PASS in the FILTER in column 7, add FI information in column 9 (PASS is defined as 1, FAIL is defined as 0, and the header is modified to ##FORMAT=)

My aim is to perform allele expression analysis on RNA seq of hybrid embryos produced from breed A ovaries and breed B sperm, but I encountered two problems while constructing the N mask genome:

Problem 1: When running SNPsplit_genome_preparation --reference_genome reference --vcf snp.vcf --strain ovary1 --strain2 sperm --dual_hybrid the-- strain is set to any of the 6 ovaries of cattle breed A (which is what I'm using now), or is it set to all 6 ovary samples for a more complete reflection of breed A's variation (and how exactly should this be accomplished)

Question 2: What exactly are the two indicators (GT and FI) for SNPSplit to judge high-confidence SNPs (because all the snp loci FI in my filtered snp.vcf file are already 1, and as a result, there are still more than 400,000 loci that are low confidence being filtered out)

SNP position summary for strain sperm (based on genome build GRCm39)
============================================================
Positions read in total:        21702510

4615627 SNP were homozygous. Of these:
23227   SNP were homozygous and passed high confidence filters and were thus included into the D21013078 genome

Not included into D21013078 genome:
6098066 had the same sequence as the reference
0               had no clearly defined alternative base
10988817          Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
4592400         were homozygous but the filtering call was low confidence

Printed a single list of all SNPs to >all_SNPs_D21013078_GRCm39.txt.gz<...

my vcf file:

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at thi
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=dp_fail,Description="DP < 2 || DP > 50">
##FILTER=<ID=snp_filter,Description="QD < 2.0 || FS > 60.0 || MQ < 40.0 || SOR > 3.0 || MQRankSum < -12.
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the ord
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad
##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="High confidence (1) or low confidence (0) based on so
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing ho
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first vari
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fi
##GATKCommandLine=<ID=CombineGVCFs,CommandLine="CombineGVCFs --output /storage2/anlei/zhaomengmeng/Bovin
##GATKCommandLine=<ID=SelectVariants,CommandLine="SelectVariants --output con.hardfiltered.snp.vcf --exc
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the 
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order 
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. 
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filter
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess hete
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect 
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele 
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele fr
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) f
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect 
##contig=<ID=NC_037328.1,length=158534110,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##contig=<ID=NC_037329.1,length=136231102,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##contig=<ID=NC_037330.1,length=121005158,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##contig=<ID=NC_037331.1,length=120000601,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##contig=<ID=NC_037332.1,length=120089316,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##contig=<ID=NW_020192292.1,length=9309904,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##contig=<ID=NW_020192293.1,length=27572,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##contig=<ID=NW_020192294.1,length=986051,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##contig=<ID=NC_006853.1,length=16338,assembly=GCF_002263795.2_ARS-UCD1.3_genomic.fna>
##reference=file:///storage2/anlei/zhaomengmeng/Bovine/reference/GCF_002263795.2_2018/GCF_002263795.2_ARS-UCD1.3_genomic.fna
##source=CombineGVCFs
##source=SelectVariants
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sperm       ovary1       ovary2      ovary3      ovary4       ovary5       ovary6
NC_037328.1     25776   .       A       C       52.69   PASS    AC=1;AF=0.100;AN=10;BaseQRankSum=0.00;DP=21;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=0.200;MQ=42.62;MQRankSum=-1.111e+00;QD=8.78;ReadPosRankSum=-4.310e-01;SOR=0.307     GT:AD:DP:FT:GQ:PL:FI    ./.:0,0:0:PASS:.:0,0,0:1        0/0NC_037328.1     27151   .       C       A       55.46   PASS    AC=1;AF=0.071;AN=14;BaseQRankSum=0.00;DP=41;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.071;MQ=60.00;MQRankSum=0.00;QD=9.24;ReadPosRankSum=0.00;SOR=0.693 GT:AD:DP:GQ:PL:FI       0/0:7,0:7:21:0,21,172:1 0/1:4,2:6:65:65,0,149:1 0/0NC_037328.1     27273   .       C       T       411.50  PASS    AC=6;AF=0.600;AN=10;BaseQRankSum=0.00;DP=25;ExcessHet=3.6798;FS=23.314;MLEAC=6;MLEAF=0.600;MQ=42.55;MQRankSum=0.00;QD=24.21;ReadPosRankSum=1.96;SOR=1.184       GT:AD:DP:GQ:PL:FI       ./.:0,0:0:.:0,0,0:1     ./.:1,0:1:.:0,0,0:1NC_037328.1     27605   .       T       A       209.09  PASS    AC=3;AF=0.250;AN=12;BaseQRankSum=0.00;DP=41;ExcessHet=4.3933;FS=2.199;MLEAC=4;MLEAF=0.333;MQ=45.19;MQRankSum=0.00;QD=11.00;ReadPosRankSum=-5.660e-01;SOR=1.565  GT:AD:DP:GQ:PL:FI       ./.:0,0:0:.:0,0,0:1     0/1:5,3:8:95:95,0,1NC_037328.1     27764   .       T       A       282.72  PASS    AC=2;AF=0.167;AN=12;BaseQRankSum=1.11;DP=66;ExcessHet=3.4242;FS=0.000;MLEAC=2;MLEAF=0.167;MQ=51.21;MQRankSum=-9.960e-01;QD=10.47;ReadPosRankSum=0.393;SOR=0.892 GT:AD:DP:GQ:PL:FI       ./.:0,0:0:.:0,0,0:1     0/0:9,0:9:27:0,27,3NC_037328.1     27916   .       A       C       168.46  PASS    AC=1;AF=0.083;AN=12;BaseQRankSum=-1.610e-01;DP=73;ExcessHet=3.0103;FS=4.468;MLEAC=2;MLEAF=0.167;MQ=43.56;MQRankSum=1.85;QD=6.74;ReadPosRankSum=-1.697e+00;SOR=2.066     GT:AD:DP:GQ:PGT:PID:PL:PS:FI    ./.:0,0:0:.:.:.:0,0,0:1 0/0`
FelixKrueger commented 8 months ago

Sorry for the late reply, I was travelling all last week, and on top the issue is marked as complete; does that mean you found the answers in the meantime?

Just generally: SNPsplit is designed to use the VCF files of the mouse genome project, which tend to have high confidence homozygous SNP calls for some 50 strains (compared to the reference strain C57BL/6). So the genome calls are typically 0/0 (same as reference), or 1/1, 2/2 for homozygous variants.

In your case, you would have to arrive at a situation where you use ovaries from breed A, and perform variation calling against sperm from breed B. The VCF feel then needs to be in the same format as the one from the mouse genomes project (e.g. only contain homozygous variants of breed B against breed A); alternatively, if you find heterozygous calls (e.g. 0/1 or 1/0) you would have to change the logic within the SNPsplit genome preparation. There are number of closed issues where people have tried a similar approach for non-mouse species, but I am afraid it is a little fiddly as it is not the original intended use case for SNPsplit.

If you wanted to use E. coli as for normalisation in CUT&TAG I don't think this needs any N-masking, but I haven't ever done this myself to be honest.