Open rdey-insitro opened 2 weeks ago
Here is a potential modification to fix this problem:
bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\n' --exclude 'COUNT(GT="het")=0 || COUNT(GT="het" & ((TYPE="snp" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.15) | (TYPE="indel" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.20)))>0' {{input}} | gzip > {{output}}
Hi @rdey-insitro , thanks for your issue. We've run QC on the UKB WES data without running into the issue of not having any genotypes/variants left, and we also did some dedicated testing of this particular bcftools command. To help us diagnose what is going wrong for you, could you please provide us one of the following?
vcf_files_list
and reference_fasta_file
In case this wasn't clear, also keep in mind as we try to understand the issue that the bcftools command is outputting a TSV of variants that should be filtered out (which is done at a later step in Python).
Looking forward to your reply.
Hi,
Let's use the fake vcf files in the example directory. I am using the same config file example/config/deeprvat_preprocess_config.yaml
provided. I am running the command snakemake -j 1 --snakefile ~/deeprvat/pipelines/preprocess_with_qc.snakefile --configfile /home/ec2-user/deeprvat/example/config/deeprvat_preprocess_config.yaml
.
This is the set of "filtered out" variants due to allelic imbalance on chr 21:
zcat workdir/qc/allelic_imbalance/test_vcf_data_c21_b1.tsv.gz | head
chr21 5033887 G A
chr21 5038298 A G
chr21 5051437 C A
chr21 5051474 T A
chr21 5053701 T A
chr21 10413724 C T
And here is the vcf readouts from chr 21:
zcat /home/ec2-user/deeprvat/example/preprocess/data/vcf/test_vcf_data_c21_b1.vcf.gz
##fileformat=VCFv4.2
##source=VCFake 0.1.0
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##contig=<ID=chr21>
##reference=ftp://ftp.example.com/sample.fa
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S10000001 S10000002 10000003 10000004 10000005 10000006 10000007 10000008 10000009 10000010
chr21 10413673 . T A . PASS . GT:AD 0/0:10,20 0/0:15,25 0/0:12,18 0/0:22,25 0/0:30,15 0/0:13,17 0/0:40,40 0/0:28,30 0/0:14,18 0/0:10,10
chr21 10413724 . C T . PASS . GT:AD 0/0:20,30 1/0:25,35 0/0:30,40 0/0:35,45 0/0:40,50 0/0:30,35 0/0:45,55 0/0:32,38 0/0:27,33 0/0:20,25
chr21 5033887 . G A . PASS . GT:AD 0/0:40,50 0/1:35,45 0/0:25,30 0/0:20,25 0/0:38,42 0/0:23,27 0/0:35,40 0/0:40,40 0/0:30,30 0/0:25,25
chr21 5035658 . C T . PASS . GT:AD 0/0:30,20 0/0:25,15 0/0:35,30 0/0:40,25 0/0:45,20 0/0:30,15 0/0:28,12 0/0:22,8 0/0:20,10 0/0:15,5
chr21 5038298 . A G . PASS . GT:AD 0/0:40,40 1/0:35,35 0/0:30,30 0/0:25,25 0/0:20,20 0/0:15,15 0/0:10,10 0/0:5,5 0/0:45,45 0/0:50,50
chr21 5038313 . C T . PASS . GT:AD 0/0:20,15 0/0:25,20 0/0:30,25 0/0:35,30 0/0:40,35 0/0:45,40 0/0:50,45 0/0:55,50 0/0:60,55 0/0:65,60
chr21 5051437 . C A . PASS . GT:AD 0/0:30,25 0/1:25,20 1/0:20,15 0/0:15,10 0/0:10,5 0/0:5,0 0/0:0,0 0/0:0,0 0/0:0,0 0/0:0,0
chr21 5051474 . T A . PASS . GT:AD 0/0:20,20 0/0:30,30 0/1:25,25 0/0:35,35 0/0:40,40 0/0:50,50 0/0:45,45 0/0:35,35 0/0:25,25 0/0:20,20
chr21 5052251 . C A . PASS . GT:AD 0/0:30,30 0/0:25,25 0/0:20,20 1/1:15,15 0/0:10,10 0/0:5,5 0/0:0,0 0/0:0,0 0/0:0,0 0/0:0,0
chr21 5053701 . T A . PASS . GT:AD 0/0:20,10 0/0:25,15 0/0:30,20 0/1:35,25 0/0:40,30 0/0:45,35 0/0:50,40 0/0:55,45 0/0:60,50 0/0:65,55
Notice that, all the variants in this vcf file that have at least one heterozygous sample have been filtered out, even though for example chr21 5033887
variant has good allele depth ratio for the one sample where the ALT is present.
After some more digging, found that this might be due to a change in bcftools version. This may explain why previous extensive tests have not found this issue. I am using bcftools v1.17 for the above output which is consistent with the deeprvat_preprocessing_env.yml
file. I played around with different versions of bcftools and found that using both the v1.17 and v1.20 (latest) lead to the same issue, whereas using a much older version v1.10 resolves this issue. I am guessing somewhere between v1.10 and v1.17 bcftools has changed the way it interprets the logicals for the --exclude
filter.
Thanks very much for looking more deeply into this. We'll leave this issue open while we look into a more permanent fix, but it sounds like using v1.10 might be a temporary workaround.
I was trying to QC the 200K UKB WES data using the preprocessing with QC pipeline and I am finding the final genotypes.h5 file to have no genotypes at all (the file size was 8Mb). I did some digging and it seems all the available variants in the bcf file has been filtered out at the allelic imbalance step. The
qc/allelic_imbalance/myfile.vcf.gz
surprisingly includes all the available variant IDs. I am suspecting the following logic issue:Affected file:
pipelines/preprocessing/qc.snakefile
Affected code:bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\n' --exclude 'COUNT(GT="het")=0 || (GT="het" & ((TYPE="snp" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.15) | (TYPE="indel" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.20)))' {{input}} | gzip > {{output}}
I think this is not doing the allelic imbalance QC properly. The above step in my understanding excludes variant-sample pairs from the bcf query where the above criteria is met. But, almost all variants have at least one sample where the above criteria is not met. Then, that variant would sneak through to the variant list in the allelic_imbalance directory and end up being excluded in the final genotype.h5 file. So, instead of filtering out variants for which all the heterozygous samples have allele imbalance, this step is filtering out variants if any heterozygous sample has allele imbalance.
I manually ran the same bcftools code on the normalized bcf files, and the output contains all the variant IDs. If the
--exclude
flag is replaced by--include
flag, still the output would contain all the variant IDs.Please let me know if it is an issue with the logical criteria above, or there is something I am doing wrong.