Problem with the logic for allelic imbalance qc step

rdey-insitro commented 2 weeks ago

I was trying to QC the 200K UKB WES data using the preprocessing with QC pipeline and I am finding the final genotypes.h5 file to have no genotypes at all (the file size was 8Mb). I did some digging and it seems all the available variants in the bcf file has been filtered out at the allelic imbalance step. The qc/allelic_imbalance/myfile.vcf.gz surprisingly includes all the available variant IDs. I am suspecting the following logic issue:

Affected file: pipelines/preprocessing/qc.snakefile Affected code: bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\n' --exclude 'COUNT(GT="het")=0 || (GT="het" & ((TYPE="snp" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.15) | (TYPE="indel" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.20)))' {{input}} | gzip > {{output}}

I think this is not doing the allelic imbalance QC properly. The above step in my understanding excludes variant-sample pairs from the bcf query where the above criteria is met. But, almost all variants have at least one sample where the above criteria is not met. Then, that variant would sneak through to the variant list in the allelic_imbalance directory and end up being excluded in the final genotype.h5 file. So, instead of filtering out variants for which all the heterozygous samples have allele imbalance, this step is filtering out variants if any heterozygous sample has allele imbalance.

I manually ran the same bcftools code on the normalized bcf files, and the output contains all the variant IDs. If the --exclude flag is replaced by --include flag, still the output would contain all the variant IDs.

Please let me know if it is an issue with the logical criteria above, or there is something I am doing wrong.

rdey-insitro commented 2 weeks ago

Here is a potential modification to fix this problem: bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\n' --exclude 'COUNT(GT="het")=0 || COUNT(GT="het" & ((TYPE="snp" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.15) | (TYPE="indel" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.20)))>0' {{input}} | gzip > {{output}}

bfclarke commented 2 weeks ago

Hi @rdey-insitro , thanks for your issue. We've run QC on the UKB WES data without running into the issue of not having any genotypes/variants left, and we also did some dedicated testing of this particular bcftools command. To help us diagnose what is going wrong for you, could you please provide us one of the following?

An example VCF file that runs into the issue you are describing (it goes without saying that it should not contain real genotypes 🙂), or
An explicit list of steps that you have carried out (including descriptions of where you got necessary input files, e.g., your vcf_files_list and reference_fasta_file

In case this wasn't clear, also keep in mind as we try to understand the issue that the bcftools command is outputting a TSV of variants that should be filtered out (which is done at a later step in Python).

Looking forward to your reply.

rdey-insitro commented 2 weeks ago

Hi,

Let's use the fake vcf files in the example directory. I am using the same config file example/config/deeprvat_preprocess_config.yaml provided. I am running the command snakemake -j 1 --snakefile ~/deeprvat/pipelines/preprocess_with_qc.snakefile --configfile /home/ec2-user/deeprvat/example/config/deeprvat_preprocess_config.yaml.

This is the set of "filtered out" variants due to allelic imbalance on chr 21:

zcat workdir/qc/allelic_imbalance/test_vcf_data_c21_b1.tsv.gz | head
chr21   5033887 G       A
chr21   5038298 A       G
chr21   5051437 C       A
chr21   5051474 T       A
chr21   5053701 T       A
chr21   10413724        C       T

And here is the vcf readouts from chr 21:

zcat /home/ec2-user/deeprvat/example/preprocess/data/vcf/test_vcf_data_c21_b1.vcf.gz
##fileformat=VCFv4.2
##source=VCFake 0.1.0
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##contig=<ID=chr21>
##reference=ftp://ftp.example.com/sample.fa
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S10000001       S10000002       10000003        10000004        10000005        10000006     10000007 10000008        10000009        10000010
chr21   10413673        .       T       A       .       PASS    .       GT:AD   0/0:10,20       0/0:15,25       0/0:12,18       0/0:22,25       0/0:30,15       0/0:13,17     0/0:40,40       0/0:28,30       0/0:14,18       0/0:10,10
chr21   10413724        .       C       T       .       PASS    .       GT:AD   0/0:20,30       1/0:25,35       0/0:30,40       0/0:35,45       0/0:40,50       0/0:30,35     0/0:45,55       0/0:32,38       0/0:27,33       0/0:20,25
chr21   5033887 .       G       A       .       PASS    .       GT:AD   0/0:40,50       0/1:35,45       0/0:25,30       0/0:20,25       0/0:38,42       0/0:23,27    0/0:35,40        0/0:40,40       0/0:30,30       0/0:25,25
chr21   5035658 .       C       T       .       PASS    .       GT:AD   0/0:30,20       0/0:25,15       0/0:35,30       0/0:40,25       0/0:45,20       0/0:30,15    0/0:28,12        0/0:22,8        0/0:20,10       0/0:15,5
chr21   5038298 .       A       G       .       PASS    .       GT:AD   0/0:40,40       1/0:35,35       0/0:30,30       0/0:25,25       0/0:20,20       0/0:15,15    0/0:10,10        0/0:5,5 0/0:45,45       0/0:50,50
chr21   5038313 .       C       T       .       PASS    .       GT:AD   0/0:20,15       0/0:25,20       0/0:30,25       0/0:35,30       0/0:40,35       0/0:45,40    0/0:50,45        0/0:55,50       0/0:60,55       0/0:65,60
chr21   5051437 .       C       A       .       PASS    .       GT:AD   0/0:30,25       0/1:25,20       1/0:20,15       0/0:15,10       0/0:10,5        0/0:5,0 0/0:0,0       0/0:0,0 0/0:0,0 0/0:0,0
chr21   5051474 .       T       A       .       PASS    .       GT:AD   0/0:20,20       0/0:30,30       0/1:25,25       0/0:35,35       0/0:40,40       0/0:50,50    0/0:45,45        0/0:35,35       0/0:25,25       0/0:20,20
chr21   5052251 .       C       A       .       PASS    .       GT:AD   0/0:30,30       0/0:25,25       0/0:20,20       1/1:15,15       0/0:10,10       0/0:5,5 0/0:0,0       0/0:0,0 0/0:0,0 0/0:0,0
chr21   5053701 .       T       A       .       PASS    .       GT:AD   0/0:20,10       0/0:25,15       0/0:30,20       0/1:35,25       0/0:40,30       0/0:45,35    0/0:50,40        0/0:55,45       0/0:60,50       0/0:65,55

Notice that, all the variants in this vcf file that have at least one heterozygous sample have been filtered out, even though for example chr21 5033887 variant has good allele depth ratio for the one sample where the ALT is present.

rdey-insitro commented 2 weeks ago

After some more digging, found that this might be due to a change in bcftools version. This may explain why previous extensive tests have not found this issue. I am using bcftools v1.17 for the above output which is consistent with the deeprvat_preprocessing_env.yml file. I played around with different versions of bcftools and found that using both the v1.17 and v1.20 (latest) lead to the same issue, whereas using a much older version v1.10 resolves this issue. I am guessing somewhere between v1.10 and v1.17 bcftools has changed the way it interprets the logicals for the --exclude filter.

bfclarke commented 1 week ago

Thanks very much for looking more deeply into this. We'll leave this issue open while we look into a more permanent fix, but it sounds like using v1.10 might be a temporary workaround.

PMBio / deeprvat

Problem with the logic for allelic imbalance qc step #134