freeseek / score

Tools to work with GWAS-VCF summary statistics files
MIT License
94 stars 6 forks source link

liftover: should GT stay the same when strand is flipped? #13

Closed gpertea closed 1 month ago

gpertea commented 1 month ago

Excellent collection of plugins, thank you for your work. About the liftover plugin which I started to use heavily in the last few days, unless my understanding of the meaning of GT field notation is flawed, I don't think that the GT fields should be "complemented" when the strand is flipped after conversion.

Here is an example of some genotype data with rs1287637 being converted from GRCh37/hg19 (original SNP array data) to hg38. The reference base for this SNV happens to change strands (so it is accordingly marked with SWAP=1 in the INFO field, which is great). The original record looks like this (only first 9 sample columns shown):

$ bcftools view -H gt_5M_gencalls_n208.bcf 1:5935162 | cut -f1-5,8,10-18
1   5935162 kgp1477084  A   T   .   1/1 0/1 1/1 0/1 1/1 1/1 1/1 0/1 1/1

After the bcftools +liftover to hg38:

$ bcftools view -H hg38_gt_5M.bcf chr1:5875102 | cut -f1-5,8,10-18
chr1    5875102 kgp1477084  T   A   SWAP=1  0/0 1/0 0/0 1/0 0/0 0/0 0/0 1/0 0/0

My understanding is that the notation 0/0 refers to the reference base, whatever that is (A in the old genome build, T in the new one). If that reference base changes between genome builds, its meaning should stay the same, correct? i.e. the 0/0 genotype should still be left as 0/0, instead of becoming homozygous for the alt allele, which can alter AF calculations etc.

freeseek commented 1 month ago

You can see that the position of the SNP does not change strand:

$ echo -e "chr1\t5935161\t5935162\tx\t0\t+" | liftOver /dev/stdin hg19ToHg38.over.chain.gz /dev/stdout /dev/stderr
Reading liftover chains
Mapping coordinates
chr1    5875101 5875102 x   0   +

Rather what is going on is that the two references represent different alleles:

$ samtools faidx hg19.fa chr1:5935162-5935162
>chr1:5935162-5935162
A
$ samtools faidx hg38.fa chr1:5875102-5875102
>chr1:5875102-5875102
T

This means that the two references represent different alleles at that SNP and therefore the genotype has to change. When the strand is reversed you will find the boolean INFO flag FLIP rather than the integer INFO flag SWAP

gpertea commented 1 month ago

Thank you for the clarification -- I see now that I've been unaware of the distinction between flipping and swapping and this is one of those cases where A/T can be both, at first sight.. And got tempted by the former, without checking the chain file - my bad!

Meanwhile I just looked at the sequence around this site in both genome builds which confirmed what you just explained.. that's really a swap, not a flip.

>1|5935154-5935170
GCTGCGCCAGCAGACAA
>chr1|5875094-5875110
GCTGCGCCTGCAGACAA

Thank you for your prompt reply!