Closed gpertea closed 1 month ago
You can see that the position of the SNP does not change strand:
$ echo -e "chr1\t5935161\t5935162\tx\t0\t+" | liftOver /dev/stdin hg19ToHg38.over.chain.gz /dev/stdout /dev/stderr
Reading liftover chains
Mapping coordinates
chr1 5875101 5875102 x 0 +
Rather what is going on is that the two references represent different alleles:
$ samtools faidx hg19.fa chr1:5935162-5935162
>chr1:5935162-5935162
A
$ samtools faidx hg38.fa chr1:5875102-5875102
>chr1:5875102-5875102
T
This means that the two references represent different alleles at that SNP and therefore the genotype has to change. When the strand is reversed you will find the boolean INFO flag FLIP rather than the integer INFO flag SWAP
Thank you for the clarification -- I see now that I've been unaware of the distinction between flipping and swapping and this is one of those cases where A/T can be both, at first sight.. And got tempted by the former, without checking the chain file - my bad!
Meanwhile I just looked at the sequence around this site in both genome builds which confirmed what you just explained.. that's really a swap, not a flip.
>1|5935154-5935170
GCTGCGCCAGCAGACAA
>chr1|5875094-5875110
GCTGCGCCTGCAGACAA
Thank you for your prompt reply!
Excellent collection of plugins, thank you for your work. About the
liftover
plugin which I started to use heavily in the last few days, unless my understanding of the meaning of GT field notation is flawed, I don't think that the GT fields should be "complemented" when the strand is flipped after conversion.Here is an example of some genotype data with rs1287637 being converted from GRCh37/hg19 (original SNP array data) to hg38. The reference base for this SNV happens to change strands (so it is accordingly marked with SWAP=1 in the INFO field, which is great). The original record looks like this (only first 9 sample columns shown):
After the bcftools +liftover to hg38:
My understanding is that the notation 0/0 refers to the reference base, whatever that is (A in the old genome build, T in the new one). If that reference base changes between genome builds, its meaning should stay the same, correct? i.e. the 0/0 genotype should still be left as 0/0, instead of becoming homozygous for the alt allele, which can alter AF calculations etc.