Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0
148 stars 25 forks source link

liftover weird behaviour #104

Open gmauro opened 1 month ago

gmauro commented 1 month ago

Hello @Cloufield I'm getting a weird error when performing a liftover that I can't explain. Gwaslab is unable to lift 4 SNPs, while when I use the liftover librarydirectly it works. Can you help me find out what's going on? Thank you

These are script and data files to reproduce the problem: test.py.txt unmapped_fromLB.txt

2024/08/09 10:15:28 GWASLab v3.4.47 https://cloufield.github.io/gwaslab/ 2024/08/09 10:15:28 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com 2024/08/09 10:15:28 Start to initialize gl.Sumstats from file :unmapped_fromLB.txt 2024/08/09 10:15:28 -Reading columns : EAF,EA,NEA,CHR,POS,MLOG10P,SE,SNPID,BETA 2024/08/09 10:15:28 -Renaming columns to : EAF,EA,NEA,CHR,POS,MLOG10P,SE,SNPID,BETA 2024/08/09 10:15:28 -Current Dataframe shape : 4 x 9 2024/08/09 10:15:28 -Initiating a status column: STATUS ... 2024/08/09 10:15:28 #WARNING! Version of genomic coordinates is unknown... 2024/08/09 10:15:29 -NEAF is specified... 2024/08/09 10:15:29 -Checking if 0<= NEAF <=1 ... 2024/08/09 10:15:29 -Converted NEAF to EAF. 2024/08/09 10:15:29 -Removed 0 variants with bad NEAF. 2024/08/09 10:15:29 Start to reorder the columns...v3.4.47 2024/08/09 10:15:29 -Current Dataframe shape : 4 x 10 ; Memory usage: 19.94 MB 2024/08/09 10:15:29 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,MLOG10P,STATUS 2024/08/09 10:15:29 Finished reordering the columns. 2024/08/09 10:15:29 -Column : SNPID CHR POS EA NEA EAF BETA SE MLOG10P STATUS 2024/08/09 10:15:29 -DType : object string int64 category category float64 float64 float64 float64 category 2024/08/09 10:15:29 -Verified: T F T T T T T T T T 2024/08/09 10:15:29 #WARNING! Columns with possibly incompatible dtypes: CHR 2024/08/09 10:15:29 -Current Dataframe memory usage: 19.94 MB 2024/08/09 10:15:29 Finished loading data successfully! Gwaslab: Number of SNPs before liftover: 4 2024/08/09 10:15:36 Start to perform liftover...v3.4.47 2024/08/09 10:15:36 -Current Dataframe shape : 4 x 10 ; Memory usage: 19.94 MB 2024/08/09 10:15:36 -Number of threads/cores to use: 3 2024/08/09 10:15:36 -Creating converter : hg19 to hg38 2024/08/09 10:15:36 -Converting variants with status code xxx0xxx :4... 2024/08/09 10:15:36 -Removed unmapped variants: 4 2024/08/09 10:15:36 Start to fix chromosome notation (CHR)...v3.4.47 2024/08/09 10:15:36 -Current Dataframe shape : 0 x 10 ; Memory usage: 3.82 MB 2024/08/09 10:15:36 -Checking CHR data type... 2024/08/09 10:15:36 -Variants with standardized chromosome notation: 0 2024/08/09 10:15:36 -All CHR are already fixed... 2024/08/09 10:15:37 Finished fixing chromosome notation (CHR). 2024/08/09 10:15:37 Start to fix basepair positions (POS)...v3.4.47 2024/08/09 10:15:37 -Current Dataframe shape : 0 x 10 ; Memory usage: 3.82 MB 2024/08/09 10:15:37 -Converting to Int64 data type ... 2024/08/09 10:15:38 -Position bound:(0 , 250,000,000) 2024/08/09 10:15:38 -Removed outliers: 0 2024/08/09 10:15:38 -Removed 0 variants with bad positions. 2024/08/09 10:15:38 Finished fixing basepair positions (POS). 2024/08/09 10:15:38 Finished liftover. Gwaslab: Number of SNPs after liftover: 0

The following instead, making direct use of the Liftover library

19:54784936 has been lifted to [('chr19', 54281081, '+')] 19:54815387 has been lifted to [('chr19', 54304112, '+')] 19:54813762 has been lifted to [('chr19', 54302487, '+')] 22:24334948 has been lifted to [('chr22', 23992755, '+')]

Cloufield commented 1 month ago

Hi @gmauro, thanks for the feedback. I replicated this. But I think the problem might be related to 0/1 coordinate systems (https://www.biostars.org/p/84686/). POS in sumstats is 1-based but for liftover we need 0-based coordinates. (gwaslab converts 1-based POS to 0-based POS and then calls converter, after getting 0-based results, gwaslab converts it back to 1-based.)

When you called liftover directly, you used converter = get_lifter('hg19', 'hg38', one_based=False). one_based=False (the default setting I think) means that your query is 0-based but the sumstats are actually 1-based. By one_based=False, converter["19"][54784936] actually converted the position of the variant 19:54784937:X:X instead of 19:54784936:X:X. Theoretically, you might need to use converter["19"][54784936-1] for 19:54784936 or set one_based=True. But in that case, the results are empty.