Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0
118 stars 22 forks source link

Improve fix for when POS is not on ref_seq #91

Closed sup3rgiu closed 2 months ago

sup3rgiu commented 2 months ago

Improve fix implemented in https://github.com/Cloufield/gwaslab/issues/88.

The previous implementation tried to index a numpy array using a pandas series. This could easily be solved by casting what was needed to a numpy array. However, I think this is a cleaner solution that also avoids the for-loop.

The equivalence between the 's' mode and the 'v' mode can be checked using the following code:

import gwaslab as gl
mysumstats = gl.Sumstats('/path/to/to_harmonize.tsv', fmt="gwaslab", other=["NOTE"])
df_orig = mysumstats.data.copy()

mysumstats.harmonize(
    basic_check=True,
    n_cores=4,
    ref_seq=gl.get_path("ucsc_genome_hg19"),
    #ref_infer=gl.get_path("1kg_eas_hg19"),
    ref_alt_freq="AF",
    ref_seq_mode='s'
)
df_harmonize_s = mysumstats.data.copy()
mysumstats.data = df_orig.copy()

mysumstats.harmonize(
    basic_check=True,
    n_cores=4,
    ref_seq=gl.get_path("ucsc_genome_hg19"),
    #ref_infer=gl.get_path("1kg_eas_hg19"),
    ref_alt_freq="AF",
    ref_seq_mode='v'
)
df_harmonize_v = mysumstats.data.copy()
mysumstats.data = df_orig.copy()

df_harmonize_v['STATUS'].equals(df_harmonize_s['STATUS'])

Where the content of to_harmonize.tsv is:

SNPID   CHR POS EA  NEA EAF BETA    SE  P   N   DIRECTION   NOTE
1:1066403:T:C   1   1066403 C   T   0.5276  0.0043  0.0109  0.6891  191764  ++-+    Clean
1:249250625:T:C 1   249250625   C   T   0.5276  0.0043  0.0109  0.6891  191764  ++-+    Out of bound POS
1:1163266:G:A   1   1163266 G   A   0.8355  0.0122  0.012   0.3081  191764  -+++    Flip
1:1213224:T:A   1   1213224 T   A   0.3345  0.0071  0.0095  0.4528  191764  +++-    Flip + Palindromic
1:3066761:A:T   1   3066761 T   A   0.1245  -0.0066 0.0157  0.6742  191764  -+++    Palindromic + No flip
1:3997271:T:TTTTA   1   3997271 TTTTA   T   0.384   0.0188  0.0135  0.163   191764  ++++    Indel + Both on genome + No flip
1:4007705:TC:T  1   4007705 TC  T   0.4368  0.0306  0.0125  0.01421 191764  ++++    Indel  + Both on genome + Flip
1:5164281:A:AAGAG   1   5164281 AAGAG   A   0.9871  0.0167  0.0454  0.712   191764  -+++    Indel 
1:12799025:C:CCT    1   12799025    C   CCT 0.173   -0.0186 0.0118  0.115   191764  ---+    Indel + Flip
1:19323015:T:A  1   19323015    T   A   0.4389  -0.0165 0.0088  0.05949 191764  ----    Palindromic + Flip + Indistinguishable
1:22841855:A:T  1   22841855    T   A   0.4527  0.008   0.0095  0.3998  191764  -++-    Palindromic + Indistinguishable
3:183629306:T:TTCTC 3   183629306   TTCTC   T   0.0082  -0.0083 0.0555  0.8806  191764  +--+    Indel  + Both on genome + No information
4:99731866:G:C  4   99731866    G   C   0.3883  0.0052  0.0093  0.5736  191764  +-++    Palindromic+ Flip + Different strand (REF G, TOMMO G 0.38)
8:89935201:C:G  8   89935201    G   C   0.3533  0.0103  0.0094  0.2756  191764  ++++    Palindromic + Different strand (REF A; TOMMO G 0.34)
X:4206466:G:C   X   4206466 G   C   0.8727  -0.0055 0.0104  0.5984  191764  -+--    Palindromic + Flip + No information
X:5053229:A:T   X   5053229 T   A   0.9747  0.0218  0.0225  0.3318  191764  ++++    Palindromic + No information
X:155270565:A:TT    X   155270565   TT  A   0.9747  0.0218  0.0225  0.3318  191764  ++++    Out of bound POS

Where there are two variants with POS out of bound.

Cloufield commented 2 months ago

Hi, thank you so much for the fix. I will merge it and change the default mode back to "v" when I upload the next version.