Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0
162 stars 25 forks source link

get_lead 'boolean value of NA' issue localising to util_in_get_sig.py #103

Closed krishnaxamin closed 4 months ago

krishnaxamin commented 4 months ago

Hi,

Great package - so nice being able to do all this in Python.

Code and error detailed below:

# convert SAIGE data into gwaslab format
saige_gwaslab = saige_corrected.copy()
saige_gwaslab['SNPID'] = saige_gwaslab['CHR'].astype(str) + ':' + saige_gwaslab[
            'POS'].astype(str) + ':' + saige_gwaslab['Allele2'] + ':' + saige_gwaslab['Allele1']
saige_gwaslab.rename(columns={'Allele1': 'NEA', 'Allele2': 'EA', 'adjusted_pval': 'P'}, inplace=True)
saige_gwaslab = saige_gwaslab[['SNPID', 'CHR', 'POS', 'EA', 'NEA', 'P']].copy()
saige_gwaslab['CHR'] = saige_gwaslab['CHR'].astype(str)
saige_gwaslab = saige_gwaslab[(saige_gwaslab['CHR'] != '6') |
                                  (saige_gwaslab['POS'] < 28477797) |
                                  (saige_gwaslab['POS'] > 33448354)].copy()
saige_sumstats = gl.Sumstats(saige_gwaslab, 
                             snpid='SNPID', chrom='CHR', pos='POS', ea='EA', nea='NEA', p='P', 
                             build='19')
2024/07/26 15:53:58 GWASLab v3.4.40 https://cloufield.github.io/gwaslab/
2024/07/26 15:53:58 (C) 2022-2024, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
2024/07/26 15:53:58 Start to initialize gl.Sumstats from pandas DataFrame ...
2024/07/26 15:54:00  -Reading columns          : NEA,EA,POS,SNPID,CHR,P
2024/07/26 15:54:00  -Renaming columns to      : NEA,EA,POS,SNPID,CHR,P
2024/07/26 15:54:00  -Current Dataframe shape : 9749581  x  6
2024/07/26 15:54:01  -Initiating a status column: STATUS ...
2024/07/26 15:54:01  -Genomic coordinates are based on GRCh37/hg19...
2024/07/26 15:54:03 Start to reorder the columns...v3.4.40
2024/07/26 15:54:03  -Current Dataframe shape : 9749581 x 7 ; Memory usage: 487.47 MB
2024/07/26 15:54:03  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,P,STATUS
2024/07/26 15:54:04 Finished reordering the columns.
2024/07/26 15:54:04  -Column  : SNPID  CHR    POS   EA       NEA      P       STATUS  
2024/07/26 15:54:04  -DType   : object string int64 category category float64 category
2024/07/26 15:54:04  -Verified: T      F      T     T        T        T       T       
2024/07/26 15:54:04  #WARNING! Columns with possibly incompatible dtypes: CHR
2024/07/26 15:54:04  -Current Dataframe memory usage: 487.47 MB
2024/07/26 15:54:04 Finished loading data successfully!
saige_lead = saige_sumstats.get_lead(anno=True, build='19', sig_level=7.35e-10)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[19], line 2
      1 # lead variants - SAIGE
----> 2 saige_lead = saige_sumstats.get_lead(anno=True, build='19', sig_level=7.35e-10)
      3 saige_lead

File ~/.conda/envs/gwaslab_venv/lib/python3.9/site-packages/gwaslab/g_Sumstats.py:617, in Sumstats.get_lead(self, build, gls, **args)
    614 if build is None:
    615     build = self.meta["gwaslab"]["genome_build"]
--> 617 output = getsig(self.data,
    618                    id=id_to_use,
    619                    chrom="CHR",
    620                    pos="POS",
    621                    p="P",
    622                    log=self.log,
    623                    build=build,
    624                    **args)
    625 # return sumstats object    
    626 if gls == True:

File ~/.conda/envs/gwaslab_venv/lib/python3.9/site-packages/gwaslab/util_in_get_sig.py:115, in getsig(insumstats, id, chrom, pos, p, mlog10p, scaled, use_p, windowsizekb, sig_level, log, xymt, anno, wc_correction, build, source, verbose)
    112 #iterate through all significant snps
    113 for line_number,(index, row) in enumerate(sumstats_sig.iterrows()):
    114     #when finished one chr 
--> 115     if row[chrom]!=current_sig_chr:
    116         #add the current lead variants id to lead variant list
    117         if current_sig_index is not False:sig_index_list.append(current_sig_index)
    119         #update lead vairant info to the new variant

File missing.pyx:392, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

I have checked the 'CHR' values - no NANs present. The code works fine with other summary stats I am using which have the same format.

I am using v3.4.40 as set up using the provided yaml file.

Any tips? I can provide the input datafile if needed.

Thanks!

Cloufield commented 4 months ago

Hi, It is possible that there are some unrecognized notations in your CHR column (like chrX or sonething). Could you try running saige_sumstats.fix_chr() before saige_sumstats.get_lead() ? I think it may detect and fix those unrecognized notations .

krishnaxamin commented 4 months ago

Hi, Thanks for the reply. Unfortunately, the error persists after using fix_chr(). fix_chr() does pick up on 'XY', but as I said the error persists, and the other datasets I am using also have 'XY', with no issues.

Fyi, here are the values within the CHR column of saige_gwaslabbefore it is converted to saige_sumstats.

saige_gwaslab['CHR'].unique()
array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
       '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'X',
       'XY'], dtype=object)
Cloufield commented 4 months ago

Hi, Thanks for the additional information. XY was not recognized by gwaslab and subsequently converted to NA value. I think in this case the solution for now is to manually convert all XY to X. gwaslab does not distinguish PAR and non-PAR regions for chrX. The reason why this error only occurred for this dataset is that this one contains significant associations in XY while others do not.

krishnaxamin commented 4 months ago

Hi, I see - thank you! That's worked now.
I assume, from fix_chr(), that gwaslab wants numeric chromosome codes in general, e.g. it will want '23' rather than 'X' for chrX. If so, it may be nice to have that noted somewhere in the wiki, just for convenience - apologies if it is already there and I missed it. Thanks for your help!