Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0
151 stars 25 forks source link

Brisbane plotting error regarding 'SNPID' #40

Closed swvanderlaan closed 1 year ago

swvanderlaan commented 1 year ago

I am running your script using the following data (with ±77 million rows):

rsID CHR POS EA NEA EAF BETA SE P INFO N STATUS EnsemblID tss_distance ma_count ma_samples
rs12238997 1 693731 G A 0.134185 0.005953 0.039917 0.881492 0.624501 626.000020 1999999 ENSG00000187634 -240519 168 153
rs149887893 1 714596 C T 0.038339 -0.015369 0.073635 0.834746 0.577564 626.000011 1999999 ENSG00000187634 -219654 48 48
rs12184277 1 715367 G A 0.038339 -0.015369 0.073635 0.834746 0.671960 626.000011 1999999 ENSG00000187634 -218883 48 48
rs12184279 1 717485 A C 0.038339 -0.015369 0.073635 0.834746 0.670147 626.000011 1999999 ENSG00000187634 -216765 48 48
rs116801199 1 720381 T G 0.038339 -0.015369 0.073635 0.834746 0.667536 626.000011 1999999 ENSG00000187634 -213869 48 48

This is my command:

eqtl_cis_sumstats.plot_mqq(
    skip=2,
    cut=20,
    sig_line=False,
    anno="GENENAME",
    anno_style="right",
    windowsizekb=100,
    arm_offset=2,
    repel_force=0.01,  # default 0.01
    use_rank=True,
    build="19",
    mode="b",
    # stratified=True,
    figargs={"figsize": (25, 15), "dpi": 100},
    title="cis-eQTLs in carotid plaque",
    save=os.path.join(PLOTS_loc, "brisbane.100kb.300dpi.pdf"),
    saveargs={"dpi": 300},
    verbose=True,
)

This is the log:

Fri Jul  7 14:36:23 2023 Start to plot manhattan/qq plot with the following basic settings:
Fri Jul  7 14:36:23 2023  -Genomic coordinates version: 19...
Fri Jul  7 14:36:23 2023    -WARNING!!! Genomic coordinates version is unknown...
Fri Jul  7 14:36:23 2023  -Genome-wide significance level is set to 1e-05 ...
Fri Jul  7 14:36:23 2023  -Raw input contains 77285276 variants...
Fri Jul  7 14:36:23 2023  -Plot layout mode is : b
Fri Jul  7 14:36:54 2023 Finished loading specified columns from the sumstats.
Fri Jul  7 14:36:54 2023 Start conversion and sanity check:
Fri Jul  7 14:36:57 2023  -Removed 0 variants with nan in CHR or POS column ...
Fri Jul  7 14:36:57 2023  -Calculating DENSITY with windowsize of  100  kb

This is the error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File [~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/core/indexes/base.py:3629](https://file+.vscode-resource.vscode-cdn.net/Users/slaan3/git/CirculatoryHealth/molqtl/~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/core/indexes/base.py:3629), in Index.get_loc(self, key, method, tolerance)
   3628 try:
-> 3629     return self._engine.get_loc(casted_key)
   3630 except KeyError as err:

File [~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/_libs/index.pyx:136](https://file+.vscode-resource.vscode-cdn.net/Users/slaan3/git/CirculatoryHealth/molqtl/~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/_libs/index.pyx:136), in pandas._libs.index.IndexEngine.get_loc()

File [~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/_libs/index.pyx:163](https://file+.vscode-resource.vscode-cdn.net/Users/slaan3/git/CirculatoryHealth/molqtl/~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/_libs/index.pyx:163), in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'SNPID'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 eqtl_cis_sumstats.plot_mqq(
      2     skip=2,
      3     cut=20,
      4     sig_line=False,
      5     sig_level=1e-5,  # 0.05 [/](https://file+.vscode-resource.vscode-cdn.net/) 5000, # Bonferroni correction for number of variants per gene tested, estimated from experience
      6     anno="GENENAME",
      7     anno_style="right",
      8     windowsizekb=100,
      9     arm_offset=2,
     10     repel_force=0.01,  # default 0.01
     11     use_rank=True,
     12     build="19",
     13     mode="b",
     14     # stratified=True,
     15     figargs={"figsize": (25, 15), "dpi": 100},
     16     title="cis-eQTLs in carotid plaque",
     17     save=os.path.join(PLOTS_loc, "brisbane.100kb.300dpi.pdf"),
     18     saveargs={"dpi": 300},
     19     verbose=True,
     20 )

File [~/mambaforge/envs/gwas/lib/python3.8/site-packages/gwaslab/Sumstats.py:460](https://file+.vscode-resource.vscode-cdn.net/Users/slaan3/git/CirculatoryHealth/molqtl/~/mambaforge/envs/gwas/lib/python3.8/site-packages/gwaslab/Sumstats.py:460), in Sumstats.plot_mqq(self, build, **args)
    457 if build is None:
    458     build = self.meta["gwaslab"]["genome_build"]
--> 460 plot = mqqplot(self.data,
    461                snpid=snpid, 
    462                chrom=chrom, 
    463                pos=pos, 
    464                p=p, 
    465                eaf=eaf,
    466                build = build, 
    467                **args)
    469 return plot

File [~/mambaforge/envs/gwas/lib/python3.8/site-packages/gwaslab/mqqplot.py:468](https://file+.vscode-resource.vscode-cdn.net/Users/slaan3/git/CirculatoryHealth/molqtl/~/mambaforge/envs/gwas/lib/python3.8/site-packages/gwaslab/mqqplot.py:468), in mqqplot(insumstats, chrom, pos, p, snpid, eaf, ea, nea, chr_dict, xtick_chr_dict, vcf_path, vcf_chr_dict, gtf_path, gtf_chr_dict, gtf_gene_name, rr_path, rr_header_dict, rr_chr_dict, rr_lim, mlog10p, scaled, mode, scatter_kwargs, qq_scatter_kwargs, qq_line_color, region, region_ref, region_ref2, region_step, region_grid, region_grid_line, region_lead_grid, region_lead_grid_line, region_hspace, region_ld_threshold, region_ld_colors, region_ld_colors1, region_ld_colors2, region_recombination, region_protein_coding, region_flank_factor, region_anno_bbox_args, taf, tabix, mqqratio, bwindowsizekb, density_color, density_range, density_trange, density_threshold, density_tpalette, density_palette, windowsizekb, anno, anno_set, anno_alias, anno_d, anno_args, anno_style, anno_fixed_arm_length, anno_source, anno_adjust, anno_max_iter, arm_offset, arm_scale, arm_scale_d, cut, skip, ystep, ylabels, ytick3, cutfactor, cut_line_color, cut_log, jagged, jagged_len, jagged_wid, sig_line, sig_level, sig_level_lead, sig_line_color, suggestive_sig_line, suggestive_sig_level, suggestive_sig_line_color, additional_line, additional_line_color, sc_linewidth, highlight, highlight_color, highlight_windowkb, highlight_anno_args, pinpoint, pinpoint_color, stratified, maf_bins, maf_bin_colors, gc, include_chrXYMT, ylim, xpad, chrpad, drop_chr_start, title, mtitle, qtitle, title_pad, title_fontsize, fontsize, font_family, anno_fontsize, figargs, colors, marker_size, use_rank, verbose, repel_force, build, dpi, save, saveargs, _invert, expected_min_mlog10p, log)
    466 sumstats = sumstats.sort_values(by="TCHR+POS")
    467 for index,row in sumstats.iterrows():
--> 468     stack.append([row["SNPID"],row["TCHR+POS"],0])  
    469     for i in range(2,len(stack)+1):
    470         if stack[-i][1]>= (row["TCHR+POS"]- 1000*bwindowsizekb):

File [~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/core/series.py:958](https://file+.vscode-resource.vscode-cdn.net/Users/slaan3/git/CirculatoryHealth/molqtl/~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/core/series.py:958), in Series.__getitem__(self, key)
    955     return self._values[key]
    957 elif key_is_scalar:
--> 958     return self._get_value(key)
    960 if is_hashable(key):
    961     # Otherwise index.get_value will raise InvalidIndexError
    962     try:
    963         # For labels that don't resolve as scalars like tuples and frozensets

File [~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/core/series.py:1069](https://file+.vscode-resource.vscode-cdn.net/Users/slaan3/git/CirculatoryHealth/molqtl/~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/core/series.py:1069), in Series._get_value(self, label, takeable)
   1066     return self._values[label]
   1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
   1070 return self.index._get_values_for_loc(self, loc, label)

File [~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/core/indexes/base.py:3631](https://file+.vscode-resource.vscode-cdn.net/Users/slaan3/git/CirculatoryHealth/molqtl/~/mambaforge/envs/gwas/lib/python3.8/site-packages/pandas/core/indexes/base.py:3631), in Index.get_loc(self, key, method, tolerance)
   3629     return self._engine.get_loc(casted_key)
   3630 except KeyError as err:
-> 3631     raise KeyError(key) from err
   3632 except TypeError:
   3633     # If we have a listlike key, _check_indexing_error will raise
   3634     #  InvalidIndexError. Otherwise we fall through and re-raise
   3635     #  the TypeError.
   3636     self._check_indexing_error(key)

KeyError: 'SNPID'
Cloufield commented 1 year ago

Hi, Sorry for the error. It seems that header for snpid is mistakenly hard coded. This will be fixed in 3.4.20. For now, an alternative way here is to just load the sumstats using rsID as SNPID like: sumstats = gl.Sumstats(data, snpid="rsID",...)

Example:

mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
             snpid="rsID",
             chrom="CHR",
             pos="POS",
             p="P")

# plot the density using variants with p<1e-5 with a windowsize of 100kb.

mysumstats.filter_value('P < 1e-5').plot_mqq(
    sig_level=1e-5,    # used for selecting variants for annotation
    sig_line=False,
    anno="GENENAME",
    anno_style="right",
    bwindowsizekb=100,     #bwindowsizekb for setting the windowsize for calculating density 
    arm_offset=2,
    repel_force=0.01,  # default 0.01
    use_rank=True,
    build="19",
    mode="b",
    figargs={"figsize": (25, 15), "dpi": 100},
    title="cis-eQTLs in carotid plaque",
    save="my_plot.pdf",
    saveargs={"dpi": 300},
    verbose=True,
)

image

By the way, brisbane plot is used to show the density of independent signals according to Yengo. et al Nature 2022 .

Please note that if loading the entire dataset, gwaslab will just calculate the density and make a density plot for all variants you included in the Sumstats like the example. (not sure if this is what you expected).
You may need to determine the independent signals (or the set of variants you want to include) first using other methods like conditional analysis and then load the results for plotting.

Cloufield commented 1 year ago

Hi, the error is fixed in v3.4.20. Thanks.