Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0
119 stars 22 forks source link

Can't get regional plot working when using data in build HG38 #11

Open h-leonard opened 1 year ago

h-leonard commented 1 year ago

I'm trying to use the mode ='r' for a regional plot with plot_mqq(). I downloaded the gtf and vcf files that are in hg38 from the gwaslab Github using gl.download_ref. However, I can't seem to get these files to work as when I try to call them with gl.get_path() it gives me errors like "Not a gzipped file (b'<!')" Here is one of my many outputs and commands, I've tried every combination of files and flags:

sumstats.plot_mqq(mode="r", region=(14,700000,7002500),region_grid=True, gtf_path=gl.get_path("ensembl_hg38_gtf"))

Wed Apr 12 12:42:08 2023 Start to plot manhattan/qq plot with the following basic settings: Wed Apr 12 12:42:08 2023 -Genome-wide significance level is set to 5e-08 ... Wed Apr 12 12:42:08 2023 -Raw input contains 8565559 variants... Wed Apr 12 12:42:08 2023 -Plot layout mode is : r Wed Apr 12 12:42:08 2023 -Region to plot : chr14:700000-7002500. Wed Apr 12 12:42:11 2023 -Extract SNPs in region : chr14:700000-7002500... Wed Apr 12 12:42:15 2023 -Extract SNPs in specified regions: 1586 Wed Apr 12 12:42:15 2023 Finished loading specified columns from the sumstats. Wed Apr 12 12:42:15 2023 Start conversion and sanity check: Wed Apr 12 12:42:15 2023 -Removed 0 variants with nan in CHR or POS column ... Wed Apr 12 12:42:15 2023 -Removed 0 variants with nan in P column ... Wed Apr 12 12:42:15 2023 -P values are being converted to -log10(P)... Wed Apr 12 12:42:15 2023 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Wed Apr 12 12:42:15 2023 -Sanity check: 0 na/inf/-inf variants will be removed... Wed Apr 12 12:42:16 2023 -Maximum -log10(P) values is 10.427930376302866 . Wed Apr 12 12:42:16 2023 Finished data conversion and sanity check. Wed Apr 12 12:42:16 2023 Start to create manhattan plot with 1586 variants: Wed Apr 12 12:42:16 2023 -Loading gtf files from:/home/leonardhl/.local/lib/python3.9/site-packages/gwaslab/data/Homo_sapiens.GRCh38.108.chr.gtf.gz

BadGzipFile Traceback (most recent call last) Cell In[56], line 1 ----> 1 sumstats.plot_mqq(mode="r", region=(14,700000,7002500),region_grid=True, gtf_path=gl.get_path("ensembl_hg38_gtf"))

File ~/.local/lib/python3.9/site-packages/gwaslab/Sumstats.py:396, in Sumstats.plot_mqq(self, args) 393 else: 394 eaf=None --> 396 plot = mqqplot(self.data,snpid=snpid, chrom=chrom, pos=pos, p=p, eaf=eaf,args) 398 return plot

File ~/.local/lib/python3.9/site-packages/gwaslab/mqqplot.py:579, in mqqplot(insumstats, chrom, pos, p, snpid, eaf, chr_dict, vcf_path, vcf_chr_dict, gtf_path, gtf_chr_dict, gtf_gene_name, rr_path, rr_header_dict, mlog10p, scaled, mode, scatter_kwargs, region, region_step, region_grid, region_grid_line, region_lead_grid, region_lead_grid_line, region_hspace, region_ld_threshold, region_ld_colors, region_recombination, region_protein_coding, region_flank_factor, region_anno_bbox_args, taf, tabix, mqqratio, bwindowsizekb, large_number, windowsizekb, anno, anno_set, anno_alias, anno_d, anno_args, anno_fixed_arm_length, anno_source, arm_offset, arm_scale, arm_scale_d, cut, skip, cutfactor, cut_line_color, sig_line, sig_level, sig_line_color, suggestive_sig_level, highlight, highlight_color, highlight_windowkb, pinpoint, pinpoint_color, stratified, maf_bins, maf_bin_colors, gc, include_chrXYMT, title, mtitle, qtitle, figargs, fontsize, colors, marker_size, use_rank, verbose, repel_force, build, title_pad, dpi, save, saveargs, _invert, log) 576 if (gtf_path is not None ) and ("r" in mode): 577 # load gtf 578 if verbose: log.write(" -Loading gtf files from:" + gtf_path) --> 579 uniq_gene_region,exons = process_gtf(gtf_path = gtf_path , 580 region = region, 581 region_flank_factor = region_flank_factor, 582 build=build, 583 region_protein_coding=region_protein_coding, 584 gtf_chr_dict=gtf_chr_dict, 585 gtf_gene_name=gtf_gene_name) 587 n_uniq_stack = uniq_gene_region["stack"].nunique() 588 stack_num_to_plot = max(taf[0],n_uniq_stack)

File ~/.local/lib/python3.9/site-packages/gwaslab/mqqplot.py:1069, in process_gtf(gtf_path, region, region_flank_factor, build, region_protein_coding, gtf_chr_dict, gtf_gene_name) 1067 gtf = get_gtf(chrom=to_query_chrom,build=build,source="refseq") 1068 else: -> 1069 gtf = pd.read_csv(gtf_path,sep="\t",header=None, comment="#",low_memory=False,dtype={0:"string"}) 1070 gtf = gtf.loc[gtf[0]==gtf_chr_dict[region[0]],:] 1071 #filter in region 1072 #gtf_chr_dict

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.._deprecate_kwarg..wrapper(*args, *kwargs) 209 else: 210 kwargs[new_arg_name] = new_arg_value --> 211 return func(args, **kwargs)

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments..decorate..wrapper(*args, *kwargs) 325 if len(args) > num_allow_args: 326 warnings.warn( 327 msg.format(arguments=_format_argument_list(allow_args)), 328 FutureWarning, 329 stacklevel=find_stack_level(), 330 ) --> 331 return func(args, **kwargs)

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options) 935 kwds_defaults = _refine_defaults_read( 936 dialect, 937 delimiter, (...) 946 defaults={"delimiter": ","}, 947 ) 948 kwds.update(kwds_defaults) --> 950 return _read(filepath_or_buffer, kwds)

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/io/parsers/readers.py:605, in _read(filepath_or_buffer, kwds) 602 _validate_names(kwds.get("names", None)) 604 # Create the parser. --> 605 parser = TextFileReader(filepath_or_buffer, **kwds) 607 if chunksize or iterator: 608 return parser

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1442, in TextFileReader.init(self, f, engine, **kwds) 1439 self.options["has_index_names"] = kwds["has_index_names"] 1441 self.handles: IOHandles | None = None -> 1442 self._engine = self._make_engine(f, self.engine)

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1753, in TextFileReader._make_engine(self, f, engine) 1750 raise ValueError(msg) 1752 try: -> 1753 return mapping[engine](f, **self.options) 1754 except Exception: 1755 if self.handles is not None:

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py:79, in CParserWrapper.init(self, src, kwds) 76 kwds.pop(key, None) 78 kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None)) ---> 79 self._reader = parsers.TextReader(src, kwds) 81 self.unnamed_cols = self._reader.unnamed_cols 83 # error: Cannot determine type of 'names'

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:547, in pandas._libs.parsers.TextReader.cinit()

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:752, in pandas._libs.parsers.TextReader._get_header()

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:852, in pandas._libs.parsers.TextReader._tokenize_rows()

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:1965, in pandas._libs.parsers.raise_parser_error()

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/_compression.py:68, in DecompressReader.readinto(self, b) 66 def readinto(self, b): 67 with memoryview(b) as view, view.cast("B") as byte_view: ---> 68 data = self.read(len(byte_view)) 69 byte_view[:len(data)] = data 70 return len(data)

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/gzip.py:487, in _GzipReader.read(self, size) 483 if self._new_member: 484 # If the _new_member flag is set, we have to 485 # jump to the next member, if there is one. 486 self._init_read() --> 487 if not self._read_gzip_header(): 488 self._size = self._pos 489 return b""

File /usr/local/Anaconda/envs/py3.9/lib/python3.9/gzip.py:435, in _GzipReader._read_gzip_header(self) 432 return False 434 if magic != b'\037\213': --> 435 raise BadGzipFile('Not a gzipped file (%r)' % magic) 437 (method, flag, 438 self._last_mtime) = struct.unpack("<BBIxx", self._read_exact(8)) 439 if method != 8:

BadGzipFile: Not a gzipped file (b'<!')

Cloufield commented 1 year ago

Hi, Sorry that the current documents might be outdated. (I will update soon.) If using gwaslab gtf, there is no need to use gtf_path=gl.get_path("ensembl_hg38_gtf"), just specify build="38". gwaslab will automatically download and process the gtf files (raw gtf files need formatting first) like:

mysumstats.plot_mqq(skip=2,
                    mode="r",
                    region=(7,126253550,128253550),
                    build="38")

Thu Apr 13 10:21:15 2023 Start to plot manhattan/qq plot with the following basic settings:
Thu Apr 13 10:21:15 2023  -Genome-wide significance level is set to 5e-08 ...
Thu Apr 13 10:21:15 2023  -Raw input contains 12557761 variants...
Thu Apr 13 10:21:15 2023  -Plot layout mode is : r
Thu Apr 13 10:21:15 2023  -Region to plot : chr7:126253550-128253550.
Thu Apr 13 10:21:22 2023  -Extract SNPs in region : chr7:126253550-128253550...
Thu Apr 13 10:22:27 2023  -Extract SNPs in specified regions: 8087
Thu Apr 13 10:22:31 2023 Finished loading specified columns from the sumstats.
Thu Apr 13 10:22:31 2023 Start conversion and sanity check:
Thu Apr 13 10:22:31 2023  -Removed 0 variants with nan in CHR or POS column ...
Thu Apr 13 10:22:31 2023  -Removed 0 variants with nan in P column ...
Thu Apr 13 10:22:31 2023  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Thu Apr 13 10:22:31 2023  -sumstats P values are being converted to -log10(P)...
Thu Apr 13 10:22:31 2023  -Sanity check: 0 na/inf/-inf variants will be removed...
Thu Apr 13 10:22:33 2023  -Maximum -log10(P) values is 73.38711023071251 .
Thu Apr 13 10:22:33 2023 Finished data conversion and sanity check.
Thu Apr 13 10:22:33 2023 Start to create manhattan plot with 1600 variants:
Thu Apr 13 10:22:33 2023 Start to download  recombination_hg38  ...
Thu Apr 13 10:22:33 2023  -Downloading to: /Users/he/.gwaslab/recombination/hg38/recombination_hg38.tar.gz
Thu Apr 13 10:22:37 2023  -Updating record in config file...
Thu Apr 13 10:22:37 2023 Downloaded  recombination_hg38  successfully!
Thu Apr 13 10:22:37 2023  -Loading gtf files from:default
Thu Apr 13 10:22:37 2023 No records in config file. Please download first.
Thu Apr 13 10:22:37 2023 Start to download  ensembl_hg38_gtf  ...
Thu Apr 13 10:22:37 2023  -Downloading to: /Users/he/.gwaslab/Homo_sapiens.GRCh38.109.chr.gtf.gz
Thu Apr 13 10:25:05 2023  -Updating record in config file...
Thu Apr 13 10:25:05 2023 Downloaded  ensembl_hg38_gtf  successfully!
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_biotype']
Thu Apr 13 10:27:14 2023  -plotting gene track..
Thu Apr 13 10:27:15 2023  -Finished plotting gene track..
Thu Apr 13 10:27:18 2023  -Found 1 significant variants with a sliding window size of 500 kb...
Thu Apr 13 10:27:18 2023 Finished creating Manhattan plot successfully!
Thu Apr 13 10:27:18 2023  -Skip annotating

image

For LD information, just add your "vcf_path".

#download reference vcf if not yet
gl.download_ref("1kg_eas_hg38")

mysumstats.plot_mqq(skip=2,
                    mode="r",
                    region=(7,126253550,128253550),
                    build="38",
                    vcf_path=gl.get_path("1kg_eas_hg38"))

Thu Apr 13 10:35:11 2023 Start to download  1kg_eas_hg38  ...
Thu Apr 13 10:35:11 2023  -Downloading to: /Users/he/.gwaslab/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz
Thu Apr 13 10:36:49 2023  -Updating record in config file...
Thu Apr 13 10:37:18 2023  -Updating record in config file...
Thu Apr 13 10:37:18 2023  -Downloading to: /Users/he/.gwaslab/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi
Thu Apr 13 10:37:18 2023 Downloaded  1kg_eas_hg38  successfully!
Thu Apr 13 10:37:18 2023 Start to plot manhattan/qq plot with the following basic settings:
...