Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0

error when setting anno="GENENAME" #84

Open Sheeya-Dong opened 8 months ago

Sheeya-Dong commented 8 months ago

When I set anno=True, the Manhattan plot was created successfully. However, when I changed only anno to "GENENAME", the following error occurred:

Traceback (most recent call last):
  File "/home/shulab/.local/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3629, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Annotation'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lab/dxy/manhattan.py", line 20, in <module>
    mysumstats.plot_mqq(mode="m",
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/Sumstats.py", line 476, in plot_mqq
    plot = mqqplot(self.data,
           ^^^^^^^^^^^^^^^^^^
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/mqqplot.py", line 855, in mqqplot
    ax1 = annotate_single(
          ^^^^^^^^^^^^^^^^
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/annotateplot.py", line 123, in annotate_single
    annotation_text=row["Annotation"]
  File "/home/lab/.local/lib/python3.11/site-packages/pandas/core/series.py", line 958, in __getitem__
    return self._get_value(key)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/.local/lib/python3.11/site-packages/pandas/core/series.py", line 1069, in _get_value
    loc = self.index.get_loc(label)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/.local/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3631, in get_loc
    raise KeyError(key) from err
KeyError: 'Annotation'

Any idea? Thanks a lot!

Sheeya-Dong commented 8 months ago

And my command is:

mysumstats.plot_mqq(mode="m",
                    skip=3,
                    build="19",
                    sig_level=2e-10,
                    sig_level_lead=2e-10,
                    sig_line_color="grey",
                    anno="GENENAME",
                    additional_line=[5e-8],
                    additional_line_color=["lightgrey"])

Cloufield commented 8 months ago

Hi, using the latest version v3.4.41, your script works well on my sample dataset.

But I got a similar error if build="19" is not specified.

Could you please check whether build="19" was specified in your script when the error occurred? If the error persists, please let me know your gwaslab version and the full error log (including the gwaslab log) so I can debug it. Thanks!
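A quick way to report the installed version and compare it with v3.4.41 is sketched below. It uses only the standard library's importlib.metadata; the helper names (parse_version, is_older, installed_gwaslab_version) are illustrative and not part of gwaslab, and the parser assumes plain numeric versions such as 3.4.24.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a dotted numeric version such as 'v3.4.24' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def is_older(installed, reference):
    """Return True if `installed` is strictly older than `reference`."""
    return parse_version(installed) < parse_version(reference)


def installed_gwaslab_version():
    """Return the installed gwaslab version string, or None if it is not installed."""
    try:
        return version("gwaslab")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_gwaslab_version()
    if current is None:
        print("gwaslab is not installed in this environment")
    elif is_older(current, "3.4.41"):
        print(f"gwaslab {current} is older than 3.4.41; consider upgrading")
    else:
        print(f"gwaslab {current} is at least 3.4.41")
```

If the installed version turns out to be older, upgrading with pip (for example `pip install -U gwaslab`) before retrying is a reasonable first step.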

Sheeya-Dong commented 8 months ago

Thanks for your response. The error is still there when I specify build="19". The log is:

Fri Mar 22 15:39:14 2024 GWASLab v3.4.24 https://cloufield.github.io/gwaslab/
Fri Mar 22 15:39:14 2024 (C) 2022-2023, Yunye He, Kamatani Lab, MIT License, gwaslab@gmail.com
Fri Mar 22 15:39:14 2024 Start to load format from formatbook....
Fri Mar 22 15:39:14 2024 -plink2 format meta info:
Fri Mar 22 15:39:14 2024 - format_name : PLINK2 .glm.firth, .glm.logistic,.glm.linear
Fri Mar 22 15:39:14 2024 - format_source : https://www.cog-genomics.org/plink/2.0/formats
Fri Mar 22 15:39:14 2024 - format_version : Alpha 3.3 final (3 Jun)
Fri Mar 22 15:39:14 2024 - last_check_date : 20220806
Fri Mar 22 15:39:14 2024 -plink2 to gwaslab format dictionary:
Fri Mar 22 15:39:14 2024 - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR
Fri Mar 22 15:39:14 2024 - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR
Fri Mar 22 15:39:14 2024 Start to initiate from file :/home/shulab/dxy2/result/univariate_nodal_set2/gwas_result_ne_min_p.txt
Fri Mar 22 15:39:21 2024 -Reading columns : T_STAT,P,SE,OBS_CT,A1,POS,A1_FREQ,#CHROM,BETA,ID,ALT,REF
Fri Mar 22 15:39:21 2024 -Renaming columns to : T,P,SE,N,EA,POS,EAF,CHR,BETA,SNPID,ALT,REF
Fri Mar 22 15:39:21 2024 -Current Dataframe shape : 7301482 x 12
Fri Mar 22 15:39:21 2024 -Initiating a status column: STATUS ...
Fri Mar 22 15:39:22 2024 NEA not available: assigning REF to NEA...
Fri Mar 22 15:39:22 2024 -EA,REF and ALT columns are available: assigning NEA...
Fri Mar 22 15:39:22 2024 -For variants with EA == ALT : assigning REF to NEA ...
Fri Mar 22 15:39:23 2024 -For variants with EA != ALT : assigning ALT to NEA ...
Fri Mar 22 15:39:24 2024 Start to reorder the columns...
Fri Mar 22 15:39:24 2024 -Current Dataframe shape : 7301482 x 14
Fri Mar 22 15:39:24 2024 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,STATUS,REF,ALT,T
Fri Mar 22 15:39:24 2024 Finished sorting columns successfully!
Fri Mar 22 15:39:25 2024 -Column: SNPID CHR POS EA NEA EAF BETA SE P N STATUS REF ALT T
Fri Mar 22 15:39:25 2024 -DType : object int64 int64 category category float64 float64 float64 float64 int64 category category category float64
Fri Mar 22 15:39:25 2024 Finished loading data successfully!
Fri Mar 22 15:39:25 2024 Start to plot manhattan/qq plot with the following basic settings:
Fri Mar 22 15:39:25 2024 -Genomic coordinates version: 19...
Fri Mar 22 15:39:25 2024 -Genome-wide significance level is set to 2e-10 ...
Fri Mar 22 15:39:25 2024 -Raw input contains 7301482 variants...
Fri Mar 22 15:39:25 2024 -Plot layout mode is : m
Fri Mar 22 15:39:34 2024 Finished loading specified columns from the sumstats.
Fri Mar 22 15:39:34 2024 Start conversion and sanity check:
Fri Mar 22 15:39:34 2024 -Removed 0 variants with nan in CHR or POS column ...
Fri Mar 22 15:39:36 2024 -Removed 0 varaints with CHR <=0...
Fri Mar 22 15:39:36 2024 -Removed 0 variants with nan in P column ...
Fri Mar 22 15:39:37 2024 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Fri Mar 22 15:39:37 2024 -Sumstats P values are being converted to -log10(P)...
Fri Mar 22 15:39:37 2024 -Sanity check: 0 na/inf/-inf variants will be removed...
Fri Mar 22 15:39:38 2024 -Maximum -log10(P) values is 21.248236819748197 .
Fri Mar 22 15:39:38 2024 Finished data conversion and sanity check.
Fri Mar 22 15:39:39 2024 Start to create manhattan plot with 942204 variants:
Fri Mar 22 15:39:41 2024 -Found 24 significant variants with a sliding window size of 500 kb...
Fri Mar 22 15:39:41 2024 Start to annotate variants with nearest gene name(s)...
Fri Mar 22 15:39:41 2024 -Assigning Gene name using ensembl_hg19_gtf for protein coding genes
Fri Mar 22 15:39:41 2024 No records in config file. Please download first.
Fri Mar 22 15:39:41 2024 Start to download ensembl_hg19_gtf ...
Fri Mar 22 15:39:41 2024 -Downloading to: /home/lab/.gwaslab/Homo_sapiens.GRCh37.87.chr.gtf.gz
Traceback (most recent call last):
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1092, in _validate_conn
    conn.connect()
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/connection.py", line 611, in connect
    self.sock = sock = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/connection.py", line 212, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7f069ccef750>, 'Connection to ftp.ensembl.org timed out. (connect timeout=20)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='ftp.ensembl.org', port=443): Max retries exceeded with url: /pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f069ccef750>, 'Connection to ftp.ensembl.org timed out. (connect timeout=20)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lab/dxy/manhattan.py", line 20, in <module>
    mysumstats.plot_mqq(mode="m",
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/Sumstats.py", line 476, in plot_mqq
    plot = mqqplot(self.data,
           ^^^^^^^^^^^^^^^^^^
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/mqqplot.py", line 815, in mqqplot
    to_annotate = annogene(to_annotate,
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/getsig.py", line 253, in annogene
    gtf_path = check_and_download("ensembl_hg19_gtf")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/download.py", line 356, in check_and_download
    download_ref(name,directory = dir_path)
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/download.py", line 218, in download_ref
    download_file(url,local_path)
  File "/home/lab/.local/lib/python3.11/site-packages/gwaslab/download.py", line 316, in download_file
    with requests.get(url, stream=True,timeout=(20, 20)) as r:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lab/anaconda3/envs/dxy/lib/python3.11/site-packages/requests/adapters.py", line 507, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='ftp.ensembl.org', port=443): Max retries exceeded with url: /pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f069ccef750>, 'Connection to ftp.ensembl.org timed out. (connect timeout=20)'))

Cloufield commented 8 months ago

This is a different error, caused by a poor connection while downloading reference files. Because you haven't yet downloaded Homo_sapiens.GRCh37.87.chr.gtf.gz (the reference file used for annotating gene names) from the Ensembl FTP site, gwaslab tries to download it first. The connection timed out, so the error occurred. You can run gl.download_ref("ensembl_hg19_gtf",overwrite=True) to download it first (you may need to try several times because of the poor connection), and then plot.
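Retrying by hand can be tedious, so here is a minimal sketch of a retry wrapper around that download call. The retry and download_hg19_gtf helpers are hypothetical names, not gwaslab functions, and the sketch assumes gl.download_ref raises an exception when the connection fails, as it did in the traceback above; the attempt count and delay are arbitrary.

```python
import time


def retry(func, attempts=5, delay=10.0, exceptions=(Exception,)):
    """Call func() until it succeeds; re-raise the last error after `attempts` tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            print(f"Attempt {attempt} failed ({err!r}); retrying in {delay} s...")
            time.sleep(delay)


def download_hg19_gtf():
    """Download the hg19 GTF reference using the call quoted in the comment above."""
    # Imported lazily so the retry helper itself does not require gwaslab.
    import gwaslab as gl

    gl.download_ref("ensembl_hg19_gtf", overwrite=True)


# Usage (requires gwaslab and network access):
# retry(download_hg19_gtf, attempts=5, delay=10.0)
```

Once the download succeeds, the original plot_mqq call with anno="GENENAME" should find the reference file locally instead of trying to fetch it again.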

Sheeya-Dong commented 8 months ago

Got it! Thank you!