RajLabMSSM / catalogueR

Easy-to-use R wrappers for the eQTL Catalogue's API (both with tabix and the REST API).
https://RajLabMSSM.github.io/catalogueR
MIT License

ftp.ebi.ac.uk available, but GTEx_ge_brain_frontal_cortex.all.tsv.gz times out #5

Open paul-shannon opened 3 years ago

paul-shannon commented 3 years ago

Thanks for this fine package - very useful in our work on Alzheimer's Disease.

I find intermittent - sometimes lasting - problems with the ftp service the package uses. Here is an example, establishing first that connectivity is good, then showing the error.

ping ftp.ebi.ac.uk
PING ftp.g.ebi.ac.uk (193.62.197.74): 56 data bytes
64 bytes from 193.62.197.74: icmp_seq=0 ttl=53 time=164.786 ms
64 bytes from 193.62.197.74: icmp_seq=1 ttl=53 time=180.432 ms
64 bytes from 193.62.197.74: icmp_seq=2 ttl=53 time=166.346 ms
64 bytes from 193.62.197.74: icmp_seq=3 ttl=53 time=169.450 ms

The specific file request times out:

[E::hts_open_format] Failed to open file "ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/GTEx/ge/GTEx_ge_brain_frontal_cortex.all.tsv.gz" :
Operation timed out
Couldn't open "ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/GTEx/ge/GTEx_ge_brain_frontal_cortex.all.tsv.gz": Operation timed out
zcat: (stdin): unexpected end of file

paul-shannon commented 3 years ago

I think this is a better problem report:

eQTL_Catalogue.fetch(unique_id="GTEx.brain_frontal_cortex", chrom="8", bp_lower=27610984, bp_upper=27610987)
[1] "CONDA:: Could not identify tabix executable in echoR env. Defaulting to generic 'tabix' command"
[1] "tabix ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/GTEx/ge/GTEx_ge_brain_frontal_cortex.all.tsv.gz 8:27610984-27610987"
[E::hts_open_format] Failed to open file "ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/GTEx/ge/GTEx_ge_brain_frontal_cortex.all.tsv.gz" : Operation timed out
Couldn't open "ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/GTEx/ge/GTEx_ge_brain_frontal_cortex.all.tsv.gz": Operation timed out

My tabix is Cellar/htslib/1.14/bin/tabix

paul-shannon commented 3 years ago

More info: running on Ubuntu with a different tabix, same problem:

tabix ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/GTEx/ge/GTEx_ge_brain_frontal_cortex.all.tsv.gz 8:27610984-276109801
[E::hts_open_format] Failed to open file "ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/GTEx/ge/GTEx_ge_brain_frontal_cortex.all.tsv.gz" : Operation timed out
Couldn't open "ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/GTEx/ge/GTEx_ge_brain_frontal_cortex.all.tsv.gz": Operation timed out

Any thoughts? It's clear this problem is outside of catalogueR!
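
One way to narrow this down: ping only tests ICMP reachability, while tabix over FTP needs a TCP connection to the FTP control port (21 by default). The following is a minimal sketch in Python, used only for illustration; the helper name tcp_port_open is hypothetical and not part of catalogueR or htslib.

```python
import socket


def tcp_port_open(host, port=21, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example call (needs network access, so it is left commented out):
# tcp_port_open("ftp.ebi.ac.uk")
```

If ping succeeds but this check fails, the block is likely somewhere between the client and the FTP service rather than in tabix or catalogueR. Note that a successful handshake alone does not guarantee that a later FTP transfer will go through.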

bschilder commented 3 years ago

Hi @paul-shannon, glad you're finding this tool useful. Thanks for pointing out this issue. I'll look into this and try to figure out what's going on here.

Some potential sources:

Potentially related: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/issues/15

bschilder commented 3 years ago

@kauralasoo is there anything on eQTL Catalogue's end that might be causing unstable connections to the FTP server?

I just confirmed that the file paths haven't changed, so they do indeed seem to exist.

kauralasoo commented 2 years ago

Hi @paul-shannon and @bschilder,

We just received a confirmation from the EBI helpdesk that the root cause for this was that Paul's IP address had been blocked by the EBI firewall. Paul's IP has been whitelisted now, but unfortunately there is no good solution to prevent it from happening to other users, because tabix requests over FTP (incomplete downloads) look a lot like DDoS attacks to the firewall. The REST API is much more robust, because it is able to rate limit the number of requests by IP address on its own.

Best, Kaur

bschilder commented 2 years ago

Thanks so much for the response @kauralasoo! This is all really helpful info. I'll make some adjustments to catalogueR and may make it so that the REST API is the default method.

Update in dev branch

paul-shannon commented 2 years ago

Hi Brian,

One possible caution: Kaur explained to me this about GTEx:

Unfortunately the uniformly processed GTEx summary statistics are currently not available via the API. We hope to fix this with the next release planned for January 2022. However, we do have the official GTEx V8 summary statistics in the API. The study ID for those is GTEx_V8. Thus, this command works:

https://www.ebi.ac.uk/eqtl/api/chromosomes/8/associations?paginate=False&study=GTEx_V8&qtl_group=Brain_Cortex&quant_method=ge&bp_lower=27603335&bp_upper=27608281

We've found that the official imported GTEx v8 summary statistics have slightly better power than our re-processed ones, probably due to better handling of covariates.

So perhaps, in your code, in the construction of the REST url, you could substitute like this, at least until the next release?

study=GTEx_V8 for study=GTEX

As it is now, none of the valuable GTEx eQTLs are available when using the REST interface to catalogueR.
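
The substitution suggested above could be sketched roughly as follows. Python is used here only for illustration (catalogueR itself is R). The function name build_rest_url and the STUDY_ALIASES table are hypothetical; the endpoint and query parameters are copied from the working URL quoted above, and the exact spelling of the old study ID is an assumption, so the lookup ignores case.

```python
from urllib.parse import urlencode

# Base endpoint copied from the working URL quoted in this comment.
BASE = "https://www.ebi.ac.uk/eqtl/api"

# Hypothetical alias table: study IDs (upper-cased) to swap before building the query.
STUDY_ALIASES = {"GTEX": "GTEx_V8"}


def build_rest_url(study, qtl_group, chrom, bp_lower, bp_upper, quant_method="ge"):
    """Build an associations URL, replacing any aliased study ID first."""
    study = STUDY_ALIASES.get(study.upper(), study)
    query = urlencode({
        "paginate": "False",
        "study": study,
        "qtl_group": qtl_group,
        "quant_method": quant_method,
        "bp_lower": bp_lower,
        "bp_upper": bp_upper,
    })
    return f"{BASE}/chromosomes/{chrom}/associations?{query}"


print(build_rest_url("GTEx", "Brain_Cortex", "8", 27603335, 27608281))
```

Keeping the mapping in one table would make it easy to drop the alias once the uniformly processed GTEx data is available through the API.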


bschilder commented 2 years ago

Thanks for the helpful info @paul-shannon, hadn't realized this!

catalogueR::eQTL_Catalogue.list_datasets currently relies on the metadata provided here, tabix_ftp_paths.tsv: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tabix/tabix_ftp_paths.tsv

It looks like there is another file called tabix_ftp_paths_imported.tsv: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tabix/tabix_ftp_paths_imported.tsv

I'll modify catalogueR::eQTL_Catalogue.list_datasets to integrate this second file as well (with a tryCatch in case it doesn't exist in the future).

bschilder commented 2 years ago

I've just updated the metadata to include GTEx_V8. I also added a new argument to eQTL_Catalogue.list_datasets called include_imported. Setting this to TRUE (the default) integrates the additional datasets from tabix_ftp_paths_imported.tsv.

Currently implemented in the dev branch.
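
The idea of merging the two metadata files, with a fallback if the second one disappears, can be sketched as below. This is an illustration in Python, not the actual catalogueR code (which is R and uses tryCatch). The fetch callable is a hypothetical stand-in for the download step, and no particular column names are assumed.

```python
import csv
import io


def parse_tsv(text):
    """Parse tab-separated text into a list of row dicts keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def list_datasets(fetch, include_imported=True):
    """Combine the main metadata table with the imported one.

    `fetch` is a hypothetical callable that maps a file name to its TSV text.
    """
    rows = parse_tsv(fetch("tabix_ftp_paths.tsv"))
    if include_imported:
        try:
            rows += parse_tsv(fetch("tabix_ftp_paths_imported.tsv"))
        except Exception as err:
            # Mirrors the tryCatch idea: keep going if the second file is unavailable.
            print(f"Skipping imported datasets: {err}")
    return rows
```

Passing the download step in as a function keeps the merge logic easy to exercise offline with in-memory text.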

bschilder commented 2 years ago

I'm in the process of overhauling catalogueR to make it compatible with (and take advantage of) the rest of the echoverse, which has expanded quite a bit and is much more robust now.

@kauralasoo has anything changed regarding using tabix to query the eQTL Catalogue? If not, I'm going to add the following instructions whenever someone tries to use the fetch_tabix() function:

WARNING: Querying eQTL Catalogue with tabix will only work 
if your IP address has been whitelisted by an EMBL-EBI server administrator. 
Please request access via this form: 
https://www.ebi.ac.uk/about/contact/support/