geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Remove eutils remote calls from late run #173

Open kltm opened 4 years ago

kltm commented 4 years ago

The snapshot pipeline is currently failing about once a week with:

19:34:36  requests.exceptions.ConnectionError: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /entrez/eutils/efetch.fcgi?db=taxonomy (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6014619358>: Failed to establish a new connection: [Errno -2] Name or service not known',))

It seems to be in the info/stats-gathering step. We're going to have to remove it or work around it. I suspect the same information can be grabbed from another source and held locally.

kltm commented 4 years ago

@lpalbou

What information is this call for again? And does it need to be gathered on every run? I gather that this is mostly to give nice human-readable labels, or maybe that's mistaken? If so, couldn't we just pull this info from the ontology?

If it's slow-moving metadata that cannot be derived from our other inputs, it would probably be better at this point to just gzip it up and toss it into go-site/metadata for reuse. You mentioned your https://geneontology.s3.amazonaws.com/taxon_map.json , which looks like it might be a good jumping-off point if we cannot get this from other internal sources.

IIRC, @goodb also mentioned an idea about this a while back.

lpalbou commented 4 years ago

So looking quickly at the repo, I see two scripts calling eutils:

My guess is that the gaf script is run first and we get banned by eutils for making too many calls. Suggestion: fix the gaf script to build a params argument with all the desired PMIDs and call the API only once. That should be enough to solve the issue.
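The batching idea above can be sketched as follows. A minimal, hedged sketch: the `efetch.fcgi` endpoint and its `db`/`id` parameters are from the NCBI E-utilities API (which accepts a comma-separated `id` list via POST), but the batch size of 200 and the helper names `chunk`/`build_efetch_params` are illustrative assumptions, not the actual gaf script's code:

```python
# Sketch: issue one efetch request per batch of PMIDs instead of one per PMID.
# Helper names and the batch size (200) are illustrative assumptions.

def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_efetch_params(pmids):
    """Build POST parameters for a single efetch call covering many PMIDs."""
    return {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),  # efetch accepts a comma-separated id list
        "retmode": "xml",
    }

# Instead of len(pmids) requests, this makes ceil(len(pmids)/200) requests:
pmids = list(range(1000, 1950))
batches = chunk(pmids, 200)
# for batch in batches:
#     requests.post("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
#                   data=build_efetch_params(batch))
```

With ~950 PMIDs this reduces roughly a thousand calls to five, which should stay well under eutils rate limits.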

lpalbou commented 4 years ago

Also for future reference, I generate my taxon json mapping fallback from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip.
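Building such a fallback map boils down to reading `names.dmp` inside `taxdmp.zip` and keeping only the "scientific name" rows. A minimal sketch, assuming the standard pipe-delimited taxdump layout (`tax_id | name_txt | unique name | name class |`); the function name is illustrative:

```python
import io
import zipfile

def taxon_map_from_taxdmp(zip_path):
    """Parse names.dmp inside NCBI's taxdmp.zip into {taxid: scientific name}.

    names.dmp rows are pipe-delimited:
        tax_id | name_txt | unique name | name class |
    Only rows whose name class is "scientific name" are kept.
    """
    mapping = {}
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open("names.dmp") as fh:
            for raw in io.TextIOWrapper(fh, encoding="utf-8"):
                fields = [f.strip() for f in raw.split("|")]
                if len(fields) >= 4 and fields[3] == "scientific name":
                    mapping[fields[0]] = fields[1]
    return mapping
```

The resulting dict can then be dumped to JSON once and shipped with the release, so stats generation never needs the network.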

kltm commented 4 years ago

@lpalbou Okay, just to clarify my questions from https://github.com/geneontology/pipeline/issues/173#issuecomment-656909961, where does this information get exposed in the stats again and is it likely to change between runs?

lpalbou commented 4 years ago

It's used in the stats to show both the taxon id and taxon labels.

It's not new, and that part hasn't changed since we introduced the stats in October 2019, so it has nothing to do with the update of the stats code. As explained above, the eutils API is bombarded by the gaf script, which probably sends thousands of queries, and your server gets banned.

kltm commented 4 years ago

@lpalbou What I'm confused about here is that you are referencing https://github.com/geneontology/go-site/blob/master/scripts/gaf_pmid_author_list.pl as the root cause of this issue. As far as I know, it is not used in the pipeline, unless it is somehow being called by one of your scripts:

sjcarbon@moiraine:~/local/src/git/go-site[master]$:) grep -r "gaf_pmid_author_list" *
scripts/gaf_pmid_author_list.pl:## gaf_pmid_author_list.pl - a program to take PMID from gaf files and retrieve author list from NCBI PubMed server
sjcarbon@moiraine:~/local/src/git/go-site[master]$:)
sjcarbon@moiraine:~/local/src/git/pipeline[snapshot]$:) grep -r "gaf_pmid_author_list" *
sjcarbon@moiraine:~/local/src/git/pipeline[snapshot]$:( 
lpalbou commented 4 years ago

I don't use the gaf_pmid script. I just did a quick search in the go-site repo for scripts calling eutils and found this one sending thousands of calls to eutils and assumed the issue was coming from there (if that script is called, it could easily trigger a block from eutils and any subsequent call like the one in stats would be denied, hence the error log).

If not all scripts in go-site/scripts/ are executed in your pipeline, maybe we should do a clean-up?

I am also a bit concerned that a production server has so many internet issues; I have never had that locally (and I have a bad connection) or on any AWS server. Maybe something to improve?

Having said that and considering those current internet issues, here are the action items:

kltm commented 4 years ago

go-site/scripts is just a directory that holds general utility scripts; some are used in the pipeline, but most are not. While figuring out what is up with the network would be good, whether at LBL or elsewhere, for the scope of this issue we probably just need to move forward as things are now.

If the desired result is generating the taxon/label map, why not use the (fully merged) ontology or some other local file? For example, http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim-disjoint-over-in-taxon.owl and http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.owl should already be taken into go-lego for the run and would avoid unnecessary network access. Understanding that might make the difference between the approaches clearer to me.
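Extracting the taxon/label map from such a local OWL file is straightforward since NCBITaxon terms carry `rdfs:label` annotations. A minimal sketch, assuming the subset is serialized as RDF/XML with `owl:Class` elements whose IRIs contain `NCBITaxon_` (the function name is illustrative; a real implementation might prefer a proper RDF library):

```python
import xml.etree.ElementTree as ET

OWL = "{http://www.w3.org/2002/07/owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"

def taxon_labels_from_owl(path):
    """Collect {taxid: label} from owl:Class declarations whose IRI is an
    OBO NCBITaxon term, reading rdfs:label -- no eutils call needed."""
    labels = {}
    # iterparse streams the file, so even large subsets stay memory-friendly
    for _, elem in ET.iterparse(path):
        if elem.tag == OWL + "Class":
            iri = elem.get(RDF + "about", "")
            if "NCBITaxon_" in iri:
                label = elem.find(RDFS + "label")
                if label is not None and label.text:
                    labels[iri.rsplit("_", 1)[-1]] = label.text
            elem.clear()  # free already-processed elements
    return labels
```

Since taxslim is already pulled in for the run, this keeps the stats step entirely offline.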

lpalbou commented 4 years ago

Added baby proofing of API call: https://github.com/geneontology/go-site/commit/c90afdf88d46f47a0c551cca2860494cba9791cc
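The general shape of such baby proofing is to try the remote lookup and fall back to a cached map on any network failure. This is only a sketch of the pattern, not the code in the linked commit; the function and parameter names are illustrative assumptions:

```python
def taxon_labels(taxon_ids, fetch_remote, load_fallback):
    """Try the remote lookup first; on any failure, fall back to a locally
    cached taxon map so the stats run can continue instead of crashing.

    fetch_remote(taxon_ids) -> {taxid: label}, may raise on network errors.
    load_fallback() -> {taxid: label} from a local cache (e.g. taxon_map.json).
    Unknown ids degrade to the raw id rather than aborting the run.
    """
    try:
        return fetch_remote(taxon_ids)
    except Exception:
        cached = load_fallback()
        return {t: cached.get(t, t) for t in taxon_ids}
```

With this shape, a DNS failure like the one in the log above degrades the output (raw ids instead of labels, at worst) rather than killing the pipeline.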

lpalbou commented 4 years ago

Reminder: this issue was solved 3 days ago. Even if eutils completely disappeared, the script would go on.

But for completeness, here is the full log of the error before it gets lost:

20:17:39  + python3 /tmp/go_reports.py -g http://localhost:8080/solr/ -s http://current.geneontology.org/release_stats/go-stats.json -n http://current.geneontology.org/release_stats/go-stats-no-pb.json -c http://skyhook.berkeleybop.org/snapshot/ontology/go.obo -p http://current.geneontology.org/ontology/go.obo -o /tmp/stats/ -d 2020-07-12
20:18:01  14
20:18:01  
20:18:01  
20:18:01  1a - EXECUTING GO_STATS SCRIPT (INCLUDING PROTEIN BINDING)...
20:18:01  
20:18:01  Will use golr url:  http://localhost:8080/solr/
20:18:01  1 / 4 - Fetching GO terms...
20:18:01  Done.
20:18:01  2 / 4 - Fetching GO annotations...
20:18:01  Done.
20:18:01  3 / 4 - Fetching GO bioentities...
20:18:01  Done.
20:18:01  4 / 4 - Creating Stats...
20:18:01  Traceback (most recent call last):
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 171, in _new_conn
20:18:01      (self._dns_host, self.port), self.timeout, **extra_kw)
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 56, in create_connection
20:18:01      for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
20:18:01    File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
20:18:01      for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
20:18:01  socket.gaierror: [Errno -2] Name or service not known
20:18:01  
20:18:01  During handling of the above exception, another exception occurred:
20:18:01  
20:18:01  Traceback (most recent call last):
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
20:18:01      chunked=chunked)
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 343, in _make_request
20:18:01      self._validate_conn(conn)
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 849, in _validate_conn
20:18:01      conn.connect()
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 314, in connect
20:18:01      conn = self._new_conn()
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 180, in _new_conn
20:18:01      self, "Failed to establish a new connection: %s" % e)
20:18:01  urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f39f0bbae80>: Failed to establish a new connection: [Errno -2] Name or service not known
20:18:01  
20:18:01  During handling of the above exception, another exception occurred:
20:18:01  
20:18:01  Traceback (most recent call last):
20:18:01    File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 445, in send
20:18:01      timeout=timeout
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
20:18:01      _stacktrace=sys.exc_info()[2])
20:18:01    File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 398, in increment
20:18:01      raise MaxRetryError(_pool, url, error or ResponseError(cause))
20:18:01  urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /entrez/eutils/efetch.fcgi?db=taxonomy (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f39f0bbae80>: Failed to establish a new connection: [Errno -2] Name or service not known',))
20:18:01  
20:18:01  During handling of the above exception, another exception occurred:
20:18:01  
20:18:01  Traceback (most recent call last):
20:18:01    File "/tmp/go_reports.py", line 362, in <module>
20:18:01      main(sys.argv[1:])
20:18:01    File "/tmp/go_reports.py", line 194, in main
20:18:01      json_stats = go_stats.compute_stats(golr_url, release_date)
20:18:01    File "/tmp/go_stats.py", line 242, in compute_stats
20:18:01      prepare_globals(all_annotations)
20:18:01    File "/tmp/go_stats.py", line 282, in prepare_globals
20:18:01      data = requests.post(taxon_base_url, data = params)
20:18:01    File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 112, in post
20:18:01      return request('post', url, data=data, json=json, **kwargs)
20:18:01    File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 58, in request
20:18:01      return session.request(method=method, url=url, **kwargs)
20:18:01    File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 512, in request
20:18:01      resp = self.send(prep, **send_kwargs)
20:18:01    File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 622, in send
20:18:01      r = adapter.send(request, **kwargs)
20:18:01    File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 513, in send
20:18:01      raise ConnectionError(e, request=request)
20:18:01  requests.exceptions.ConnectionError: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /entrez/eutils/efetch.fcgi?db=taxonomy (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f39f0bbae80>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Reference: https://build.geneontology.org/job/geneontology/job/pipeline/job/snapshot/1230/execution/node/563/log/

Please note:

20:18:01  socket.gaierror: [Errno -2] Name or service not known

20:18:01  urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f39f0bbae80>: Failed to establish a new connection: [Errno -2] Name or service not known

So as discussed, your pipeline server failed to create a connection, and it wasn't an IP ban: the service was never found in the first place, and I am pretty confident the problem does not come from eutils. I don't know if it's related, but I also saw Travis checks fail recently due to connection issues.