galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0
37 stars 50 forks source link

Need better logging/handling when there are staging and MQ publishing problems #291

Closed natefoo closed 1 year ago

natefoo commented 2 years ago
[g2main@galaxy07 ~]$ kubectl logs pulsar-38847417-l5fbt pulsar-container
SETTING UP LOGGING BASED ON [{u'staging_directory': u'/pulsar_staging', u'managers': {u'tacc_k8s': {u'type': u'coexecution'}}, u'message_queue_url': u'amqp://main_pulsar:****@amqp.galaxyproject.org:5671//main_pulsar?ssl=1'}]
2021-11-04 15:04:47,736 INFO  [pulsar.core][MainThread] Starting the Pulsar without a toolbox to white-list.Ensure this application is protected by firewall or a configured private token.
2021-11-04 15:04:47,737 WARNI [galaxy.tool_util.deps][MainThread] Path 'dependencies' does not exist, ignoring
2021-11-04 15:04:47,737 WARNI [galaxy.tool_util.deps][MainThread] Path 'dependencies' is not directory, ignoring
2021-11-04 15:04:47,839 INFO  [pulsar.locks][MainThread] pylockfile module not found, skipping experimental lockfile handling.
2021-11-04 15:04:47,841 INFO  [pulsar.messaging.bind_amqp][MainThread] bind_manager_to_queue called for [amqp://main_pulsar:********@amqp.galaxyproject.org:5671//main_pulsar?ssl=1] and manager [tacc_k8s]
2021-11-04 15:04:47,847 DEBUG [pulsar.managers.staging.pre][[manager=tacc_k8s]-[action=preprocess]-[job=38847417]] Staging tool 'default_notebook.ipynb' via FileAction[url=https://galaxy-web-03.galaxyproject.org/_job_files?job_id=deadbeef&job_key=deadbeef&path=/cvmfs/main.galaxyproject.org/galaxy/tools/interactive/default_notebook.ipynb&file_type=tool,action_type=remote_transfer,path=/cvmfs/main.galaxyproject.org/galaxy/tools/interactive/default_notebook.ipynb] to /pulsar_staging/38847417/tool_files/default_notebook.ipynb
2021-11-04 15:04:47,901 INFO  [pulsar.managers.stateful][MainThread] calling proxy status...
2021-11-04 15:04:47,901 INFO  [pulsar.managers.stateful][MainThread] not preprocessed...

The notebook has not been fetched:

root@pulsar-38847417-pgvr7:/pulsar_staging/38847417/tool_files# ls -al
total 0
drwxr-xr-x 2 root root   6 Nov  4 15:10 .
drwxr-xr-x 8 root root 266 Nov  4 15:10 ..

Pulsar then attempts to notify Galaxy via a failure status update message:

2021-11-04 15:04:47,926 DEBUG [pulsar.messaging.bind_amqp][[manager=tacc_k8s]-[action=preprocess]-[job=38847417]] Publishing Pulsar state change with status failed for job_id 38847417
2021-11-04 15:04:47,927 DEBUG [pulsar.client.amqp_exchange][[manager=tacc_k8s]-[action=preprocess]-[job=38847417]] [publish:8d69e554-3d80-11ec-b650-8277a96e6477] Begin publishing to key pulsar_tacc_k8s__status_update
2021-11-04 15:04:47,935 DEBUG [pulsar.client.amqp_exchange][[manager=tacc_k8s]-[action=preprocess]-[job=38847417]] [publish:8d69e554-3d80-11ec-b650-8277a96e6477] Have producer for publishing to key pulsar_tacc_k8s__status_update
2021-11-04 15:04:49,904 INFO  [pulsar.managers.stateful][MainThread] calling proxy status...

But that message is never received (and as far as I can tell, silently never published).

For the record, the underlying problem appears to be with name resolution in the k8s cluster:

[g2main@galaxy07 ~]$ kubectl exec -it pulsar-38847417-pgvr7 -- bash
Defaulting container name to pulsar-container.
Use 'kubectl describe pod/pulsar-38847417-pgvr7 -n ndc' to see all of the containers in this pod.
root@pulsar-38847417-pgvr7:/# python
Python 2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 21:00:58) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get('https://galaxy-web-03.galaxyproject.org/_job_files?job_id=deadbeef&job_key=deadbeef&path=/cvmfs/main.galaxyproject.org/galaxy/tools/interactive/default_notebook.ipynb&file_type=tool')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='galaxy-web-03.galaxyproject.org', port=443): Max retries exceeded with url: /_job_files?job_id=deadbeef&job_key=deadbeef&path=/cvmfs/main.galaxyproject.org/galaxy/tools/interactive/default_notebook.ipynb&file_type=tool (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbc323dd850>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
>>> 
mvdbeek commented 1 year ago

@natefoo You can now set amqp_publish_timeout in pulsar's app config, that should at a minimum fail and log an exception. Not sure if we should retry this or not, probably not ? We've also got sentry_dsn in pulsar's app config, that should help discover errors.

Let me know if that works as it should, I would then set a default value.