ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Scheduler instance boot failure #85

Closed ababaian closed 4 years ago

ababaian commented 4 years ago

When running Serratus via terraform apply, all the initial instances come online, but about 20% of the time the scheduler hits what appears to be an AWS credentials problem (IAM is attached correctly), and the cluster has to be restarted. This only happens at initialization, so it's easy to fix, but it's kind of annoying.

CloudWatch logs:

2020-05-13T18:19:35.453Z | [2020-05-13 18:19:35 +0000] [6] [INFO] Starting gunicorn 20.0.4
2020-05-13T18:19:35.454Z | [2020-05-13 18:19:35 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
2020-05-13T18:19:35.454Z | [2020-05-13 18:19:35 +0000] [6] [INFO] Using worker: sync
2020-05-13T18:19:35.455Z | [2020-05-13 18:19:35 +0000] [8] [INFO] Booting worker with pid: 8
2020-05-13T18:19:35.657Z | Creating new process
2020-05-13T18:19:36.681Z | Exception in thread Thread-2:
2020-05-13T18:19:36.681Z | Traceback (most recent call last):
2020-05-13T18:19:36.681Z | File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-13T18:19:36.682Z | self.run()
2020-05-13T18:19:36.682Z | File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-13T18:19:36.682Z | self._target(*self._args, **self._kwargs)
2020-05-13T18:19:36.682Z | File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-13T18:19:36.682Z | autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-13T18:19:36.682Z | File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-13T18:19:36.683Z | return _get_default_session().client(*args, **kwargs)
2020-05-13T18:19:36.683Z | File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-13T18:19:36.683Z | return self._session.create_client(
2020-05-13T18:19:36.683Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 824, in create_client
2020-05-13T18:19:36.683Z | endpoint_resolver = self._get_internal_component('endpoint_resolver')
2020-05-13T18:19:36.684Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 697, in _get_internal_component
2020-05-13T18:19:36.684Z | return self._internal_components.get_component(name)
2020-05-13T18:19:36.684Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-13T18:19:36.684Z | del self._deferred[name]
2020-05-13T18:19:36.684Z | KeyError: 'endpoint_resolver'
ababaian commented 4 years ago

The other version of this error message is here:

[2020-05-09 23:34:13 +0000] [6] [INFO] Starting gunicorn 20.0.4
[2020-05-09 23:34:13 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
[2020-05-09 23:34:13 +0000] [6] [INFO] Using worker: sync
[2020-05-09 23:34:13 +0000] [8] [INFO] Booting worker with pid: 8
Creating new process
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
return _get_default_session().client(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
return self._session.create_client(
File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 823, in create_client
credentials = self.get_credentials()
File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 427, in get_credentials
self._credentials = self._components.get_component(
File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
del self._deferred[name]
KeyError: 'credential_provider'
clear_terminated_jobs() finished. Running again in 10 seconds
brietaylor commented 4 years ago

Might be related to https://github.com/boto/boto3/issues/1592 as I think we're using session objects across threads.
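
If the shared default session is the cause, one commonly suggested workaround is to give each thread its own boto3 Session instead of calling boto3.client() (which, per the traceback above, goes through a module-level default session). The following is only a minimal sketch under that assumption; `get_thread_client`, `_boto3_client_factory`, and the `factory` parameter are hypothetical names for illustration and are not taken from the Serratus code.

```python
import threading

# Per-thread storage: each thread sees its own independent "clients" dict.
_thread_state = threading.local()


def _boto3_client_factory(service, region):
    # Imported lazily so the caching logic can be exercised without boto3.
    import boto3

    # A fresh Session per thread avoids sharing boto3's default session.
    session = boto3.session.Session()
    return session.client(service, region_name=region)


def get_thread_client(service, region, factory=None):
    """Return a client cached per thread, creating it on first use."""
    if factory is None:
        factory = _boto3_client_factory
    cache = getattr(_thread_state, "clients", None)
    if cache is None:
        cache = _thread_state.clients = {}
    key = (service, region)
    if key not in cache:
        cache[key] = factory(service, region)
    return cache[key]
```

In a loop such as adjust_autoscaling_loop, the call would then look something like get_thread_client('autoscaling', app.config["AWS_REGION"]) in place of boto3.client(...). Whether this removes the KeyError here is untested.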

ababaian commented 4 years ago

Message today


2020-05-17T22:05:04.619Z | [2020-05-17 22:05:04 +0000] [6] [INFO] Starting gunicorn 20.0.4
2020-05-17T22:05:04.620Z | [2020-05-17 22:05:04 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
2020-05-17T22:05:04.620Z | [2020-05-17 22:05:04 +0000] [6] [INFO] Using worker: sync
2020-05-17T22:05:04.622Z | [2020-05-17 22:05:04 +0000] [8] [INFO] Booting worker with pid: 8
2020-05-17T22:05:04.846Z | Creating new process
2020-05-17T22:05:05.872Z | Exception in thread Thread-2:
2020-05-17T22:05:05.872Z | Traceback (most recent call last):
2020-05-17T22:05:05.872Z | File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-17T22:05:05.872Z | self.run()
2020-05-17T22:05:05.872Z | File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-17T22:05:05.873Z | self._target(*self._args, **self._kwargs)
2020-05-17T22:05:05.873Z | File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-17T22:05:05.873Z | autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-17T22:05:05.873Z | return _get_default_session().client(*args, **kwargs)
2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-17T22:05:05.873Z | return self._session.create_client(
2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 824, in create_client
2020-05-17T22:05:05.873Z | endpoint_resolver = self._get_internal_component('endpoint_resolver')
2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 697, in _get_internal_component
2020-05-17T22:05:05.873Z | return self._internal_components.get_component(name)
2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-17T22:05:05.878Z | del self._deferred[name]
2020-05-17T22:05:05.878Z | KeyError: 'endpoint_resolver'
2020-05-17T22:05:06.219Z | clear_terminated_jobs() finished. Running again in 600 seconds
ababaian commented 4 years ago

Perhaps related: when trying to load ~96K accessions into the scheduler, I get the following error (it happened on 3 attempts). I'm reducing the input to 20K per batch for now.


2020-05-26T22:16:57.119Z | Creating new process
2020-05-26T22:16:58.403Z | clear_terminated_jobs() finished. Running again in 600 seconds
2020-05-26T22:16:58.945Z | ajust_autoscaling() finished. Running again in 300 seconds
2020-05-26T22:21:59.300Z | ajust_autoscaling() finished. Running again in 300 seconds
2020-05-26T22:22:44.275Z | [2020-05-26 22:22:44 +0000] [6] [CRITICAL] WORKER TIMEOUT (pid:8)
2020-05-26T22:22:44.276Z | [2020-05-26 22:22:44 +0000] [8] [INFO] Worker exiting (pid: 8)
2020-05-26T22:22:44.340Z | [2020-05-26 22:22:44 +0000] [11] [INFO] Booting worker with pid: 11
2020-05-26T22:22:44.552Z | Creating new process
2020-05-26T22:22:44.563Z | Exception in thread Thread-2:
2020-05-26T22:22:44.563Z | Traceback (most recent call last):
2020-05-26T22:22:44.563Z | File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
2020-05-26T22:22:44.563Z | self.run()
2020-05-26T22:22:44.563Z | File "/usr/lib/python3.8/threading.py", line 870, in run
2020-05-26T22:22:44.564Z | self._target(*self._args, **self._kwargs)
2020-05-26T22:22:44.564Z | File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
2020-05-26T22:22:44.565Z | autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
2020-05-26T22:22:44.565Z | File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
2020-05-26T22:22:44.566Z | return _get_default_session().client(*args, **kwargs)
2020-05-26T22:22:44.566Z | File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
2020-05-26T22:22:44.567Z | return self._session.create_client(
2020-05-26T22:22:44.567Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 823, in create_client
2020-05-26T22:22:44.568Z | credentials = self.get_credentials()
2020-05-26T22:22:44.568Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 427, in get_credentials
2020-05-26T22:22:44.568Z | self._credentials = self._components.get_component(
2020-05-26T22:22:44.568Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
2020-05-26T22:22:44.569Z | del self._deferred[name]
2020-05-26T22:22:44.569Z | KeyError: 'credential_provider'
2020-05-26T22:22:45.837Z | clear_terminated_jobs() finished. Running again in 600 seconds
ababaian commented 4 years ago

While booting today, it appeared that the credentials error arises when you run the create_tunnel script too quickly, while the instance is still booting up. There appears to be a race condition of some sort; if you just give everything time, it boots up normally.
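
One way to "give everything time" without guessing is to poll until the scheduler accepts connections, then pause briefly before continuing. This is only a sketch: `wait_for_port` and its `settle` parameter are hypothetical and not part of Serratus or create_tunnel, and port 8000 is taken from the gunicorn lines in the logs above. Note that in those logs gunicorn is already listening about a second before the thread crashes, so an open port alone does not prove the background threads have finished starting; the settle delay is a crude allowance for that.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0, settle=5.0):
    """Poll host:port until a TCP connection succeeds or timeout elapses.

    Returns True if the port became reachable (after sleeping `settle`
    extra seconds), False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                time.sleep(settle)  # extra margin for background threads
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller might run wait_for_port(scheduler_host, 8000) before opening the tunnel, where scheduler_host stands in for however the scheduler address is obtained.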

The issue with adding 90K accessions at once is semi-resolved, as I now add accessions in batches of 20K. So far I have 60K loaded into the scheduler and nothing has blown up.
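
For reference, the batching workaround can be sketched in a few lines. `in_batches` is a hypothetical helper, not something in the Serratus repository, and how each batch is actually submitted to the scheduler is left out because it is not shown in this thread.

```python
from itertools import islice


def in_batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Example: ~96K accessions split into 20K batches gives five batches,
# the last one smaller than the rest.
accessions = [f"ACC{i}" for i in range(96_000)]  # placeholder IDs
batch_sizes = [len(batch) for batch in in_batches(accessions, 20_000)]
```

Each batch would then be loaded in turn, waiting for one load to finish before starting the next.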

ababaian commented 4 years ago

I don't remember which commit closed this, but it is no longer an issue.