Closed by ababaian 4 years ago
The other version of this error message is here:
[2020-05-09 23:34:13 +0000] [6] [INFO] Starting gunicorn 20.0.4
[2020-05-09 23:34:13 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
[2020-05-09 23:34:13 +0000] [6] [INFO] Using worker: sync
[2020-05-09 23:34:13 +0000] [8] [INFO] Booting worker with pid: 8
Creating new process
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
return _get_default_session().client(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
return self._session.create_client(
File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 823, in create_client
credentials = self.get_credentials()
File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 427, in get_credentials
self._credentials = self._components.get_component(
File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
del self._deferred[name]
KeyError: 'credential_provider'
clear_terminated_jobs() finished. Running again in 10 seconds
Might be related to https://github.com/boto/boto3/issues/1592 as I think we're using session objects across threads.
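If the cause really is a session shared across threads, one common workaround is to give each thread its own `boto3.session.Session` instead of going through the default session that `boto3.client(...)` uses. Below is a minimal sketch of that idea, not the project's actual code. The helper names `get_thread_session` and `make_autoscaling_client` and the `session_factory` parameter are hypothetical and exist only for illustration.

```python
import threading

# Storage whose attributes are private to each thread.
_thread_local = threading.local()


def get_thread_session(session_factory):
    """Return a session owned by the calling thread, creating it on first use."""
    session = getattr(_thread_local, "session", None)
    if session is None:
        session = session_factory()
        _thread_local.session = session
    return session


def make_autoscaling_client(region_name, session_factory=None):
    """Build an autoscaling client from the calling thread's own session."""
    if session_factory is None:
        # Imported lazily so the helper can be exercised without boto3 installed.
        import boto3
        session_factory = boto3.session.Session
    session = get_thread_session(session_factory)
    return session.client("autoscaling", region_name=region_name)
```

In `adjust_autoscaling_loop`, the call `boto3.client('autoscaling', region_name=app.config["AWS_REGION"])` could then become `make_autoscaling_client(app.config["AWS_REGION"])`. Whether this removes the `KeyError` here is an assumption that would need to be checked against the running scheduler.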
Message today
2020-05-17T22:05:04.619Z | [2020-05-17 22:05:04 +0000] [6] [INFO] Starting gunicorn 20.0.4
-- | --
| 2020-05-17T22:05:04.620Z | [2020-05-17 22:05:04 +0000] [6] [INFO] Listening at: http://0.0.0.0:8000 (6)
| 2020-05-17T22:05:04.620Z | [2020-05-17 22:05:04 +0000] [6] [INFO] Using worker: sync
| 2020-05-17T22:05:04.622Z | [2020-05-17 22:05:04 +0000] [8] [INFO] Booting worker with pid: 8
| 2020-05-17T22:05:04.846Z | Creating new process
| 2020-05-17T22:05:05.872Z | Exception in thread Thread-2:
| 2020-05-17T22:05:05.872Z | Traceback (most recent call last):
| 2020-05-17T22:05:05.872Z | File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
| 2020-05-17T22:05:05.872Z | self.run()
| 2020-05-17T22:05:05.872Z | File "/usr/lib/python3.8/threading.py", line 870, in run
| 2020-05-17T22:05:05.873Z | self._target(*self._args, **self._kwargs)
| 2020-05-17T22:05:05.873Z | File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
| 2020-05-17T22:05:05.873Z | autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
| 2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
| 2020-05-17T22:05:05.873Z | return _get_default_session().client(*args, **kwargs)
| 2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
| 2020-05-17T22:05:05.873Z | return self._session.create_client(
| 2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 824, in create_client
| 2020-05-17T22:05:05.873Z | endpoint_resolver = self._get_internal_component('endpoint_resolver')
| 2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 697, in _get_internal_component
| 2020-05-17T22:05:05.873Z | return self._internal_components.get_component(name)
| 2020-05-17T22:05:05.873Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
| 2020-05-17T22:05:05.878Z | del self._deferred[name]
| 2020-05-17T22:05:05.878Z | KeyError: 'endpoint_resolver'
| 2020-05-17T22:05:06.219Z | clear_terminated_jobs() finished. Running again in 600 seconds
Perhaps related: when trying to load ~96K accessions into the scheduler, I get the following error (it happened on 3 attempts). I'm reducing the input to 20K per batch for now.
2020-05-26T22:16:57.119Z | Creating new process
-- | --
| 2020-05-26T22:16:58.403Z | clear_terminated_jobs() finished. Running again in 600 seconds
| 2020-05-26T22:16:58.945Z | ajust_autoscaling() finished. Running again in 300 seconds
| 2020-05-26T22:21:59.300Z | ajust_autoscaling() finished. Running again in 300 seconds
| 2020-05-26T22:22:44.275Z | [2020-05-26 22:22:44 +0000] [6] [CRITICAL] WORKER TIMEOUT (pid:8)
| 2020-05-26T22:22:44.276Z | [2020-05-26 22:22:44 +0000] [8] [INFO] Worker exiting (pid: 8)
| 2020-05-26T22:22:44.340Z | [2020-05-26 22:22:44 +0000] [11] [INFO] Booting worker with pid: 11
| 2020-05-26T22:22:44.552Z | Creating new process
| 2020-05-26T22:22:44.563Z | Exception in thread Thread-2:
| 2020-05-26T22:22:44.563Z | Traceback (most recent call last):
| 2020-05-26T22:22:44.563Z | File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
| 2020-05-26T22:22:44.563Z | self.run()
| 2020-05-26T22:22:44.563Z | File "/usr/lib/python3.8/threading.py", line 870, in run
| 2020-05-26T22:22:44.564Z | self._target(*self._args, **self._kwargs)
| 2020-05-26T22:22:44.564Z | File "/opt/flask_app/cron.py", line 60, in adjust_autoscaling_loop
| 2020-05-26T22:22:44.565Z | autoscaling = boto3.client('autoscaling', region_name=app.config["AWS_REGION"])
| 2020-05-26T22:22:44.565Z | File "/usr/local/lib/python3.8/dist-packages/boto3/__init__.py", line 91, in client
| 2020-05-26T22:22:44.566Z | return _get_default_session().client(*args, **kwargs)
| 2020-05-26T22:22:44.566Z | File "/usr/local/lib/python3.8/dist-packages/boto3/session.py", line 258, in client
| 2020-05-26T22:22:44.567Z | return self._session.create_client(
| 2020-05-26T22:22:44.567Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 823, in create_client
| 2020-05-26T22:22:44.568Z | credentials = self.get_credentials()
| 2020-05-26T22:22:44.568Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 427, in get_credentials
| 2020-05-26T22:22:44.568Z | self._credentials = self._components.get_component(
| 2020-05-26T22:22:44.568Z | File "/usr/local/lib/python3.8/dist-packages/botocore/session.py", line 923, in get_component
| 2020-05-26T22:22:44.569Z | del self._deferred[name]
| 2020-05-26T22:22:44.569Z | KeyError: 'credential_provider'
| 2020-05-26T22:22:45.837Z | clear_terminated_jobs() finished. Running again in 600 seconds
So while I was booting today, it appears that the credentials error arises when you run the `create_tunnel` script too quickly, while the instance is still booting up. There appears to be a race condition of some sort, and if you just give everything time, it boots up normally.
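Since the failure looks transient, one option is to retry client creation instead of letting the background thread die on the first `KeyError`. The sketch below assumes the error goes away on a later attempt, which has not been confirmed. `call_with_retry` and its parameters are hypothetical names, not part of boto3 or of this codebase.

```python
import time


def call_with_retry(make_client, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call make_client(), retrying on KeyError with a fixed pause between tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return make_client()
        except KeyError as error:  # the exception type seen in the tracebacks
            last_error = error
            print(f"client creation failed ({error!r}), attempt {attempt}/{attempts}")
            if attempt < attempts:
                sleep(delay_seconds)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A call site could look like `call_with_retry(lambda: boto3.client('autoscaling', region_name=app.config["AWS_REGION"]))`. This only masks the race; it does not remove it.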
The issue with adding 90K accessions at once is semi-resolved: I now add accessions in batches of 20K. So far I have 60K loaded into the scheduler and nothing has blown up.
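The batching workaround can be sketched as a small generator that splits the accession list into fixed-size chunks. This is an illustration only: `batched` is a hypothetical helper, and how each batch is actually submitted to the scheduler is left out because the thread does not show it.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Placeholder accession IDs, made up for the demo.
    accessions = [f"ACC{i:06d}" for i in range(96_000)]
    for number, batch in enumerate(batched(accessions, 20_000), start=1):
        # Each batch would be loaded into the scheduler separately here.
        print(f"batch {number}: {len(batch)} accessions")
```

With 96K inputs and a batch size of 20K this gives four full batches and a final batch of 16K.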
I don't remember which commit closed this, but it is no longer an issue.
When running Serratus via `terraform apply`, all the initial instances go online, and about 20% of the time the scheduler runs into a problem with what appears to be aws credentials (IAM is attached correctly), and therefore the cluster has to be restarted. This only happens at initiation, so it's easy to fix but kind of annoying.

Cloudwatch logs