Closed jpswinski closed 3 years ago
Later in the same log:
INFO> health check start
INFO> node_state: INITIALIZING
INFO> register: http://hsds_head:5100/register
INFO> http_post('http://hsds_head:5100/register', {'id': 'sn-5d182', 'port': 80, 'node_type': 'sn'})
INFO> Initiating TCPConnector with limit 100 connections
INFO> http_post status: 500
WARN> POST request error for url: http://hsds_head:5100/register - status: 500
ERROR> HEAD node seems to be down.
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f69326b5dc0>
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<healthCheck() done, defined at /usr/local/lib/python3.8/site-packages/hsds/basenode.py:414> exception=SystemExit(1)>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 87, in register
rsp_json = await http_post(app, req_reg, data=body)
File "/usr/local/lib/python3.8/site-packages/hsds/util/httpUtil.py", line 157, in http_post
raise HTTPInternalServerError()
aiohttp.web_exceptions.HTTPInternalServerError: Internal Server Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web.py", line 419, in run_app
loop.run_until_complete(_run_app(app,
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
self.run_forever()
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
self._run_once()
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
handle._run()
File "/usr/local/lib/python3.8/asyncio/events.py", line 81, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 433, in healthCheck
await register(app)
File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 96, in register
sys.exit(1)
SystemExit: 1
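The traceback above shows the chain of events: the service node's health-check task POSTs to the head node's /register endpoint, the POST helper raises HTTPInternalServerError on the 500 response, and the register routine responds by calling sys.exit(1), which surfaces as the "Task exception was never retrieved ... SystemExit(1)" message. A minimal sketch of that control flow (names are hypothetical stand-ins, not the actual hsds code):

```python
import sys


class HTTPInternalServerError(Exception):
    """Stand-in for aiohttp's HTTPInternalServerError."""


def http_post(url, data):
    # Hypothetical POST helper: in the log above the head node answered
    # with status 500, so the helper raises instead of returning JSON.
    raise HTTPInternalServerError()


def register(head_url, node_info):
    try:
        return http_post(head_url + "/register", data=node_info)
    except HTTPInternalServerError:
        # Mirrors the behavior seen in basenode.py: a failed registration
        # is treated as "HEAD node seems to be down" and the process exits.
        print("ERROR> HEAD node seems to be down.")
        sys.exit(1)
```

Because the exit happens inside an asyncio task rather than the main coroutine, the event loop reports the SystemExit as an unretrieved task exception.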
Anything odd in the logs for the head_node?
I didn't see anything jump out at me in the head node logs. I will grab the whole thing the next time it happens. It did seem odd to me that the error in the service node log starts with the line saying that the hsds-servicenode was killed. Not sure what that means.
There was another service node crash. I verified that there was around 7GB of memory available on the system at the time of the crash. Here is the head node log at the time of the crash:
INFO> health check 2020-11-12T06:27:42Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
INFO> node health check fail_count: 0
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-bd03f', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59620
Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
resp = await task
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
resp = await handler(request)
File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
"host": body['ip'],
KeyError: 'ip'
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-f3acc', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59624
Error handling request
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
resp = await task
File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
resp = await handler(request)
File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
"host": body['ip'],
KeyError: 'ip'
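Both tracebacks fail on the same line: the head node's register handler indexes body['ip'], but the re-registering service node's body contains only id, port, and node_type. Note that the head node already logs the caller's address ("register host: 172.25.0.6"), so one defensive approach is to fall back to the TCP peer address when the body omits 'ip'. A sketch of that idea (hypothetical helper, not the actual hsds code; peer_host stands in for the peername aiohttp reports for the connection):

```python
def resolve_host(body, peer_host):
    """Pick the host address to record for a registering node.

    Some request bodies carry an explicit 'ip' field; the service
    nodes in the log above send only id/port/node_type, so fall back
    to the address the TCP connection actually came from.
    """
    if "ip" in body:
        return body["ip"]
    return peer_host
```

With a fallback like this, a respawned container can re-register even though its POST body never included an 'ip' key.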
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:27:52Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
WARN> unexpected node_id: sn-a0076 (expecting: sn-26be6)
INFO> node health check fail_count: 1
WARN> Fail_count > 0, Setting cluster_state from READY to INITIALIZING
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-a0076', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59628
INFO> Found free node reference: {'node_number': 0, 'node_type': 'sn', 'host': None, 'port': 80, 'id': None, 'connected': '2020-11-11T20:45:34Z', 'failcount': 0, 'healthcheck': '2020-11-12T06:27:42Z'}
INFO> inactive_node_count: 0
INFO> setting cluster state to READY - was: INITIALIZING
INFO RSP> <200> (OK): /register
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:28:02Z, cluster_state: READY, node_count: 5
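The log above also shows the head node's health-check state machine at work: an unexpected node_id at /info bumps fail_count, any nonzero fail_count flips the cluster from READY to INITIALIZING, and a successful re-registration into the free node slot returns it to READY. A toy model of those transitions (hypothetical, not the hsds implementation):

```python
class ClusterState:
    """Minimal model of the head node's health-check transitions."""

    def __init__(self):
        self.state = "READY"
        self.expected = {}   # node_number -> node id we expect at /info
        self.fail_count = 0

    def health_check(self, reports):
        # reports: {node_number: node_id actually seen at /info}.
        # A mismatch (e.g. sn-a0076 where sn-26be6 was expected) counts
        # as a failure and demotes the cluster state.
        self.fail_count = sum(
            1 for n, nid in reports.items()
            if self.expected.get(n) not in (None, nid)
        )
        if self.fail_count > 0:
            self.state = "INITIALIZING"

    def register(self, node_number, node_id):
        # A respawned container re-registers, reclaims its slot, and
        # (with no inactive nodes left) the cluster returns to READY.
        self.expected[node_number] = node_id
        self.fail_count = 0
        self.state = "READY"
```

This matches the sequence in the log: fail_count 1 and READY to INITIALIZING after the unexpected sn-a0076, then back to READY once sn-a0076 registers into node_number 0.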
@jpswinski - are you dynamically changing the number of containers?
I get the same error if I do: docker-compose -f ${COMPOSE_FILE} scale dn=4
I'm not. The crashes occur randomly during normal use. For instance, it will happen in the middle of reading a dataset.
Ok, I think what is happening is that after the service node crash, Docker is respawning the container. Then when the container registers with the head node, there is a bug that is causing the head node to crash.
I have a fix for the headnode (https://github.com/HDFGroup/hsds/commit/4d3176965021d22a8000b7b630f2dfca616c9f3b) so that nodes will be able to re-register. Try this and let me know how it goes.
I don't think this will fix the problem with the service node crashing, but let's first verify the head node fix.
I didn't see the head node crash - after reporting the error above, it keeps going and ultimately reports the cluster back in the READY state. For our user-facing server, I had to set the restart policy to always so that we can recover from the service node crashes automatically. This is working: the system restarts the service node after a crash and HSDS is usable again. But the service node crash is still a problem, because it stops long processing runs right in the middle, requiring them to start over.
@jpswinski - do you still have a problem with service node restarts?
No! It looks like the problem was the amount of memory Docker allowed for the container. If the container reaches that limit, Docker kills it and restarts it.
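For reference, when a container exceeds its memory limit Docker OOM-kills it, which a restart policy then papers over. A hedged docker-compose sketch (service name and limit are hypothetical, not taken from the actual compose file) combining the restart policy mentioned above with a larger memory cap:

```yaml
services:
  sn:
    # Raise (or remove) the memory cap so the service node is not
    # OOM-killed mid-request. After a crash,
    #   docker inspect -f '{{.State.OOMKilled}}' <container>
    # confirms whether the exit was a memory kill.
    mem_limit: 2g
    restart: always
```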
Looks like this is resolved - closing
The HSDS service node (started from the runall.sh script running in AWS) has crashed on three separate occasions out of the blue during normal use. The most recent two crashes have occurred using the master branch as of this morning (11/6/20).
Here is a capture of the area in the service node log where the crash occurs: