HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

HSDS service node crash #72

Closed jpswinski closed 3 years ago

jpswinski commented 3 years ago

The HSDS service node (started from the runall.sh script, running in AWS) has crashed out of the blue on three separate occasions during normal use. The two most recent crashes have occurred using the master branch as of this morning (11/6/20).

Here is a capture of the portion of the service node log where the crash occurs:

INFO> data_sel: (slice(25000000, 25217134, 1),)
INFO> node_state: READY
INFO> http_get('http://hsds_head:5100/nodestate')
INFO> http_get status: 200
INFO> health check ok
INFO> chunk_arr shape: (12500000,)
INFO> data_sel: (slice(0, 12500000, 1),)
INFO> chunk_arr shape: (12500000,)
INFO> data_sel: (slice(12500000, 25000000, 1),)
/entrypoint.sh: line 28:     6 Killed                  hsds-servicenode
hsds entrypoint
node type:  sn
running hsds-servicenode
INFO> Service node initializing
INFO> Application baseInit
INFO> baseInit - node_id: sn-eee6d node_port: 80
INFO> using bucket: ####
INFO> aws_iam_role set to: #####
INFO> aws_secret_access_key not set
INFO> aws_access_key_id not set
INFO> aws_region set to: us-west-2
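
For context, the data_sel and chunk_arr lines above correspond to a single large read that the service node splits into chunk-sized pieces. A minimal h5pyd sketch of that kind of request, where the endpoint, domain, and dataset name are made up for illustration:

import h5pyd  # HSDS Python client; the API mirrors h5py

# endpoint, domain, and dataset name below are hypothetical
with h5pyd.File("/shared/sample/data.h5", "r",
                endpoint="http://hsds.example.com") as f:
    dset = f["values"]
    # one large ~25M-element selection; HSDS breaks it into the
    # data_sel slices visible in the log above
    arr = dset[0:25217134]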
jpswinski commented 3 years ago

Later in the same log:

INFO> health check start
INFO> node_state: INITIALIZING
INFO> register: http://hsds_head:5100/register
INFO> http_post('http://hsds_head:5100/register', {'id': 'sn-5d182', 'port': 80, 'node_type': 'sn'})
INFO> Initiating TCPConnector with limit 100 connections
INFO> http_post status: 500
WARN> POST request error for url: http://hsds_head:5100/register - status: 500
ERROR> HEAD node seems to be down.
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f69326b5dc0>
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<healthCheck() done, defined at /usr/local/lib/python3.8/site-packages/hsds/basenode.py:414> exception=SystemExit(1)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 87, in register
    rsp_json = await http_post(app, req_reg, data=body)
  File "/usr/local/lib/python3.8/site-packages/hsds/util/httpUtil.py", line 157, in http_post
    raise HTTPInternalServerError()
aiohttp.web_exceptions.HTTPInternalServerError: Internal Server Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web.py", line 419, in run_app
    loop.run_until_complete(_run_app(app,
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
    handle._run()
  File "/usr/local/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 433, in healthCheck
    await register(app)
  File "/usr/local/lib/python3.8/site-packages/hsds/basenode.py", line 96, in register
    sys.exit(1)
SystemExit: 1 
jreadey commented 3 years ago

Anything odd in the logs for the head_node?

jpswinski commented 3 years ago

Nothing popped out at me in the head node logs; I will grab the whole log the next time it happens. It did seem odd that the error in the service node log starts with the line saying hsds-servicenode was killed. Not sure what that means.
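
A shell-level "Killed" message like the one in the log usually means the process received SIGKILL, most often from the kernel OOM killer or from Docker enforcing a container memory limit. One way to check whether Docker recorded an out-of-memory kill (the container name here is illustrative):

docker inspect -f '{{.State.OOMKilled}} (exit {{.State.ExitCode}})' hsds_sn_1

An exit status of 137 (128 + 9) likewise points at SIGKILL.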

jpswinski commented 3 years ago

There was another service node crash. I verified that there was around 7GB of memory available on the system at the time of the crash. Here is the head node log at the time of the crash:

INFO> health check 2020-11-12T06:27:42Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
INFO> node health check fail_count: 0
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-bd03f', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59620
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
    "host": body['ip'],
KeyError: 'ip'
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-f3acc', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59624
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_protocol.py", line 418, in start
    resp = await task
  File "/usr/local/lib/python3.8/site-packages/aiohttp/web_app.py", line 458, in _handle
    resp = await handler(request)
  File "/usr/local/lib/python3.8/site-packages/hsds/headnode.py", line 283, in register
    "host": body['ip'],
KeyError: 'ip'
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:27:52Z, cluster_state: READY, node_count: 5
INFO> http_get('http://172.25.0.3:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.7:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.5:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.4:6101/info')
INFO> http_get status: 200
INFO> http_get('http://172.25.0.6:80/info')
INFO> http_get status: 200
WARN> unexpected node_id: sn-a0076 (expecting: sn-26be6)
INFO> node health check fail_count: 1
WARN> Fail_count > 0, Setting cluster_state from READY to INITIALIZING
REQ> POST: /register [hsds_head:5100]
INFO> body: {'id': 'sn-a0076', 'port': 80, 'node_type': 'sn'}
INFO> register host: 172.25.0.6, port: 59628
INFO> Found free node reference: {'node_number': 0, 'node_type': 'sn', 'host': None, 'port': 80, 'id': None, 'connected': '2020-11-11T20:45:34Z', 'failcount': 0, 'healthcheck': '2020-11-12T06:27:42Z'}
INFO> inactive_node_count: 0
INFO> setting cluster state to READY - was: INITIALIZING
INFO RSP> <200> (OK): /register
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
REQ> GET: /nodestate [hsds_head:5100]
INFO> nodestate/*/*
INFO RSP> <200> (OK): /nodestate
INFO> health check 2020-11-12T06:28:02Z, cluster_state: READY, node_count: 5
jreadey commented 3 years ago

@jpswinski - are you dynamically changing the number of containers?

I get the same error if I do: docker-compose -f ${COMPOSE_FILE} scale dn=4

jpswinski commented 3 years ago

I'm not. The crashes occur randomly during normal use. For instance, it will happen in the middle of reading a dataset.

jreadey commented 3 years ago

Ok, I think what is happening is that after the service node crash, Docker is respawning the container. Then when the container registers with the head node, there is a bug that is causing the head node to crash.

I have a fix for the headnode (https://github.com/HDFGroup/hsds/commit/4d3176965021d22a8000b7b630f2dfca616c9f3b) so that nodes will be able to re-register. Try this and let me know how it goes.

I don't think this will fix the problem with the service node crashing, but let's first verify the head node fix.
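
For readers hitting the same traceback: the KeyError: 'ip' in the head node log suggests the register handler assumed every POST body carries an 'ip' key, which a respawned node's registration apparently doesn't satisfy. A minimal sketch of the kind of guard involved, not the actual headnode.py code (handler shape and names are illustrative):

from aiohttp import web

async def register(request: web.Request) -> web.Response:
    body = await request.json()
    if "ip" in body:
        host = body["ip"]
    else:
        # fall back to the socket's peer address, which is what the
        # "register host: ..." log lines already report
        peername = request.transport.get_extra_info("peername")
        host = peername[0] if peername else None
    node = {
        "id": body["id"],
        "node_type": body["node_type"],
        "host": host,
        "port": body["port"],
    }
    # ... update the head node's node table with `node` ...
    return web.json_response(node)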

jpswinski commented 3 years ago

I didn't see the head node crash - after reporting the error above, it keeps going and ultimately reports the cluster back in the READY state. For our user-facing server, I had to set the restart policy to always so that we recover from service node crashes automatically. This works: the system restarts the service node after a crash and HSDS is usable again. But the service node crash is still a problem, because it stops long processing runs right in the middle, requiring them to start over.
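
For reference, the restart policy described above can be set per service in the compose file; a fragment assuming a version 2 compose file (the service name is illustrative):

version: '2'
services:
  sn:
    restart: always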

jreadey commented 3 years ago

@jpswinski - do you still have a problem with service node restarts?

jpswinski commented 3 years ago

No! It looks like the problem was the amount of memory Docker allowed for the container. If the container reaches that limit, Docker kills it, and the restart policy brings it back up.
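
For anyone landing here later: that per-container memory cap can be set in the same version 2 compose file (the service name and limit value are illustrative). A container that exceeds mem_limit is OOM-killed, and restart: always then respawns it, which matches the Killed-then-reinitialize pattern in the service node log above:

version: '2'
services:
  sn:
    mem_limit: 2g
    restart: always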

jreadey commented 3 years ago

Looks like this is resolved - closing