airalab / robonomics

Robonomics node implementation for Polkadot ecosystem; Kusama parachain slot #2048 since January 2022
https://robonomics.subscan.io/
Apache License 2.0

Fix the health check procedure #326

Closed khssnv closed 1 year ago

khssnv commented 1 year ago

Currently, the Docker image always reports the node as unhealthy, for the reasons listed below. This PR fixes the health check procedure.

  1. The health check procedure sends state queries to a static IP address.

    This PR suggests using the loopback address instead, where the node listens in all modes.

  2. Querying query.parachains.heads always returns an error:

    Error: Cannot find query.parachains, your chain does not have the parachains pallet exposed in the runtime\n    at assert (/usr/lib/node_modules/@polkadot/api-cli/node_modules/@polkadot/util/cjs/assert.js:31:11)

    This PR suggests querying query.system.number instead, so that a node is considered unhealthy when it neither receives nor produces new blocks.

    Whether a node should be considered unhealthy when it stays on the same block is open to discussion, because it may cause unwanted behavior. For instance, RPC nodes may become unreachable during a consensus failure, when the network may stop producing blocks as well. That is out of scope here, though, since query.system.number implements the same idea as the query.parachains.heads check used before.

  3. A Dockerfile directive deletes the directories containing system binaries, into which it later tries to copy healthcheck.sh. The script also requires some system utilities to work.

    This PR suggests preserving /usr/bin and /usr/sbin so that the health check can work. This may raise security concerns, but letting the health check procedure work without system utilities would require further improvements. The accompanying growth of the image size from 744M to 880M does not look large.
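
For illustration, the three points above can be sketched as a small shell script. This is not the healthcheck.sh from the PR, only a minimal sketch under stated assumptions: port 9944, the 30-second delay, the list of required utilities, the way the CLI output is parsed, and all function names are hypothetical. The polkadot-js-api CLI itself comes from the @polkadot/api-cli package named in the error trace.

```shell
#!/bin/sh
# Sketch of a health check that passes only while the block number advances.
# ASSUMPTIONS (not from the PR): port 9944, the 30-second delay, the tool
# list, the output parsing, and every function name below.

# Fail early if a utility the script depends on is absent from the image.
require_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing utility: $tool" >&2
            return 1
        }
    done
}

# Print the first integer found in the given text, ignoring thousands separators.
extract_number() {
    printf '%s\n' "$1" | tr -d ',' | grep -oE '[0-9]+' | head -n 1
}

# Succeed only if both arguments are non-negative integers and the second is larger.
has_progressed() {
    case "$1" in ''|*[!0-9]*) return 1 ;; esac
    case "$2" in ''|*[!0-9]*) return 1 ;; esac
    [ "$2" -gt "$1" ]
}

# Query the block number twice over the loopback address and compare.
check_node() {
    endpoint="${1:-ws://127.0.0.1:9944}"   # assumed default RPC endpoint
    delay="${2:-30}"                       # assumed pause between queries, in seconds
    require_tools polkadot-js-api grep tr head sleep || return 1
    first=$(extract_number "$(polkadot-js-api --ws "$endpoint" query.system.number)")
    sleep "$delay"
    second=$(extract_number "$(polkadot-js-api --ws "$endpoint" query.system.number)")
    has_progressed "$first" "$second"
}
```

A script built this way would end with a call such as `check_node || exit 1`, since Docker treats exit code 0 as healthy and 1 as unhealthy. Keeping the comparison in has_progressed separate from the queries makes it possible to test the logic without a running node.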

khssnv commented 1 year ago

@PavelSheremetev, please review if possible.