frmscoe / General-Issues

This repo exists to track current work and any issues within the FRMS CoE

Graceful exit and process recovery to prevent the processor zombie APPocalypse #275

Closed Justus-at-Tazama closed 10 months ago

Justus-at-Tazama commented 10 months ago

Problem statement:

Every processor that connects to Arango, Redis, or NATS goes into a zombie state if it cannot connect to that service.

(Screenshot) In the attached image, the pods are unable to reconnect because the NATS and Redis connections have not been established. The pods have restarted, but are not connecting.

Recommended solution:

  1. We need to ensure that the catch block exits the process with an error code, so that the associated Kubernetes pod can be restarted after the (to be implemented) retries have failed:

In index.ts, we can add a process.exit(1) statement to achieve a graceful exit when the error condition is triggered:

const numCPUs = os.cpus().length > configuration.maxCPU ? configuration.maxCPU + 1 : os.cpus().length + 1;

if (cluster.isPrimary && configuration.maxCPU !== 1) {
  // Primary process: fork one worker per CPU, capped by configuration.maxCPU.
  for (let i = 1; i < numCPUs; i++) {
    cluster.fork();
  }

  // Replace any worker that exits so the processor keeps its full complement of workers.
  cluster.on('exit', (worker, code, signal) => {
    cluster.fork();
  });
} else {
  (async () => {
    try {
      if (process.env.NODE_ENV !== 'test') {
        await runServer();
      }
    } catch (err) {
      loggerService.error(`Error while starting NATS server on Worker ${process.pid}`, err);
      process.exit(1); // ADDED THIS: exit with a non-zero code so Kubernetes can restart the pod
    }
  })();
}

process.exit(1) will allow the program to exit with error code 1, which Kubernetes can then pick up and use to restart the pod.

  2. When the NATS retries expire, we also want to trigger the same behaviour via process.exit(1) to re-establish the NATS connections in the event of a processor failure, as in the sketch below.
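
A minimal sketch of this behaviour, assuming the nats.js client is used directly (the processors actually connect through the shared library, so the exact hook may differ); the server URL, retry values, and use of console.error instead of the processor's loggerService are illustrative only:

import { connect } from 'nats';

(async () => {
  try {
    // Connection options are illustrative; real processors read these from configuration.
    const natsConnection = await connect({
      servers: 'nats://localhost:4222',
      maxReconnectAttempts: 10,
      reconnectTimeWait: 2000,
    });

    // closed() resolves once the connection is permanently closed, i.e. after
    // the reconnect attempts have been exhausted. A deliberate shutdown should
    // call natsConnection.close() and exit cleanly instead.
    void natsConnection.closed().then((err) => {
      console.error(`NATS connection closed on Worker ${process.pid}`, err ?? 'reconnect attempts exhausted');
      process.exit(1); // non-zero exit code so Kubernetes restarts the pod
    });
  } catch (err) {
    // Initial connection could not be established at all.
    console.error(`Unable to connect to NATS on Worker ${process.pid}`, err);
    process.exit(1);
  }
})();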

Impact of the change:

Every processor that interacts with NATS, or that uses the library to instantiate database (Arango) or Redis connections.
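
A hedged sketch of the same pattern for the library-managed connections; connectWithRetry and createDatabaseConnections below are hypothetical placeholders, not the library's actual API:

// Hypothetical retry helper: attempt an async connection factory a fixed
// number of times, waiting between attempts, then give up.
async function connectWithRetry<T>(factory: () => Promise<T>, attempts = 5, delayMs = 2000): Promise<T> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await factory();
    } catch (err) {
      console.error(`Connection attempt ${attempt}/${attempts} failed`, err);
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw new Error(`Unable to connect after ${attempts} attempts`);
}

// Placeholder standing in for the library's Arango/Redis connection setup.
const createDatabaseConnections = async (): Promise<void> => {
  // e.g. initialise the Arango and Redis clients here
};

(async () => {
  try {
    await connectWithRetry(createDatabaseConnections);
  } catch (err) {
    console.error('Exhausted connection retries, exiting', err);
    process.exit(1); // non-zero exit so Kubernetes restarts the pod
  }
})();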

Lenbkr commented 10 months ago

Awaiting status change to 'Todo'