frmscoe / General-Issues

This repo exists to track current work and any issues within the FRMS CoE

Graceful exit and process recovery to prevent the processor zombie APPocalypse #275

Closed Justus-at-Tazama closed 10 months ago

Justus-at-Tazama commented 10 months ago

Problem statement:

Every processor that connects to Arango, Redis, or NATS goes into a zombie state if it cannot connect to that service.

(Screenshot) In the attached image, the pods are unable to reconnect because the NATS and Redis connections have not been established. The pods have restarted, but are not connecting.

Recommended solution:

  1. We need to ensure that the catch block exits the process with an error code, so that the associated Kubernetes pod can be restarted after the (to be implemented) retries have failed:

In index.ts, we can add a process.exit(1) statement to achieve a graceful exit when the error condition is triggered:

const numCPUs = os.cpus().length > configuration.maxCPU ? configuration.maxCPU + 1 : os.cpus().length + 1;

if (cluster.isPrimary && configuration.maxCPU !== 1) {
  // Primary process: fork one worker per CPU, capped by configuration.maxCPU.
  for (let i = 1; i < numCPUs; i++) {
    cluster.fork();
  }

  // Replace any worker that exits so the processor keeps its full complement of workers.
  cluster.on('exit', (worker, code, signal) => {
    cluster.fork();
  });
} else {
  (async () => {
    try {
      if (process.env.NODE_ENV !== 'test') {
        await runServer();
      }
    } catch (err) {
      loggerService.error(`Error while starting NATS server on Worker ${process.pid}`, err);
      process.exit(1); // ADDED THIS: exit with a non-zero code so Kubernetes can restart the pod
    }
  })();
}

process.exit(1) will allow the program to exit with error code 1, which Kubernetes can then pick up and use to restart the pod.

  2. When the NATS retries expire, we also want to trigger the same behaviour via process.exit(1) to re-establish the NATS connections in the event of a processor failure, as in the sketch below.
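
A minimal sketch of this behaviour, assuming the nats.js client is used directly (the processors actually connect through the shared library, so the exact hook may differ); the server URL, retry values, and use of console.error instead of the processor's loggerService are illustrative only:

import { connect } from 'nats';

(async () => {
  try {
    // Connection options are illustrative; real processors read these from configuration.
    const natsConnection = await connect({
      servers: 'nats://localhost:4222',
      maxReconnectAttempts: 10,
      reconnectTimeWait: 2000,
    });

    // closed() resolves once the connection is permanently closed, i.e. after
    // the reconnect attempts have been exhausted. A deliberate shutdown should
    // call natsConnection.close() and exit cleanly instead.
    void natsConnection.closed().then((err) => {
      console.error(`NATS connection closed on Worker ${process.pid}`, err ?? 'reconnect attempts exhausted');
      process.exit(1); // non-zero exit code so Kubernetes restarts the pod
    });
  } catch (err) {
    // Initial connection could not be established at all.
    console.error(`Unable to connect to NATS on Worker ${process.pid}`, err);
    process.exit(1);
  }
})();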

Impact of the change:

Every processor that interacts with NATS, or that uses the library to instantiate database (Arango) or Redis connections.
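
A hedged sketch of the same pattern for the library-managed connections; connectWithRetry and createDatabaseConnections below are hypothetical placeholders, not the library's actual API:

// Hypothetical retry helper: attempt an async connection factory a fixed
// number of times, waiting between attempts, then give up.
async function connectWithRetry<T>(factory: () => Promise<T>, attempts = 5, delayMs = 2000): Promise<T> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await factory();
    } catch (err) {
      console.error(`Connection attempt ${attempt}/${attempts} failed`, err);
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw new Error(`Unable to connect after ${attempts} attempts`);
}

// Placeholder standing in for the library's Arango/Redis connection setup.
const createDatabaseConnections = async (): Promise<void> => {
  // e.g. initialise the Arango and Redis clients here
};

(async () => {
  try {
    await connectWithRetry(createDatabaseConnections);
  } catch (err) {
    console.error('Exhausted connection retries, exiting', err);
    process.exit(1); // non-zero exit so Kubernetes restarts the pod
  }
})();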

Lenbkr commented 10 months ago

Awaiting status change to 'Todo'