IHTSDO / snowstorm

Scalable SNOMED CT Terminology Server using Elasticsearch

AKS Stop/Start not working with Snomed API #361

Open Eneuman opened 2 years ago

Eneuman commented 2 years ago

I have noticed that if you stop an Azure Kubernetes Service (AKS) cluster running Snomed API, wait 8 hours, and then start it again, the Snomed API stops working and the error logs say that the connection to the Elasticsearch instance times out. It does not seem to be able to recover from this.

The only way to get it working again is to restart the Snomed API pod. This is a bit problematic, since stopping the cluster at night is a good way to save money when the cluster is not in use (e.g. dev clusters).

I can see two ways to fix this.

  1. Add retry logic that can handle connection timeouts to ES after the pod has been asleep for 8 hours.
  2. Add a health check endpoint to Snomed API that tries to connect to ES and returns HTTP 200 if everything is working.

Is there a health check endpoint in Snomed API today?
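
To make the second option concrete, here is a minimal sketch of such a health check using only the JDK. It is not Snowstorm code: the class `EsHealthCheck`, the `/health` path, and the choice of HTTP 503 for the failing case are all assumptions made for illustration. The endpoint returns 200 when a probe of Elasticsearch succeeds and 503 otherwise.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of Snowstorm. */
final class EsHealthCheck {

    private EsHealthCheck() {}

    /** Returns true if a GET on the given URL answers with HTTP 200 within the timeout. */
    static boolean elasticsearchReachable(String esUrl, Duration timeout) {
        try {
            HttpClient client = HttpClient.newBuilder().connectTimeout(timeout).build();
            HttpRequest request = HttpRequest.newBuilder(URI.create(esUrl))
                    .timeout(timeout)
                    .GET()
                    .build();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() == 200;
        } catch (IOException e) {
            // Connection refused, timeout, etc.: treat as "not reachable".
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Starts a small HTTP server whose /health endpoint returns 200 if the probe passes, else 503. */
    static HttpServer start(int port, BooleanSupplier probe) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", exchange -> {
            int status = probe.getAsBoolean() ? 200 : 503;
            exchange.sendResponseHeaders(status, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

A Kubernetes liveness probe pointed at an endpoint like this would restart the pod automatically once the probe keeps failing, which is the manual workaround described above. For example, `EsHealthCheck.start(8081, () -> EsHealthCheck.elasticsearchReachable("http://elasticsearch:9200", Duration.ofSeconds(2)))`, where the port and the Elasticsearch URL are placeholders.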

Eneuman commented 2 years ago

I think I might have figured out what's going on.

When the cluster starts, it recreates all pods. Since the Snomed API pod starts faster than the Elasticsearch pod, it crashes. It would be great if you could configure a retry policy for the Snomed API Elasticsearch connection, like "Try a maximum of 10 times, wait 30 sec between tries".
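
A retry policy of that shape could look roughly like the sketch below. It is an illustration only, not Snowstorm's actual connection code: the class `ConnectionRetry` and the method `withRetry` are hypothetical names, and the action passed in stands for whatever call opens the Elasticsearch connection.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Snowstorm. */
final class ConnectionRetry {

    private ConnectionRetry() {}

    /**
     * Calls the action until it succeeds or maxAttempts is reached,
     * sleeping for the given delay between failed attempts.
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T withRetry(Callable<T> action, int maxAttempts, Duration delay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread is being shut down.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay.toMillis());
                }
            }
        }
        throw last;
    }
}
```

With the numbers from the comment above, a call would be `ConnectionRetry.withRetry(() -> connectToElasticsearch(), 10, Duration.ofSeconds(30))`, where `connectToElasticsearch` is a placeholder for the real connection step. This gives the Elasticsearch pod up to about four and a half minutes to become available before the application gives up.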

kaicode commented 2 years ago

Hi @Eneuman, thanks for reporting this issue. It sounds very similar to #177. I started implementing a fix for that, but unfortunately it completely breaks Snowstorm when deployed as a standalone Java app on AWS EC2 using the AWS Elasticsearch service. I have not been able to work out what the cause is. I will try to look at this again in the next few weeks.