IHTSDO / snowstorm

Scalable SNOMED CT Terminology Server using Elasticsearch

Server architecture and set-up #363

Open Tannjorn opened 2 years ago

Tannjorn commented 2 years ago

Our NRC is currently in the process of setting up a high-availability Snowstorm server for distribution of the Norwegian Extension. Is there a best practice for how to set this up?

In discussions with our service provider we face two important issues.

  1. Should Snowstorm and Elastic run on the same or on separate servers? In testing we have mainly set up both Elastic and Snowstorm on the same virtual machine (VM), and this seems to give good performance. Our service provider has tested a set-up with Elastic on one VM and Snowstorm on another. To me this does not make sense, as I understood Elastic to be the database engine and Snowstorm the service that reads and writes the indexes in Elastic. I would therefore expect that putting the service (Snowstorm) on a different VM would reduce performance without any real advantage. But I am in no way sure about this.
  2. Should multiple servers be set up for failover and/or load balancing? We have not done this so far in our NRC-controlled environment, but only because our testing has focused on terminology development rather than server set-up. Our service provider is suggesting a set-up with two VMs running Elastic, and optionally also two VMs running Snowstorm. Their first suggestion is three VMs, where a single Snowstorm instance works against two load-balanced Elastic VMs. To me this sounds unstable. I would think two VMs, with one as a "hot spare", could work, but setting them up as load-balanced live servers could possibly cause instability (see the sketch after this list).
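To make the hot-spare/failover idea concrete, here is a minimal health-check sketch in Go. It is only an illustration, not something we run in production: it calls Elasticsearch's standard /_cluster/health API, which is the kind of probe a load balancer or failover script could use to decide whether a node is usable. The hostname is a placeholder for one of our Elastic VMs.

```go
package main

// Minimal health-probe sketch: queries Elasticsearch's /_cluster/health
// API and reports the cluster status (green/yellow/red).
// The hostname below is a placeholder for one of the Elastic VMs.

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}

	resp, err := client.Get("http://elastic-vm-1:9200/_cluster/health")
	if err != nil {
		fmt.Println("elasticsearch unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var health struct {
		Status string `json:"status"` // "green", "yellow" or "red"
	}
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		fmt.Println("unexpected response:", err)
		os.Exit(1)
	}

	fmt.Println("cluster status:", health.Status)
	if health.Status == "red" {
		// A failover script or load balancer health check could take
		// the node out of rotation at this point.
		os.Exit(1)
	}
}
```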

With the disclaimer that I may not have got all of this entirely right: does anyone have experience and advice on server set-up?

Kind regards, Jørn

kaicode commented 2 years ago

Some thoughts and suggestions here: Snowstorm docs - Load Balancing (new documentation).

We welcome input from the community on their load balancing experiences!

hurtigcodes commented 2 years ago

Very nice read, @kaicode! I have a question there.

If you run Option B and route traffic to a write instance, wouldn't it be sufficient to route only the requests that involve async jobs? I don't see a problem with "regular" write operations in a setup with 2+ Snowstorm instances and 1+ replicas.

Also, I was thinking about practical solutions while reading it. While it may be convenient to route traffic per HTTP method, because the registration of an asynchronous write operation is a POST, that approach will fail to find the status of an async job, since those requests are GETs, will it not? Is it then best to route all requests to certain endpoints, such as /imports, to my "write-only" instance?
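To illustrate what I mean, here is a minimal reverse-proxy sketch in Go that routes by path prefix rather than by HTTP method. The hostnames and the prefix list are placeholders (only /imports is taken from this discussion); the real list would come from the load-balancing documentation.

```go
package main

// Sketch of a reverse proxy that routes by path prefix instead of HTTP
// method: requests to asynchronous/write endpoints go to the single
// "write" instance, everything else goes to a read instance.
// Hostnames and prefixes below are placeholders, not a definitive list.

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	writeURL, _ := url.Parse("http://snowstorm-write:8080") // placeholder
	readURL, _ := url.Parse("http://snowstorm-read:8080")   // placeholder

	writeProxy := httputil.NewSingleHostReverseProxy(writeURL)
	readProxy := httputil.NewSingleHostReverseProxy(readURL)

	// Illustrative prefixes only; take the real list from the
	// Snowstorm load-balancing documentation.
	writeOnlyPrefixes := []string{"/imports", "/exports"}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		for _, prefix := range writeOnlyPrefixes {
			if strings.HasPrefix(r.URL.Path, prefix) {
				writeProxy.ServeHTTP(w, r)
				return
			}
		}
		readProxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

In practice the same rule could be expressed directly in nginx, HAProxy or a cloud load balancer as path-based routing; the sketch just spells the logic out.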

kaicode commented 2 years ago

If you are using classification functionality, scaling up the authoring instances or restarting one of them will fail all classification jobs.

Note: Please bear in mind that when any Snowstorm instance starts it will fail all running classification jobs because the application assumes that the authoring instance has been killed and restarted.

Feel free to raise a feature request for disabling this behaviour via configuration.


If you are not using classification, or you can manage restarts, then yes, running multiple authoring instances could work. I've listed the endpoints that would need specific routing rules below.

The following wildcard expressions match endpoints that have an asynchronous function, plus their associated endpoints; these should all be routed to a single instance because they are not backed by an index:

The following have asynchronous functionality that is backed by an index. The results could be fetched from any instance. Bear in mind that the operation will not complete if the instance running the operation is killed:

Disclaimer: this is the result of a code review I have just completed on Snowstorm 7.5.4. Additional asynchronous endpoints may be added in the future. Multiple authoring instances have not been tested by SNOMED International; I cannot see any reason why it would not work, but please carry out your own testing.
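As a rough sketch of how such wildcard routing rules could be evaluated in a proxy or load-balancer shim, the Go fragment below checks whether a request path must be pinned to the single authoring instance. The patterns are placeholders, not the actual list from the review above.

```go
package main

// Sketch: evaluating wildcard-style routing rules of the kind described
// above. The patterns are placeholders standing in for the real endpoint
// lists; "pin" means route to the single authoring instance because the
// job state is not backed by an index.

import (
	"fmt"
	"regexp"
)

// Placeholder patterns only -- substitute the actual endpoint list.
var pinToAuthoringInstance = []*regexp.Regexp{
	regexp.MustCompile(`^/imports(/.*)?$`),               // placeholder
	regexp.MustCompile(`^/[^/]+/classifications(/.*)?$`), // placeholder
}

// mustPin reports whether a request path has to stay on the single
// authoring (write) instance.
func mustPin(requestPath string) bool {
	for _, pattern := range pinToAuthoringInstance {
		if pattern.MatchString(requestPath) {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"/imports/123", "/MAIN/concepts/12345"} {
		fmt.Printf("%-25s pin to authoring instance: %v\n", p, mustPin(p))
	}
}
```

Requests in the second category (index-backed) would not need to be pinned, since their results could be fetched from any instance.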

hurtigcodes commented 2 years ago

Thanks, @kaicode! We will make sure to test this out.