Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

Queries Regarding HA Setup on EC2 #230

Closed · smkb80 closed this issue 7 years ago

smkb80 commented 7 years ago

Hi,

As part of a POC, I've set up Conductor Server, Dynomite, and Elasticsearch on an EC2 instance. Another EC2 instance of the same configuration hosts only a Conductor Server, which points to the Elasticsearch and Dynomite instances hosted on the first EC2. Both Conductor servers are behind an Elastic Load Balancer (ELB).

I've defined a workflow which contains just an HTTPTask that calls a REST endpoint available on the internet.

The total time taken to execute 1000 instances of the workflow is the same (around three and a half minutes) whether they run on a single Conductor server (after shutting down the second instance) or on the load-balanced servers. I am calculating the total time by subtracting the start time of the first instance from the end time of the last instance, as shown in the Workflow UI.

In both scenarios the 'startWorkflow' requests come through the ELB only.

Is this the expected behavior, or can improved performance be expected from the load-balanced servers? Is the DB the performance bottleneck in my setup described above?

Any pointer would be really helpful. Below are my Dynomite YAML and conductor.properties, respectively:

dyn_o_mite:
  datacenter: us-east-1
  dyn_listen: 0.0.0.0:8101
  data_store: 0
  listen: 0.0.0.0:8102
  dyn_seed_provider: simple_provider
  rack: us-east-1b
  servers:
  - 127.0.0.1:6379:1
  tokens: 437425602
  mbuf_size: 16384
  stats_listen: 0.0.0.0:22222
db=dynomite
port=8080
workflow.dynomite.cluster.hosts=127.0.0.1:8102:us-east-1b
workflow.dynomite.cluster.name=dyn_o_mite
workflow.namespace.prefix=conductor
workflow.namespace.queue.prefix=conductor_queues
queues.dynomite.threads=100
queues.dynomite.nonQuorum.port=6379
workflow.elasticsearch.url=127.0.0.1:9300
workflow.elasticsearch.index.name=conductor
EC2_AVAILABILITY_ZONE=us-east-1b

The properties file for the other Conductor server contains the private IP of the EC2 instance which hosts Elasticsearch and Dynomite.

v1r3n commented 7 years ago

Couple of questions:

  1. How many dynomite nodes do you run? Is it a single node server?
  2. The HTTP endpoint - how long does it take to execute the request?

Conductor's scalability depends on Dynomite for its queuing, so if you are running a single-node Dynomite server, at some point Dynomite will become the bottleneck unless you scale it up.

Three and a half minutes is roughly 210 seconds, so executing 1000 workflows in that time is about 5 per second. That is certainly too low, but it also depends on how long each HTTP task takes to execute.
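As a quick back-of-the-envelope check of the arithmetic above (the numbers come from this thread; nothing here is Conductor-specific):

```java
// Rough throughput math for the run described above:
// 1000 workflow instances completing in ~3.5 minutes end to end.
public class ThroughputCheck {
    public static void main(String[] args) {
        double totalSeconds = 3.5 * 60;           // ~210 s wall-clock
        double workflows = 1000;
        double perSecond = workflows / totalSeconds;
        System.out.printf("~%.1f workflows/sec%n", perSecond); // prints ~4.8 workflows/sec
    }
}
```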

Do you mind sharing your workflow definition?

smkb80 commented 7 years ago

Thanks @v1r3n . Below are the responses to your queries:

> How many dynomite nodes do you run? Is it a single node server?

> The HTTP endpoint - how long does it take to execute the request?

> Do you mind sharing your workflow definition?

smkb80 commented 7 years ago

Hi @v1r3n ,

Update - I've added another Dynomite node but don't see any performance improvement. Currently I have 2 EC2 instances with the following components:

First EC2 instance (172.31.52.170) - Redis, Dynomite, Elasticsearch, and Conductor Server
Second EC2 instance (172.31.60.235) - Redis, Dynomite, and Conductor Server

Both the EC2 instances are in us-east-1b

Following are the Conductor and Dynomite configurations:

Conductor Server 1

db=dynomite
port=8080
workflow.dynomite.cluster.hosts=127.0.0.1:8102:us-east-1b;172.31.60.235:8102:us-east-1b
workflow.dynomite.cluster.name=dyn_o_mite
workflow.namespace.prefix=conductor
workflow.namespace.queue.prefix=conductor_queues
queues.dynomite.threads=100
queues.dynomite.nonQuorum.port=6379
workflow.elasticsearch.url=127.0.0.1:9300
workflow.elasticsearch.index.name=conductor
EC2_AVAILABILITY_ZONE=us-east-1b

Conductor Server 2

db=dynomite
port=8080
workflow.dynomite.cluster.hosts=172.31.52.170:8102:us-east-1b;127.0.0.1:8102:us-east-1b
workflow.dynomite.cluster.name=dyn_o_mite
workflow.namespace.prefix=conductor
workflow.namespace.queue.prefix=conductor_queues
queues.dynomite.threads=100
queues.dynomite.nonQuorum.port=6379
workflow.elasticsearch.url=172.31.52.170:9300
workflow.elasticsearch.index.name=conductor
EC2_AVAILABILITY_ZONE=us-east-1b

Dynomite Node 1

dyn_o_mite:
  datacenter: us-east-1
  dyn_listen: 0.0.0.0:8101
  data_store: 0
  listen: 0.0.0.0:8102
  dyn_seed_provider: simple_provider
  dyn_seeds:
   - 172.31.60.235:8101:us-east-1b:us-east-1:1383429731
  rack: us-east-1b
  servers:
  - 127.0.0.1:6379:1
  tokens: 1383429731
  mbuf_size: 16384
  stats_listen: 0.0.0.0:22222

Dynomite Node 2

dyn_o_mite:
  datacenter: us-east-1
  dyn_listen: 0.0.0.0:8101
  data_store: 0
  listen: 0.0.0.0:8102
  dyn_seed_provider: simple_provider
  dyn_seeds:
   - 172.31.52.170:8101:us-east-1b:us-east-1:1383429731
  rack: us-east-1b
  servers:
  - 127.0.0.1:6379:1
  tokens: 1383429731
  mbuf_size: 16384
  stats_listen: 0.0.0.0:22222

Please suggest whether the above configurations are as they should be.

v1r3n commented 7 years ago

There are two properties that control how many tasks are polled from the queue for system tasks such as HTTP, and the number of threads used to process them.

workflow.system.task.worker.poll.count
workflow.system.task.worker.thread.count

I would suggest setting them to something around 25 or above (the default is just 5). There is also a poller which polls every 0.5 seconds, which I think might be too infrequent at times, so in the next version we will look into making it property-driven. You can set these values in the property file.

In one of my test environments, setting both of the above to 25 helped a lot. Mind trying it out and letting us know how it goes?

https://github.com/Netflix/conductor/blob/master/core/src/main/java/com/netflix/conductor/core/execution/tasks/SystemTaskWorkerCoordinator.java
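For reference, a minimal sketch of what those additions to conductor.properties might look like (the value 25 is the one suggested above; tune it for your workload):

```properties
# Increase system-task (e.g. HTTP) worker parallelism; both default to 5.
workflow.system.task.worker.poll.count=25
workflow.system.task.worker.thread.count=25
```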

smkb80 commented 7 years ago

Thanks @v1r3n !!

After introducing the 2 parameters you mentioned, with a value of 25 for both, a marked improvement has been observed. The 1000 parallel instances are finishing in two and a half minutes (instead of the earlier three and a half), irrespective of whether they run on a single Conductor server or on 2 load-balanced Conductor servers.

The time taken to execute the 1000 instances is now roughly equal to the time taken by my Java client (which uses the conductor-client APIs) to create them. Most importantly, the difference between the schedule time and start time of the HttpTask is now consistently ~600 ms across all 1000 instances, instead of up to a minute for the last 200-odd instances as before.

Thanks once again for your help!!

v1r3n commented 7 years ago

@smkb80 We are going to make the poll interval lower and configurable in the upcoming version (1.80). Also, experiment with the workflow.system.task.worker.queue.size parameter to increase the queue size (if you see a lot of queue-rejection logs), or keep the thread count higher than the poll count. The exact values will depend on the type of endpoint you are calling and the time it takes to process these requests.
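Putting the suggestions from this comment together, the conductor.properties tuning might look like the sketch below (the queue size of 200 is an arbitrary illustrative value, not a recommendation):

```properties
# Poll up to 25 system tasks per cycle, process them with 50 threads
# (keeping the thread count higher than the poll count, as suggested above).
workflow.system.task.worker.poll.count=25
workflow.system.task.worker.thread.count=50
# Increase if a lot of queue-rejection logs appear; the right value
# depends on your endpoint's latency. 200 is only an example.
workflow.system.task.worker.queue.size=200
```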

smkb80 commented 7 years ago

Hi @v1r3n ,

I've increased the thread count from 25 to 50; the poll count is 25. As a result of this change, 1000 parallel instances are executing in 46 seconds on a single server and 23 seconds on 2 load-balanced servers.

Thanks a lot for these pointers!!

sk523 commented 7 years ago

@smkb80 How are you deploying Dynomite in AWS? Can you please share?

smkb80 commented 7 years ago

@sk523 Please refer to the link below for Dynomite installation steps and configuration parameters.

https://github.com/Netflix/dynomite/blob/dev/README.md

I've set up Dynomite on 2 nodes. Please refer to my comment above for the Dynomite configuration YAMLs that I've used.

v1r3n commented 7 years ago

@smkb80 is this still an issue or can we close it?

smkb80 commented 7 years ago

Yes @v1r3n , we can close this. Thanks a lot for all your help!!