Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0
12.82k stars 2.34k forks source link

TaskRunnerConfigurer performance issue on server #1922

Open james-deee opened 4 years ago

james-deee commented 4 years ago

We were looking to migrate over to TaskRunnerConfigurer, but we noticed that there is no longer a concept of making the batch endpoint call. Instead, it appears that there's a polling thread 1:1 to a worker, and that polls the server for 1 task at a time.

It sounds like maybe this was done to avoid something like this: (https://github.com/Netflix/conductor/issues/694), but we have some concerns about performance of this new mechanism against our Conductor Server.

Now instead of making a single query against the sever to poll for tasks we'll be issuing X polling queries. Not only will this put undue stress on the Conductor Server itself, but it will ALSO put a lot of stress on the backing Mysql database as well.

Can we get some insight into this decision, and if there's a batching mode that will be available in TaskRunnerConfigurer before the old Coordinator is removed? It's a no-go for us in the current implementation of this for the performance reasons laid out above.

apanicker-nflx commented 4 years ago

it appears that there's a polling thread 1:1 to a worker....... Can we get some insight into this decision

This is correct, this was done to address the issue where a polled task could be lost if the instance that polled for it is terminated before processing the task in addition to #694. Additionally, the startTime on the task is misleading because the task could be polled and held in the client-side queue but not actually started processing by a worker thread. This could mean that the task could be timed_out by the server even if the task is polled and held in the client-side queue. Additionally, this reduces the network traffic generated by the client since we no longer need to ack task messages.

if there's a batching mode that will be available in TaskRunnerConfigurer

We could look into adding a batch poll but the size of the batch would still be limited to the number of available worker threads for processing, instead of size of an in-memory queue. The drawbacks of having an in-memory queue on the client far outweigh the pros, hence the decision to do away with it. Please let us know if this is an enhancement that you would like to contribute to the client.

james-deee commented 4 years ago

@apanicker-nflx thank you for the response.

Yeah we'll have to have batching available for us if we are going to move to TaskRunnerConfigurer. I haven't really dived into the code as of it, but will it be relatively trivial to know how many available workers there would be when it's time to poll again? If so then that would be pretty easy to add batching there i would think.

aravindanr commented 4 years ago

@james-deee TaskRunnerConfigurer.init schedules a poll for each worker. Instead of that, there could be one poller which contains a Semaphore with permits matching the threadCount. The pollAndExecute method can be modified to acquire a permit from this Semaphore before polling and the number of available permits can be from Semaphore.availablePermits().

james-deee commented 3 years ago

Thanks @aravindanr, if the Netflix is unable to get to this, we can try and take a look in the future; but I would imagine this is something that could be a breaker for people on Mysql + Postgres.

aravindanr commented 3 years ago

@james-deee I have marked this as an enhancement and we will try to address it as part of our Conductor 3.0 effort. This is less of a problem internally for us since we don't use MySQL or Postgres.

james-deee commented 2 years ago

@jxu-nflx YAY! Your PR should fix this issue right??!

https://github.com/Netflix/conductor/pull/3304