apache / aurora

Apache Aurora - A Mesos framework for long-running services, cron jobs, and ad-hoc jobs
https://aurora.apache.org
Apache License 2.0
632 stars 233 forks source link

Aurora pause abnormally in variable batch update strategy #89

Open chungjin opened 4 years ago

chungjin commented 4 years ago

When the variable batch size set to [5,5], with autopause to true, it should only pause twice, each happens after one batch completes. But in actual, aurora pause when batch updating is still in progress, and it pause 4 times in total.

How to reproduce? variable batch size [5,5], with auto pause to true, SLA sets to 70% percent.

ridv commented 4 years ago

Investing this, there is some bug where shards were already marked as failed are being retried again once a resume is sent.

So for example, let's we have to replace shards [0,1,2] and shard 0 ends up being marked as failed, shard 1 is successful, and we pause before we finish evaluating shard 2.

Our update is now in the following state: 0 - failed 1- success 2- working

When we resume the update, shard 0 is retried, even though it should be skipped due to the fact that it failed.

Working on a fix for this.

ridv commented 4 years ago

Correcting my previous statement, it's not the updater that's retrying shard 0, it's the scheduler itself which is the correct behavior. Even though the shard update failed, the failure is only a signal to the updating mechanism.

The unexpected number of pauses were indeed a bug, thanks for finding it @chungjin, I've landed a fix for this at https://github.com/apache/aurora/commit/3e31bc0f353e58e38d06745c259f492bc4ed76c5

Let's try and reproduce the issue. If we can confirm it as fixed, we can go ahead and close this issues as fixed.