[Core Feature] Support pause and unpause in scheduler in case admin is down

pmahindrakar-oss commented 3 years ago

Motivation: Why do you think this is important? Scheduler uses admin for sending workflow executions at the scheduled time. When admin is down, we currently fall back on a retry mechanism that goes through a configured number of steps using exponential backoff.

Admin being down is a special case where all scheduled workflows will enter this retry loop. The proposal is to introduce a pause mechanism for the scheduler, where all existing schedules are paused while admin is down and resumed when admin is healthy again. We will use admin's health API to determine this.
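To make the proposal concrete, here is a minimal sketch of a pause gate driven by a health check. Everything in it is hypothetical and not taken from the scheduler code base: `HealthChecker` stands in for whatever call wraps admin's health API, and `PauseGate`, `Observe`, and `Watch` are illustrative names. It only shows the state transitions (healthy to paused, paused to resumed), not how the schedules themselves are stopped.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// HealthChecker is a hypothetical stand-in for a call to admin's health API.
// It returns nil when admin is healthy.
type HealthChecker func(ctx context.Context) error

// PauseGate records whether scheduling is paused and when the pause began.
type PauseGate struct {
	mu       sync.Mutex
	paused   bool
	pausedAt time.Time
}

// Observe feeds one health-check result into the gate. It returns the pause
// start time and true only on the transition from paused back to healthy.
func (g *PauseGate) Observe(healthy bool, now time.Time) (time.Time, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	switch {
	case !healthy && !g.paused:
		g.paused, g.pausedAt = true, now
	case healthy && g.paused:
		g.paused = false
		return g.pausedAt, true
	}
	return time.Time{}, false
}

// Paused reports whether schedules should currently be held back.
func (g *PauseGate) Paused() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.paused
}

// Watch polls the health check at the given interval until ctx is cancelled
// and calls onResume with the pause start time whenever admin recovers.
func Watch(ctx context.Context, g *PauseGate, check HealthChecker,
	interval time.Duration, onResume func(pausedAt time.Time)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			if since, resumed := g.Observe(check(ctx) == nil, now); resumed {
				onResume(since)
			}
		}
	}
}

func main() {
	g := &PauseGate{}
	t0 := time.Date(2021, 1, 1, 0, 0, 0, 0, time.UTC)

	g.Observe(false, t0) // admin reported unhealthy: pause
	fmt.Println("paused:", g.Paused())

	since, resumed := g.Observe(true, t0.Add(5*time.Minute)) // admin is back
	fmt.Println("resumed:", resumed, "paused at:", since.Format(time.RFC3339))
}
```

In this sketch each schedule goroutine would consult `Paused()` before calling admin, and `onResume` would be the hook that triggers the catch-up.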

When unpaused, the scheduler will catch up on all the schedules that were missed since the pause time.
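As an illustration of what catching up could mean, the sketch below computes the fire times a fixed-rate schedule missed between its last successful run and the moment the scheduler resumes. This is an assumption-laden example: `missedRuns` and `demo` are hypothetical names, only fixed-rate schedules are covered (cron schedules would need a cron parser), and the real catch-up logic may differ.

```go
package main

import (
	"fmt"
	"time"
)

// missedRuns returns every fire time of a fixed-rate schedule that falls
// after lastFired and at or before resumedAt. lastFired is assumed to be the
// last scheduled time that was sent to admin before the pause.
func missedRuns(lastFired, resumedAt time.Time, every time.Duration) []time.Time {
	var runs []time.Time
	if every <= 0 {
		return runs // guard against an invalid rate to avoid an endless loop
	}
	for t := lastFired.Add(every); !t.After(resumedAt); t = t.Add(every) {
		runs = append(runs, t)
	}
	return runs
}

// demo uses made-up timestamps: a 10-minute schedule that last fired at 10:00
// and a scheduler that resumes at 10:35.
func demo() []time.Time {
	lastFired := time.Date(2021, 1, 1, 10, 0, 0, 0, time.UTC)
	resumedAt := time.Date(2021, 1, 1, 10, 35, 0, 0, time.UTC)
	return missedRuns(lastFired, resumedAt, 10*time.Minute)
}

func main() {
	for _, t := range demo() {
		fmt.Println("catch-up execution for scheduled time", t.Format(time.RFC3339))
	}
}
```

With these example values the missed runs are 10:10, 10:20, and 10:30. Each one would be submitted with its original scheduled time rather than the resume time.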

cc : @EngHabu @kumare3 @evalsocket

Goal: What should the final outcome look like, ideally? The scheduler pauses on admin down and unpauses when admin comes back up healthy and catches up on all the schedules.

Describe alternatives you've considered The existing retry mechanism works, but it unnecessarily leaves all goroutines looping in retries.

kumare3 commented 3 years ago

@pmahindrakar-oss I think this would complicate the scheduler and, IMO, impact its overall stability. I think we should ask the fundamental question: why would FlyteAdmin be unavailable? If it were a separate microservice, you would just introduce a throttle/backpressure handling mechanism in the client, right? And if I am correct, you already have an exponential backoff and retry system.

One improvement we can make in the scheduler is to limit the number of goroutines that can be spawned. This would limit the risk of creating a thundering herd against FlyteAdmin when it is down because of load.

A good pattern for this is the one employed by the controller mechanism in K8s. We can simply use such a queue + worker pool mechanism to elegantly control the throughput and create backpressure.
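A rough sketch of the queue + worker pool idea, using only a buffered channel and a fixed number of goroutines. This is not the K8s workqueue itself (controllers use a rate-limiting work queue from client-go); it is a simplified stand-in, and `Job`, `Pool`, `Submit`, and `TrySubmit` are hypothetical names. The point is that the number of concurrent calls to admin is capped by the worker count, and a full queue pushes back on the producer.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a hypothetical unit of work: one scheduled execution to send to admin.
type Job struct {
	ScheduleName string
}

// Pool runs a fixed number of workers that drain a bounded queue.
type Pool struct {
	queue chan Job
	wg    sync.WaitGroup
}

// NewPool starts `workers` goroutines; at most that many handle calls run at once.
func NewPool(workers, queueSize int, handle func(Job)) *Pool {
	p := &Pool{queue: make(chan Job, queueSize)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.queue {
				handle(job)
			}
		}()
	}
	return p
}

// Submit blocks while the queue is full, which is the backpressure:
// producers slow down instead of spawning more goroutines.
func (p *Pool) Submit(j Job) {
	p.queue <- j
}

// TrySubmit is the non-blocking variant; it reports false when the queue is full.
func (p *Pool) TrySubmit(j Job) bool {
	select {
	case p.queue <- j:
		return true
	default:
		return false
	}
}

// Close stops accepting jobs and waits for queued jobs to finish.
func (p *Pool) Close() {
	close(p.queue)
	p.wg.Wait()
}

// runDemo pushes 10 jobs through 2 workers and returns how many were handled.
func runDemo() int {
	var mu sync.Mutex
	handled := 0
	p := NewPool(2, 4, func(j Job) {
		mu.Lock()
		handled++
		mu.Unlock()
	})
	for i := 0; i < 10; i++ {
		p.Submit(Job{ScheduleName: fmt.Sprintf("schedule-%d", i)})
	}
	p.Close()
	return handled
}

func main() {
	fmt.Println("handled:", runDemo())
}
```

The existing exponential backoff would live inside `handle`, so retries stay bounded by the worker count instead of one retrying goroutine per schedule.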

flyteorg / flyte

[Core Feature] Support pause and unpause in scheduler in case admin is down #1422