argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.11k stars 3.21k forks source link

Proposal: `argo cron suspend-all` / `argo cron resume-all` Command #13595

Open hmoulart opened 2 months ago

hmoulart commented 2 months ago

Proposal: argo cron suspend-all / argo cron resume-all Command

Background and Motivation

Currently, Argo Workflows lacks a native way to suspend multiple workflows simultaneously. Users managing large-scale deployments (500+ workflows) face significant time and resource constraints when trying to suspend all workflows, especially in critical situations requiring quick action.

The existing method of suspending workflows individually or through scripted solutions is inefficient and time-consuming, often taking up to 30 minutes for 500+ workflows. This limitation poses a significant operational challenge and may impact system performance and reliability in production environments.

argo cron list -o name | awk '{print "argo cron suspend "$1}' | bash
argo cron list -o name | awk '{print "argo cron resume "$1}' | bash

Note: A similar demand was raised on Apr 11, 2023 (https://github.com/argoproj/argo-workflows/issues/10876), but the proposal was incomplete and did not gain traction. This proposal aims to provide a more comprehensive and actionable solution to address this long-standing need.

Proposed Feature

We propose adding a new command to the Argo CLI: argo cron suspend-all. This command would provide a native, efficient way to suspend multiple workflows at once, significantly improving the management of large-scale Argo Workflows deployments.

Command Syntax

argo cron suspend-all [-n namespace] [--selector=<label-selector>] [--dry-run] [--force]

Parameters

Behavior

  1. The command will identify all running workflows in the specified namespace (or across all namespaces if not specified).
  2. If a selector is provided, it will filter the workflows based on the given labels.
  3. It will display a summary of workflows to be suspended and prompt for confirmation (unless --force is used).
  4. Upon confirmation, it will suspend all identified workflows in parallel, with appropriate rate limiting to prevent API overload.
  5. After completion, it will display a summary of the results (e.g., "Successfully suspended 495/500 workflows").

Example Usage

$ argo cron suspend-all -n production --selector=env=staging

Found 527 running workflows in namespace 'production' matching selector 'env=staging'.

Operation completed. Results:
- Successfully suspended: 525 workflows
- Failed to suspend: 2 workflows
- Errors:
  - workflow-abc: Permission denied
  - workflow-xyz: Workflow already completed

Time taken: 45 seconds

Implementation Considerations

  1. Performance: Implement parallel processing with appropriate rate limiting to handle large numbers of workflows efficiently.
  2. Error Handling: Provide clear error messages and continue processing other workflows if suspension fails for some.
  3. Reversibility: Consider adding a complementary resume-all command for easy reversal of this operation.
  4. Logging: Ensure detailed logs are generated for auditing purposes.
  5. Testing: Include comprehensive unit and integration tests, especially for large-scale scenarios.

Benefits

  1. Significantly reduced time for bulk workflow management operations.
  2. Improved disaster recovery and maintenance procedures.
  3. Enhanced user experience for large-scale Argo Workflows deployments.
  4. Standardized approach to mass workflow suspension, reducing the need for custom scripts.

Next Steps

  1. Gather community feedback on this proposal.
  2. Refine the design based on maintainer and user input.
  3. Create a detailed technical design document if required.
  4. Implement the feature with thorough testing and documentation.

We believe this feature would greatly enhance Argo Workflows' capabilities in managing large-scale deployments and improve overall user experience.

agilgur5 commented 2 months ago

We propose adding a new command to the Argo CLI: argo suspend-all

There is already a keyword @latest in the CLI, so rather than a new command, it could be a new keyword @all

through scripted solutions is inefficient and time-consuming, often taking up to 30 minutes for 500+ workflows.

argo cron list -o name | awk '{print "argo cron suspend "$1}' | bash

If I'm not mistaken, this is being run in serial, hence the slow throughput. Parallelizing this should be substantially faster.

Note: A similar demand was raised on Apr 11, 2023 (#10876)

Also that and your above command are for argo cron suspend, not argo suspend. Please differentiate the two

Behavior

Having a confirmation, progress bar, etc is quite complicated and involved; I would strongly suggest having an MVP for a first pass rather than all these requirements.

Retrieving all Workflows and then suspending one by one for progress reporting is also multiple API requests, whereas nearly all the CLI commands correspond 1-to-1 to an API method. That ensures consistency and simplicity

hmoulart commented 2 months ago

Hello, Thank you for your answer and suggestions! I've updated the description to simplify the behavior. I'm going to try parallelizing the command: argo cron list -o name | xargs -P 4 -I {} argo cron suspend {}

This command uses xargs with the -P flag to run up to 4 parallel processes, which should improve performance when suspending multiple cron workflows. The -I {} flag allows us to use {} as a placeholder for each workflow name. Let me know if you need any further clarification or have any questions about this approach.

agilgur5 commented 2 months ago

I've updated the description to simplify the behavior

It looks like you only removed the progress bar element? I would suggest, rather than removing, split the "Behavior" section with a subsection called "Stretch Goals", and move certain features like progress bar there