[FEATURE IDEA] Handle expected exceptions gracefully in workflows

mattdurant commented 1 month ago

Is your feature request related to a problem? Please describe.

The current architecture for actions gives the user no way to decide what happens when a "failure" occurs. The current design is good enough for 5xx errors, where the failure is unexpected, but we use raise_for_status in a lot of places, and it fires on 4xx statuses as well.

As an example, the Okta API will return a 404 if you perform an action on a user ID that doesn't exist. I see many situations where you may have a user that doesn't exist in every system, so I would want to attempt to disable a user if they exist, but if that returns a 404, I wouldn't want to completely stop execution and show an error in the workflow status.

Describe the solution you'd like

Some method of putting control into the user's hands for what to do when a "failure" occurs on an action. The low-hanging-fruit answer would be a boolean input on the action, continue_on_failure or something like that, which either swallows the exception and returns a different value, or lets the exception propagate normally and halt execution.
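
A minimal sketch of what such a flag might look like. The `run_action` wrapper, its `fallback` parameter, and the `disable_user` stand-in are hypothetical names used for illustration only; none of them are existing Tracecat APIs.

```python
from typing import Any, Callable


def run_action(
    action: Callable[..., Any],
    *args: Any,
    continue_on_failure: bool = False,
    fallback: Any = None,
    **kwargs: Any,
) -> Any:
    """Run an action; optionally swallow its exception and return a fallback.

    Hypothetical helper for illustration only, not part of Tracecat.
    """
    try:
        return action(*args, **kwargs)
    except Exception:
        if continue_on_failure:
            # Swallow the error and hand downstream actions a fallback value
            return fallback
        # Default behaviour: let the exception propagate and halt execution
        raise


def disable_user(user_id: str) -> dict[str, str]:
    # Stand-in for an integration call that fails for unknown users
    raise LookupError(f"user {user_id} not found")


result = run_action(
    disable_user,
    "00u123",
    continue_on_failure=True,
    fallback={"status": "skipped"},
)
```

With the flag left at its default of `False`, the same call would raise `LookupError` and the workflow would stop as it does today.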

Describe alternatives you've considered

There was discussion of a future feature that would give the user an alternate path to hook into when an action fails. I think the issue is accurately determining what a "failure" is. For the Okta example I gave, I would consider a 5xx a failure, whereas a 4xx indicates that the action itself was fine and either the user did not exist or the user was not in the correct state for the requested action. You may want to suspend a user who has an alert generated for malware activity, but maybe the user is already suspended in Okta. The current architecture would terminate the workflow when you attempt that action.
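
As a rough illustration of that distinction, the sketch below sorts HTTP status codes the way the Okta example describes. The `Outcome` enum and `classify_status` function are hypothetical names, and treating every 4xx as "request fine, target not in the needed state" is an assumption taken from this example rather than a general rule.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    PRECONDITION_NOT_MET = "precondition_not_met"  # e.g. user missing or already suspended
    FAILURE = "failure"


def classify_status(status_code: int) -> Outcome:
    """Map an HTTP status code to an outcome (illustrative assumption only)."""
    if status_code >= 500:
        # The service itself broke: a genuine failure
        return Outcome.FAILURE
    if status_code >= 400:
        # The request was handled, but the target was not in the needed state
        return Outcome.PRECONDITION_NOT_MET
    # This sketch treats everything below 400 as OK
    return Outcome.OK
```

A workflow could then halt only on `Outcome.FAILURE` and let the user decide what to do with the other two outcomes.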

topher-lo commented 1 month ago

Problem Breakdown

We want to gracefully deal with three different final states of an action:

Success
Expected failure
Unexpected failure

Under the hood

A "failure" is equivalent to an exception raised from the Python Action / integration side.
Status quo: any uncaught exception == unexpected failure
Outputs from "expected failures" (e.g. the HTTP 404 error message and JSON body, if any) are not managed or forwardable to downstream actions in a workflow

Suggestion

Use try/except in Actions / Integrations sparingly (i.e. users should have flexibility regarding what is considered an exception, e.g. one person's failed workflow given a 404 might be another person's branch in a playbook).
Extend Tracecat's integrations decorator pattern to support capturing a specified list of exceptions

In Tracecat YAML expressions, you can then write (open to suggestions on naming):

run_if: {{ ACTIONS.action_ref.is_expected_fail }}
...
run_if: {{ ACTIONS.action_ref.is_success }}
...
run_if: {{ ACTIONS.action_ref.is_unexpected_fail }}

For example:

@registry.register(
    default_title="Get Sentinel One agents by username",
    description="Find Sentinel One agent(s) by the last used username field",
    display_group="Sentinel One",
    namespace="integrations.sentinel_one",
    secrets=[sentinel_one_secret],
    # Accepts either lambdas (each expected to return a bool) or a list of exception types.
    # Note: httpx exposes status_code as an int, so compare against ints, not strings.
    expected_errors=[lambda error: error.response.status_code in (404, 503)],
    # or: forwarded_exceptions=[httpx.ReadTimeout, httpx.WriteTimeout]  for example
)
async def get_sentinelone_agents_by_username(
    username: Annotated[str, Field(..., description="Username to search for")],
    exact_match: Annotated[
        bool,
        Field(
            ..., description="Exact match only, otherwise partial matches are returned"
        ),
    ],
) -> list[dict[str, Any]]:
    api_token = os.getenv("SENTINEL_ONE_API_TOKEN")
    base_url = os.getenv("SENTINEL_ONE_BASE_URL")
    headers = {
        "Authorization": f"ApiToken {api_token}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"{base_url}/{AGENT_ENDPOINT}?lastLoggedInUserName__contains={username}",
            headers=headers,
        )
        response.raise_for_status()
        results = []
        if exact_match:
            for agent in response.json()["data"]:
                if agent["lastLoggedInUserName"].lower() == username.lower():
                    results.append(agent)
        else:
            results = response.json()["data"]

        return results
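
To make the idea a bit more concrete, here is a rough, self-contained sketch of how a decorator could honour an `expected_errors` list that mixes predicates and exception types, and expose the three states used in the `run_if` expressions. Everything here is hypothetical: `capture_expected`, `ActionResult`, and `FakeStatusError` are illustration-only names, not the actual `registry.register` implementation, and the sketch uses only the standard library rather than httpx. It also assumes unexpected failures are recorded rather than re-raised, leaving the decision to halt to the caller.

```python
import asyncio
import functools
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional, Union

# A rule is either an exception class or a predicate taking the raised error
ExpectedError = Union[type, Callable[[Exception], bool]]


@dataclass
class ActionResult:
    """Hypothetical container exposing the three proposed final states."""

    output: Any = None
    error: Optional[Exception] = None
    expected: bool = False

    @property
    def is_success(self) -> bool:
        return self.error is None

    @property
    def is_expected_fail(self) -> bool:
        return self.error is not None and self.expected

    @property
    def is_unexpected_fail(self) -> bool:
        return self.error is not None and not self.expected


def _matches(error: Exception, rule: ExpectedError) -> bool:
    if isinstance(rule, type):
        return isinstance(error, rule)
    try:
        return bool(rule(error))
    except AttributeError:
        # The predicate read an attribute this error does not have
        return False


def capture_expected(expected_errors: list[ExpectedError]):
    def decorator(fn: Callable[..., Awaitable[Any]]):
        @functools.wraps(fn)
        async def wrapper(*args: Any, **kwargs: Any) -> ActionResult:
            try:
                return ActionResult(output=await fn(*args, **kwargs))
            except Exception as error:
                # Record the error and whether any rule marks it as expected
                expected = any(_matches(error, rule) for rule in expected_errors)
                return ActionResult(error=error, expected=expected)

        return wrapper

    return decorator


class FakeStatusError(Exception):
    """Stand-in for an HTTP client error carrying a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


@capture_expected([lambda error: error.status_code in (404, 503), TimeoutError])
async def lookup(status_code: int) -> str:
    if status_code >= 400:
        raise FakeStatusError(status_code)
    return "ok"


ok = asyncio.run(lookup(200))
missing = asyncio.run(lookup(404))
broken = asyncio.run(lookup(500))
```

In this sketch `missing.is_expected_fail` and `broken.is_unexpected_fail` are both true, and the captured exception stays available on `.error` so a downstream action could read the response details.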

mattdurant commented 1 month ago

I like this idea, but I would add that as a user, I may want to ignore even the unexpected errors. Basically, I may have a branching workflow, but not all paths are "critical", meaning if a step in that path fails, I want to continue on and not necessarily end the workflow as errored. I think there is still room here for some user choice on what to do when a step fails: something as simple as a "Continue on fail" checkbox on every action, letting them choose which steps aren't critical to the success of the workflow.

topher-lo commented 1 month ago

I may have a branching workflow, but not all paths are "critical", meaning if a step in that path fails, I want to continue on and not necessarily end the workflow as errored.

Like this? https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepscontinue-on-error

Can add an equivalent UI button in the frontend as well.
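
A small sketch of how that could behave at the workflow level, modelled loosely on the GitHub Actions `continue-on-error` semantics. The `Step` dataclass and `run_workflow` function are hypothetical and only illustrate the idea; they are not how Tracecat schedules actions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[], object]
    continue_on_error: bool = False  # hypothetical per-action flag


def run_workflow(steps: list[Step]) -> tuple[str, dict[str, str]]:
    """Run steps in order; return the workflow status and per-step statuses."""
    statuses: dict[str, str] = {}
    for step in steps:
        try:
            step.run()
            statuses[step.name] = "success"
        except Exception:
            statuses[step.name] = "failed"
            if not step.continue_on_error:
                # A critical step failed: stop here and mark the workflow errored
                return "error", statuses
    return "success", statuses


def failing() -> None:
    raise RuntimeError("enrichment service unavailable")


status, per_step = run_workflow(
    [
        Step("enrich_alert", failing, continue_on_error=True),
        Step("open_ticket", lambda: None),
    ]
)
```

Here the non-critical `enrich_alert` step is recorded as failed, but the workflow carries on to `open_ticket` and still finishes with a `"success"` status.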

mattdurant commented 1 month ago

Yes, exactly!

TracecatHQ / tracecat