TracecatHQ / tracecat

The open source Tines / Splunk SOAR alternative.
https://tracecat.com
GNU Affero General Public License v3.0

[FEATURE IDEA] Handle expected exceptions gracefully in workflows #265

Open mattdurant opened 1 month ago

mattdurant commented 1 month ago

Is your feature request related to a problem? Please describe.

The current architecture for actions gives the user no way to decide what happens when a "failure" occurs. The current design is good enough for 5xx errors, where the failure is unexpected, but we're using raise_for_status in a lot of places, and it fires on 4xx statuses as well.

As an example, the Okta API will return a 404 if you perform an action on a user ID that doesn't exist. I see many situations where you may have a user that doesn't exist in every system, so I would want to attempt to disable a user if they exist, but if that returns a 404, I wouldn't want to completely stop execution and show an error in the workflow status.

Describe the solution you'd like

Some method of putting control into the user's hands for what to do when a "failure" occurs on an action. The low-hanging-fruit answer would be to add a bool input on the action (continue_on_failure or something like that) that lets the user choose whether to swallow the exception and return a different value, or let the exception raise normally and halt execution.
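To make the continue_on_failure idea concrete, here is a minimal sketch in plain Python. It is not how Tracecat implements actions; `with_continue_on_failure`, `disable_user`, and the shape of the returned dict are all hypothetical names chosen for illustration.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


def with_continue_on_failure(
    action: Callable[..., Awaitable[Any]],
) -> Callable[..., Awaitable[Any]]:
    """Wrap an async action so the caller decides whether a failure halts execution."""

    @functools.wraps(action)
    async def wrapper(
        *args: Any, continue_on_failure: bool = False, **kwargs: Any
    ) -> Any:
        try:
            return await action(*args, **kwargs)
        except Exception as exc:
            if not continue_on_failure:
                raise  # default behaviour: the exception propagates and halts the workflow
            # Swallow the exception and return a value downstream steps can inspect.
            return {"ok": False, "error": repr(exc)}

    return wrapper


async def disable_user(user_id: str) -> dict[str, Any]:
    # Hypothetical stand-in for a real API call; pretends the user does not exist.
    raise LookupError(f"user {user_id} not found")


safe_disable_user = with_continue_on_failure(disable_user)
result = asyncio.run(safe_disable_user("00u123", continue_on_failure=True))
print(result)
```

With the flag left at its default of False, the same call raises LookupError, which matches today's behaviour.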

Describe alternatives you've considered

There was discussion of a future feature that would give the user an alternate path to hook into when an action fails. I think the issue is accurately determining what a "failure" is. For the Okta example I gave, I would consider a 5xx a failure, but a 4xx indicates that the action itself was fine and either the user did not exist or the user was not in the correct state for the requested action. You may want to suspend a user that has an alert generated for malware activity, but maybe the user is already suspended in Okta. The current architecture would terminate the workflow when you attempt that action.
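The 4xx versus 5xx distinction above could be expressed as a small classifier. This is only a sketch: `Outcome`, `classify_status`, and the default tolerated set of {404} are assumptions for illustration, not part of Tracecat or the Okta API.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    EXPECTED = "expected"  # a 4xx the workflow author chose to tolerate
    FAILURE = "failure"    # every other error status, including all 5xx


def classify_status(
    status_code: int, tolerated: frozenset[int] = frozenset({404})
) -> Outcome:
    """Map an HTTP status code to a workflow outcome."""
    if status_code < 400:
        return Outcome.SUCCESS
    # Only client errors can be tolerated; a 5xx is always a failure.
    if status_code < 500 and status_code in tolerated:
        return Outcome.EXPECTED
    return Outcome.FAILURE


print(classify_status(404))                         # user does not exist
print(classify_status(409, frozenset({404, 409})))  # e.g. user in the wrong state
print(classify_status(503))                         # upstream outage
```

Which 4xx codes count as "fine" differs per API and per workflow, so the tolerated set is a parameter rather than a fixed rule.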

Additional context

topher-lo commented 1 month ago

Problem Breakdown

We want to gracefully deal with three different action (final) states:

Under the hood

Suggestion

For example:

@registry.register(
    default_title="Get Sentinel One agents by username",
    description="Find Sentinel One agent(s) by the last used username field",
    display_group="Sentinel One",
    namespace="integrations.sentinel_one",
    secrets=[sentinel_one_secret],
    # Accepts either a lambda (expects a bool as output) or a list of exceptions.
    # httpx exposes status_code as an int, so compare against ints, not strings.
    expected_errors=[lambda error: error.response.status_code in (404, 503)],
    # or: forwarded_exceptions=[httpx.ReadTimeout, httpx.WriteTimeout]  for example
)
async def get_sentinelone_agents_by_username(
    username: Annotated[str, Field(..., description="Username to search for")],
    exact_match: Annotated[
        bool,
        Field(
            ..., description="Exact match only, otherwise partial matches are returned"
        ),
    ],
) -> list[dict[str, Any]]:
    api_token = os.getenv("SENTINEL_ONE_API_TOKEN")
    base_url = os.getenv("SENTINEL_ONE_BASE_URL")
    headers = {
        "Authorization": f"ApiToken {api_token}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"{base_url}/{AGENT_ENDPOINT}?lastLoggedInUserName__contains={username}",
            headers=headers,
        )
        response.raise_for_status()
        results = []
        if exact_match:
            for agent in response.json()["data"]:
                if agent["lastLoggedInUserName"].lower() == username.lower():
                    results.append(agent)
        else:
            results = response.json()["data"]

        return results
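
One way the proposed expected_errors list could be evaluated, as a rough sketch only: `is_expected`, `ErrorMatcher`, and `FakeStatusError` are hypothetical names, and the example uses only the standard library rather than httpx so it can run on its own.

```python
from typing import Callable, Union

# A matcher is either an exception class or a predicate returning a bool.
ErrorMatcher = Union[type[BaseException], Callable[[BaseException], bool]]


def is_expected(error: BaseException, expected_errors: list[ErrorMatcher]) -> bool:
    """Return True if any matcher accepts the raised error."""
    for matcher in expected_errors:
        if isinstance(matcher, type):
            if isinstance(error, matcher):
                return True
            continue
        try:
            if matcher(error):
                return True
        except AttributeError:
            # The predicate probed an attribute this error type lacks
            # (e.g. a status code on a timeout), so it does not match.
            continue
    return False


class FakeStatusError(Exception):
    """Hypothetical stand-in for an HTTP status error carrying an int code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


expected = [TimeoutError, lambda e: e.status_code in (404, 503)]
print(is_expected(FakeStatusError(404), expected))
print(is_expected(FakeStatusError(500), expected))
print(is_expected(ValueError("boom"), expected))
```

Catching AttributeError inside the loop matters when lambdas and exception classes are mixed, because a status-code predicate would otherwise crash on errors that carry no response.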

mattdurant commented 1 month ago

I like this idea, but I would add that as a user, I may want to ignore even the unexpected errors. Basically, I may have a branching workflow, but not all paths are "critical", meaning if a step in that path fails, I want to continue on and not necessarily end the workflow as errored. I think there is still room here for some user choice on what to do when the step fails; something as simple as a checkbox on every action that says "Continue on fail" and let them choose which steps aren't critical to the success of the workflow.

topher-lo commented 1 month ago

"I may have a branching workflow, but not all paths are "critical", meaning if a step in that path fails, I want to continue on and not necessarily end the workflow as errored."

Like this? https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepscontinue-on-error

Can add an equivalent UI button in the frontend as well.

mattdurant commented 1 month ago

Yes, exactly!