V7 Add transient error handling

cronchie commented 3 weeks ago

Draft requirement:

Verify that transient errors are handled with a retry mechanism that includes backoff and jitter, ensuring retries are spaced effectively to prevent overwhelming the system or external service, and avoiding retries for errors that are not transient.

I propose this has an availability impact because without a retry and backoff strategy, clients may repeatedly hammer a server when transient errors (like timeouts or temporary unavailability) occur. This could overwhelm the target service or network, creating a DoS. As for the inverse-- there's no point retrying other kinds of errors (AuthN / AuthZ / 400 Bad Request) and doing so can also impact availability.

ryarmst commented 3 weeks ago

I think this is a good mechanism, but perhaps out of scope. There are many ways an app could be designed such that a sort of self-DoS could occur and there is not an immediate vector to abuse it.

cronchie commented 3 weeks ago

There are many ways an app could be designed such that a sort of self-DoS could occur and there is not an immediate vector to abuse it.

Very true! Though I suggest that this is DoSing an external system. For instance, if a web app keeps retrying an operation with no backoff time, that could impact the service's availability.

elarlang commented 3 weeks ago

potentially related: https://github.com/OWASP/ASVS/issues/1778

tghosth commented 3 weeks ago

@cronchie could you take a look at #1778 and see if the content there is similar to what you mention here?

elarlang commented 2 weeks ago

I rechecked, that this issue is not a duplicate of #1778, although related.

From the issue description:

I propose this has an availability impact because, without a retry and backoff strategy, clients may repeatedly hammer a server when transient errors (like timeouts or temporary unavailability) occur. This could overwhelm the target service or network, creating a DoS. As for the inverse-- there's no point retrying other kinds of errors (AuthN / AuthZ / 400 Bad Request) and doing so can also impact availability.

Point of view: "save 3rd party service from DoS". Based on this description I would say, that every service must defend itself and can not rely on the hope their users are nice.

There are some risks for the application itself, when hammering other services - the application may get blocked from using the service - and this is a clear denial of service for the application itself (that is in scope for ASVS), not (only) for the external service (that is out of scope for ASVS).

How to solve the problem - as it may be done and achieved in many different ways, we just can highlight the problem in a documentation requirement.

We have a documentation requirement:

1.14.7 Verify that all communication needs for the application are documented. This should include external services which the application relies upon and cases where an end user might be able to provide an external location to which the application will then connect.

We can ask "Document how the connection to service should fail, and if retry is allowed, then retry strategy."

In case there is need to create a new documentation requirement for #1778, it may fit there better.

tghosth commented 1 week ago

Ok. I have read back through this and I think that there are two main considerations here:

1) How errors are handled between different components of the application in scope. 2) How the app handles an external service not being available.

I actually think that the new 7.4.4 could handle both these situations with some tailoring.

I would therefore suggest the following:

#	Description	L1	L2	L3	CWE
7.4.4	[ADDED] Verify that the application is designed so that a failure to access internal or external resources does not result in the entire application failing. For example, by using a gradual retry mechanism or the circuit breaker pattern.		✓	✓

Thoughts @cronchie @elarlang ?

elarlang commented 1 week ago

I don't think 7.4.4 should cover the retry issue.

Retry actually works against it - for synchronous flow, as most of the HTTP request-responses are - having retry for connection is just increasing the risk that we try to solve here.

OWASP / ASVS

V7 Add transient error handling #2281