dapr / components-contrib

Community driven, reusable components for distributed apps
Apache License 2.0
544 stars 473 forks source link

Resiliency for Components #1863

Open halspang opened 2 years ago

halspang commented 2 years ago

Describe the proposal

With Resiliency being introduced into the Dapr runtime in 1.7 (as a preview feature), we should now consider how Resiliency interacts with components and if there is a way to get them to utilize the runtime's resiliency.

Tenets

Background

Components across the various building blocks have varying levels of retries. Some components, such as the Redis statestore offer both retries and timeouts. Other components offer just retry toggling like the Kafka pubsub which can enable consumption retries. Further still, some components don't offer traditional retries but rather the service they interact with has their own semantic version of retries. A prime example of this is redelivery count for Azure Service Bus. Exhausting the redelivery count can even cause messages to enter a Dead Letter Queue, which is another avenue of resiliency.

The moral of the story here is that component level retries are vast and varied. What works for one component may not fully cover another, or worse, may break some semantic functionality.

An Important Distinction

Pubsub redelivery (and by extension message binding redelivery) was briefly touched on above. It is worth calling out directly though because it has a distinct difference in functionality as compared to standard retries. Specifically, a redelivery of a message can result in a different host. This is a benefit of the pubsub architecture in that you can be resilient to host failures. Utilizing just the runtime retries here would mean you stuck to a host which, if it was dead, would reduce the delivery rate of your messages. As such, this is something that should be strongly considered and/or maintained while interacting with runtime resiliency.

Impact on Stable Certification

Handling resiliency correctly should become a future bar for becoming a stable component. Any component that is already stable will have it added over time, similar to how we handled certification tests.

Solution

The crux of this design will be a new interface that components will be expected to implement in order to be considered "resilient". The interface would define a few methods for resilient operation:

type Resilient interface {
    // Determine if resiliency should be used in place of component retries and disable/enable as needed.
    modifyComponentBehaviorForResiliency(metadata map[string]string) error
}

By making this into an interface, it forces the implementer to consider the most important point of a resilient component which is how component retries and dapr retries.

It was considered to add a method to the interface that would translate the component errors into a permanent/retryable error which the runtime understands. This was decided against because, upon initial inspection, there are fairly few errors that are actually permanent. Generally, these are bad requests, unauthenticated (handled on init) or perhaps some very specific errors (e.g. etags). With these cases, it's easier to just wrap the error in the code with backoff.Permanent(error).

This also allows for Dapr to determine if a component is resilient or not via type inspection.

Example Component

Using Redis as an example, you would see changes like this:

// Removing the component retry/timeout logic (triggered by property field "daprResiliencyPolicyDefined").
func (r *StateStore) modifyComponentBehaviorForResiliency(metadata map[string]string) error {
    if _, ok := metadata[maxRetries]; ok {
        metadata[maxRetries] = "0"
    }

    if _, ok := metadata[maxRetryBackoff]; ok {
        metadata[maxRetryBackoff] = "-1"
    }

    if _, ok := metadata[readTimeout]; ok {
        metadata[readTimeout] = "-1"
    }

    if _, ok := metadata[writeTimeout]; ok {
        metadata[writeTimeout] = "-1"
    }

    // This is an alias for maxRetries, but it has to be checked as well.
    if _, ok := metadata[redisMaxRetries]; ok {
        metadata[redisMaxRetries] = "0"
    }

    return nil
}

// Example of returning a permanent retry.
func (r *StateStore) setValue(req *state.SetRequest) error {
    err := state.CheckRequestOptions(req.Options)
    if err != nil {
        return err
    }
    ver, err := r.parseETag(req)
    if err != nil {
        return err
    }
    ttl, err := r.parseTTL(req)
    if err != nil {
                 // We will always fail to parse a malformed request.
        return backoff.Permanent(fmt.Errorf("failed to parse ttl from metadata: %s", err))
    }
        // Remainder of function excluded.
}

Example Runtime Init

props := a.convertMetadataItemsToProperties(s.Spec.Metadata)

// Only tell the component to turn off resiliency if a policy is defined for this component.
compType := reflect.TypeOf(store)
if a.resiliency.PolicyDefined(s.Name, resiliency.Component) && compType.Implements(reflect.TypeOf((*Resilient)(nil)).Elem()) {
    props["daprResiliencyPolicyDefined"] = "true"
}

err = store.Init(state.Metadata{
    Properties: props,
})
halspang commented 2 years ago

@berndverst - Any thoughts on this?

berndverst commented 2 years ago

LGTM

Agree that in the future implementation of Resiliency should be a requirement for stable component certification.

So every error not explicitly returned as backoff.Permanent is considered retriable? Sounds good, we will need to evaluate what errors are really permanent and which ones aren't. Could be handy for Azure components for example to create a utility which parses the internal status code and returns backoff.Permanent if the internal code was certain 4XX status codes.

Also as discussed initResiliency(metadata map[string]string) error is a terrible name. It makes more sense to call this modifyComponentBehaviorForResiliency or something like that. The purpose is to disable / tweak component specific retries that are controlled via component metadata so that the resiliency policy can work its magic.

halspang commented 2 years ago

So every error not explicitly returned as backoff.Permanent is considered retriable? Sounds good, we will need to evaluate what errors are really permanent and which ones aren't. Could be handy for Azure components for example to create a utility which parses the internal status code and returns backoff.Permanent if the internal code was certain 4XX status codes.

A lot of this proposal is about forcing people to think about the individual component and what it means to be retryable. I think for components like Azure (and probably GCP/AWS) where they have a commonality in the SDKs will take to a common utility file well.

Also as discussed initResiliency(metadata map[string]string) error is a terrible name. It makes more sense to call this modifyComponentBehaviorForResiliency or something like that. The purpose is to disable / tweak component specific retries that are controlled via component metadata so that the resiliency policy can work its magic.

Yeah, I'm not sure what my initial thought was on the naming there. Updated it in the proposal.

dapr-bot commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot commented 2 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

berndverst commented 11 months ago

This will need design work for runtime, and then refactorings of every single component which is a lot of work - similar to the effort to do stable certification of components.