dapr / components-contrib

Community driven, reusable components for distributed apps
Apache License 2.0
533 stars 466 forks source link

Standardize error messages in Dapr #1916

Open mukundansundar opened 1 year ago

mukundansundar commented 1 year ago

Describe the proposal

For different building blocks the errors during init phase are surfaced in different ways. The following were the error logs when tested with component files with no required metadata for proper initialization of the component.

Note: The line error processing component, daprd process will exit gracefully is a log line that is there for all init failures across building blocks.

State Store

## CosmosDB

WARN[0000] error initializing state store statestore (state.azure.cosmosdb/v1): url is required  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0
WARN[0000] error processing component, daprd process will exit gracefully  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0
## Cassandra

WARN[0000] error initializing state store statestore (state.cassandra/v1): missing or empty hosts field from metadata  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0
WARN[0000] error processing component, daprd process will exit gracefully  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0

Note: For most of the building blocks the part error initializing state store statestore (state.cassandra/v1): differs by building block and the phrase after : is the one that is returned from the component code in components-contrib.

Pubsub

## Hazelcast

WARN[0000] error initializing pub sub pubsub.hazelcast/v1: hazelcast error: missing hazelcast servers  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0
WARN[0000] error processing component, daprd process will exit gracefully  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0
## Pulsar

WARN[0000] error initializing pub sub pubsub.pulsar/v1: pulsar error: missing pulsar host  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0
WARN[0000] error processing component, daprd process will exit gracefully  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0

Bindings

## AZ Blobstorage binding 

WARN[0005] error processing component, daprd process will exit gracefully  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0

Note: For all bindings errors, the log similar to error initializing state store statestore (state.cassandra/v1): missing or empty hosts field from metadata is missing. Only the generic component processing error is seen.

Nameresolution

## Consul 
ERRO[0000] failed to initialize name resolution resolver consul: failed to register consul service: Put "http://127.0.0.1:8500/v1/agent/service/register": dial tcp 127.0.0.1:8500: connect: connection refused  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0
WARN[0000] failed to init name resolution: failed to register consul service: Put "http://127.0.0.1:8500/v1/agent/service/register": dial tcp 127.0.0.1:8500: connect: connection refused  app_id=test-error instance=Mukundans-MacBook-Pro.local scope=dapr.runtime type=log ver=1.8.0

Note: For Consul, its a bit different when required config parameters is not present or the service is not running for the init code to connect, we see a duplication of log messages, one at ERROR level and one at WARNING level.

These are the error messages that we see in the different components across different building blocks in components-contrib during the init phase.

For Publish errors, sometimes there are cases where we return a formatted error in the component code and in some cases it is just returned directly to be processed by runtime code.

For Statestore, sometimes there sepcific error wrap scenarios as seen here, but for most cases, the errors are formatted as wanted by the component and sometimes the errors are returned directly as received from the client.

In some cases what we are seeing is that:

  1. There incomplete standardization wrt logging and error messages during Init and shutdown.
  2. There is no standardization across all components/building blocks on what the error message should contain (except the statestore Etag and BulkDelete are formatted consistently)
  3. For built in retries, for some components, there are warning messages printed on a retry and for some that is not the case. And also on various components the log levels for these messages are completely different.

The issue here is further complicated when in the runtime, when gRPC API is used, the error is formatted and returned to the user in adifferent form whereas when HTTP is used, it is translated to HTTP status codes based responses with the error code (strings like ERR_CONFIGURATION_STORE_NOT_CONFIGURED) and message forming the part of the HTTP body.

The proposal in this issue is to make the flow of returning errors consistent across all components with slight differences with respect to the different building blocks.

Similar to Heroku, and similar to the partial solution that we already have for error codes, can we have a complete set of error codes, short description and hooks to create full fledged error responses based on the different lifecycle/API methods that we have for each building block?

For example each building block contains an Init() and Close() lifecycle methods (mostly all), and some other methods that are called during the init lifecycle is Read(), Subscribe() etc. which then spins up a go routine to handle input bindings or pubsub subscriptions.

For these lifecycle methods, we need to clearly emit metric and log when there are errors during the init phase and return clear formatted errors when response is sent back to the application.

For other methods where the API calls are user initiated, and the user waits for a response, clear formatted error messages would be better.

Similar to the how Etag Errors defined as seen below:

// ETagError is a custom error type for etag exceptions.
type ETagError struct {
  err  error
  kind ETagErrorKind
}

We can define one for all error scenarieos like:

type DaprErrorCode string

ErrInitFailure DaprErrorCode = "D001"
ErrClose DaprErrorCode = "D100"
ErrPublishFailure DaprErrorCode = "D201"
// ... 

type DaprError struct { 
  err              error
  code             DaprErrorCode
  shortDescription string
}

Where code is specific for certain error scenarios and is internal to Dapr, this will help with debug and support. The shortDescription can be a small description on what the error is, for example ERR_CONFIGURATION_STORE_NOT_CONFIGURED. And the err can be what we get back from a component. This struct will implement the Error interface and provide consistent formatting for error messages.

Additionally, we can define utility functions either based on each API/lifecycle method, for providing generic error codes and descriptions where the user would have to just call that method with the actual error they have to be wrapped into the DaprError format.

yaron2 commented 1 year ago

For these lifecycle methods, we need to clearly emit metric and log

Not sure about how useful this is in terms of metrics. Logs make sense.

dapr-bot commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot commented 1 year ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

berndverst commented 1 year ago

We should define a Dapr error interface and then more specific Error structs which implement that interface. Then we can update our Dapr component interfaces to return the Dapr Error interface. This should help us standardize how we are returning errors.

The Dapr struct can provide its own string template for how the error should be returned. This is more consistent.

Of course this could be complemented with the error codes you outline @mukundansundar.