artursouza commented 4 years ago

Longhaul and Chaos Testing

Introduction

Dapr is thriving towards a stable version. It is crucial for Dapr to be problem reliable in an application that runs 24/7. Before a real application can be deployed, this confidence can be achieved by building, deploying and operating such application in a controlled chaotic environment.

Test Application

The test application will simulate messages posted in a social network to be scored via sentiment analysis. No external dependency is taken for greater control of the environment. Some components could be removed, and the same result achieved. On the other hand, this design is purposely exercising all Dapr’s building blocks. It is recommended that all components in this app to be implemented using the same repository and programming language for speedy development. Since this application makes use of the Actor feature as well, it can be written in .Net or Java. Given the current project maintainers are more familiar with C#, .Net SDK with C# should be used. The repository should be separate from existing ones. It is recommended a new repository to be created, named “longhaul-tests”.

Feed Stream Generator

Generates artificial social network message posts, such as: “Dapr is great. #DaprRocks #Kubernetes”. These messages will be auto-generated in a predefined template: “ is . <Hashtag 1> <Hashtag 2>” The list of nouns and adjectives is predefined and randomly picked. Same for list of hashtags. The message gets a randomly generated messageId and correlationId using UUID generator and published using Dapr’s PubSub API in the following format:

{
  "correlationId": "<UUID>",
  "messageId": "<UUID>",
  "message": "<message>",
  "creationDate": "<creationDate>"
}

Message Analyzer

This component subscribed to a topic via Dap’s PubSub feature, looks up the map of adjectives to sentiment type (positive, neutral, negative) and uses the identified type (or unknown, if not found) and append that content to the message. Finally publishes the new labeled payload via Dapr’s output binding API. The labeled payload is in the following format:

{
  "correlationId": "<UUID>",
  "messageId": "<UUID>",
  "message": "<message>",
  "sentiment": "<sentiment type>",
  "creationDate": "<creationDate>"
}

Hashtag Counter

This component will receive a message via Dapr’s input binding call. The hashtags are extracted from the message. For each hashtag identified, it makes an Actor method call: method increment(sentiment) in actor instance identified as HashtagActor.< HashTag >.

Hashtag Actor Service

This component is useful to exercise the Actor feature in Dapr. It registers the HashtagActor type, where the hashtag is the identifier. This actor has one method: increment(String sentiment), whose goal is to keep a counter per hashtag-sentiment combination. The sentiment passed in the state key and the state value is the previous value (zero if not found) incremented by 1.

Hashtag Snapshot Service

This component will exercise Dapr’s state APIs (not in the context of an Actor). It wakes up every minute and retrieves all keys from Redis state store - not using Dapr’s state APIs, since Dapr does not offer an API to query a range of states from another Dapr application’s state store. Is is expected that only a few dozen keys to be present since the list of hashtags are predefined in this component. Now key-value pairs are generated for all states and saved via Dapr’s state store API. This service also offers an API to retrieve all keys via a GET method.

Validation Worker

This component will do health checks on the application’s results. The validation must be fuzzy given the eventual consistency and failures artificially injected. The worker should perform the following validation:

Wake up every 5 minutes.
Fetch all key-value pairs by invoking API on Hashtag Snapshot Service.
Sleep for 2 minutes.
Fetch all key-value pairs by invoking API on Hashtag Snapshot Service.
Calculate ratio of how many counters have changed.
Emit a metric as JSON to stdout: { "longhaul-counters-changeratio": "<ratio>"}

Dashboard Web App

This is a simple web page that will make an API call to Hashtag Snapshot Service, displaying all key-value pairs. This is useful for manual validation. Optionally, this component can also validate OAuth feature via Dapr’s middleware.

Failure Daemon

Last but not least, this service will trigger failures given a fixed configuration. The failure types and the specific failure configuration are described later in this document.

Platform, Logs and Metrics

The longhaul test app will be deployed using an AKS cluster with at least 1 node on each of the 3 AZs. Since the goal is to test resiliency and not performance and the traffic is artificially generated, a cheap hardware type should be enough, like Standard DS2 v2 (2 vcpus, 7 GiB memory), for example. Logs and metrics are forwarded to Azure Monitor and can be queries as structured data via JSON.

Failure Types

To simulate a chaotic environment, some artificial failures will be injected. Restarts can be achieved by scaling service from 3 to 0 and then from 0 to 3. When a single POD is expected (placement service, for example), the rescaling should be from/to 1 instead.

Application container crash

To simulate an app that crashes (process exits), any container is this system will be restarted in an interval. It is important to notice that Dapr’s sidecar is expected to continue to run. It is expected that the container will be restarted gracefully and Dapr’s sidecar will restore communication with the application without manual intervention.

POD crash

To simulate a situation for a given POD is unhealthy, a service’s POD in the system will be restarted in an interval. This is a partial failure, which means the service should continue to operate while a new POD is restored by Kubernetes. It is expected that Kubernetes restores the service to a healthy state again and Dapr sidecar from other services will be able to communicate to all PODs in the restored service.

Service crash

This failure simulates a complete outage of a service by restarting all PODs of a service. This will cause the Validation Worker to potentially identify a complete outage. It is expected that Kubernetes restores the service to a healthy state again and Dapr sidecar from other services will be able to communicate to all PODs in the restored service.

State store outage

State stores can be down for any reason. To simulate that, all PODs for Redis will be restarted at an interval.

State store slowness

State stores might have its performance degraded by a busy neighbor or other external factors. This is simulated by making write operations to Redis at X tps for Y seconds at an internal. Some slowness is expected in the data processing but recovered after burst is over.

Topic outage

Topics can be down for any reason. This will be simulated by restarting all PODs for Kafka at an interval.

Topic slowness

Topics can have its throughput reduced because of another topic that is collocated and receiving a traffic spike. Slowness can also be caused by other external factors. To simulate this, a random topic ios created with replication set to 3 (guarantee all nodes have a copy of the data) and traffic is maintained at X tps for a duration of Y seconds at an interval. Some slowness is expected in the data processing but recovered after burst is over.

Dapr’s sidecar injector crash

After simulating this failure with the steps below, the data processing should continue and all PODs should have Dapr sidecar.

Scale a service from 3 to 0.
Wait for service to be at 0.
Restart Dapr’s sidecar injector.
Scale service from 0 back to 3.

Dapr’s placement service crash

This is simulated by restarting placement service at an interval.

Dapr’s sentry service crash

This is simulated by restarting sentry service at an interval.

Actor instantiation burst

Some applications might create many actors in a small amount of time. This burst will be simulated by creating actors of a random type and activating it at a fixed rate of X tps for a durantion D at an interval. The flooded actor type must be different than the actor type used in the app but it should also be registered by Hashtag Actor Service to make sure that service gets traffic load. Some slowness is expected in the data processing but recovered after burst is over.

Failure Configuration

The Failure Daemon will be configured to execute the following pattern for 1 hours every other hour (i.e., one hour active and 1 hour idle).

Feed Stream Generator’s container crash every 2 minutes.
Message Analyzer’s container crash every 3 minutes.
Hashtag Counter’s container crash every 4 minutes.
Hashtag Actor Service’s container crash every 5 minutes.
Hashtag Counter’s POD crash every 9 minutes.
Hashtag Actor Service’s POD crash every 10 minutes.
Message Analyzer’s service crash every 7 minutes.
State store outage every 25 minutes.
State store slowness for 1 minute every 29 minutes (tps to be defined during implementation).
Topic outage every 21 minutes.
Topic slowness for 1 minute every 23 minutes.
Dapr’s sidecar injector crash with Hashtag Snapshot Service every 13 minutes.
Dapr’s placement service crash with every 5 minutes.
Dapr’s sentry service crash with every 7 minutes.
Actor’s instantiation burst for 1 minute every 10 minutes (tps to be defined during implementation).

In case all the failures above do not prove practical together in real world, the Failure Daemon can randomly choose a subset of the failure configurations above (5, for example) and execute only those in a given run.

Test Validation

The test validation happens via alerts on monitors in Azure Monitor that trigger sev3s. The following monitors will be configured and should always remain healthy:

Data processing

Validation worker’s change ratio metric should never be zero for two consecutive data points. This metric is emitted by Validation Worker.

Message Analyzer delay

Message analyzer must publish a metric for delay since message creation. No message should be older than 2 minutes. This metric is emitted by Message Analyzer.

Hashtag Counter delay

Hashtag counter must publish a metric for delay since message creation. No message should be older than 4 minutes. This metric is emitted by Hashtag Counter.

Stale snapshot

Even though Hashtag Snapshot Service is working, the last snapshot might be too old. Hashtag Snapshot Service should publish a metric on delay since last successful run. The delay should never be greater than 5 minutes. This metric can be emitted by Hashtag Snapshot Service.

Service health

Complete outages can be detected with other alarms. To detect partial failures, no service can have less than 3 healthy PODs for more than 50 minutes. This metric can be emitted by Failure Daemon.

Generic error count spike

Alert on spike of error count. The exact values will be determined during implementation.

No errors

The error count should not be greater than zero for more than 70 minutes (i.e., 10 minutes into the healthy hour).

Haishi2016 commented 4 years ago

Chaos tests are often used to test large distributed systems to ensure the overall system behavior and state are not broken under various error conditions. The test here is more relevant to the application itself than the individual components. I doubt if we could get the returns we expected by running such chaos tests because our main interest is not the application itself but Dapr sidecars, which I think can be tested with more controlled, regular test cases with fault injections.

artursouza commented 4 years ago

The goal is to test how an application using dapr would work using Dapr. You are right that this test is targeting the app - this is intentional. This is a customer focused approach, where the successful behavior of the test is defined by the correct behavior of the app. We are reproducing a scenario before our customers - this is the value proposition here.

Haishi2016 commented 4 years ago

My point is validating such a fictional application doesn't bring us as much as value as we would expect. In the future when the user has a complex application that behaves strangely because of some unclear inter-dependencies, our chaos test will not help them in any way. Dapr sidecars don't have inter-dependencies. For the sidecar reliability itself, we can test with much simpler test settings. I can't think of a single Dapr sidecar problem that can't be validated with regular test cases without the chaos tests.

msfussell commented 3 years ago

dapr / test-infra