Complainer

Complainer's job is to send notifications to different services when tasks fail on a Mesos cluster. While your system should be resilient to failures of individual tasks, it's nice to know when things fail and why.

Supported log upload services: no-op, S3 AWS, and S3 compatible APIs.

Supported reporting services: Sentry, Hipchat, Slack, Jira, and File.

Quick start

Start sending all failures to Sentry:

docker run -it --rm cloudflare/complainer \
  -masters=http://mesos.master:5050 \
  -uploader=noop \
  -reporters=sentry \
  -sentry.dsn=https://foo:bar@sentry.dsn.here/8

Run this on Mesos itself!

Reporting configuration

Complainer needs two command line flags to configure itself:

These settings can be applied by env vars as well:

Filtering based on the failure's framework

If you run multiple Marathon instances against one Mesos cluster and want to control which failures go where, the following options are of interest. Each option can be specified multiple times.

Note that the order of evaluation is such that blacklists are applied first, then whitelists.
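The evaluation order can be pictured with a small sketch. This is not Complainer's actual code: the function name `allowed`, the exact-match comparison, and the rule that an empty whitelist permits everything are assumptions made for illustration. Only the ordering (blacklist first, then whitelist) comes from the text above.

```go
package main

import "fmt"

// allowed reports whether a failure from the given framework should be
// reported. The blacklist is consulted first, then the whitelist.
// Assumptions of this sketch: names are compared exactly, and an empty
// whitelist lets every non-blacklisted framework through.
func allowed(framework string, blacklist, whitelist []string) bool {
	for _, b := range blacklist {
		if b == framework {
			return false // blacklisted frameworks never pass
		}
	}
	if len(whitelist) == 0 {
		return true
	}
	for _, w := range whitelist {
		if w == framework {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical framework names.
	blacklist := []string{"marathon-staging"}
	whitelist := []string{"marathon-prod", "marathon-staging"}
	for _, fw := range []string{"marathon-prod", "marathon-staging", "marathon-dev"} {
		fmt.Printf("%s reported: %v\n", fw, allowed(fw, blacklist, whitelist))
	}
}
```

Because the blacklist is checked first, a framework listed in both places is rejected in this sketch.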

HTTP interface

Complainer provides an HTTP interface. Enable it with the -listen command line flag or the COMPLAINER_LISTEN environment variable.

This interface is used for the following:

Health checks

The /health endpoint reports 200 OK when things are operating mostly normally and 500 Internal Server Error when Complainer cannot talk to Mesos.

Other issues (uploader and reporter failures) are not checked: uploads and reports are not guaranteed to happen continuously, so a failed state could not reliably recover on its own.

version endpoint

The /version endpoint reports 200 OK and outputs the current version of this application.

complainer (default) v1.7.0

pprof endpoint

/debug/pprof endpoint exposes the regular net/http/pprof interface:

Log upload services

The log upload service is specified by the uploader command line flag or, alternatively, by the COMPLAINER_UPLOADER environment variable. Only one uploader can be specified per Complainer instance.

no-op

Uploader name: noop

The no-op uploader just echoes Mesos slave sandbox URLs.

S3 AWS

Uploader name: s3aws

This uploader uses the official AWS SDK and is the one to use if you run on AWS.

Stdout and stderr logs are uploaded to S3, and signed URLs are provided to reporters. Logs are uploaded into the following directory structure by default:

Command line flags:

You can set the value of any command line flag via an environment variable. Example:

Flags override env variables if both are supplied.
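The precedence rule can be sketched with Go's standard flag package by using the environment variable as the flag's default value. This is an illustration, not Complainer's own flag handling: `resolveUploader` and its `getenv` parameter are hypothetical, and only the uploader flag and the COMPLAINER_UPLOADER variable are taken from this document.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// resolveUploader returns the uploader name for the given arguments.
// The environment variable supplies the default, so an explicit
// -uploader flag overrides it. getenv is injected to keep the sketch
// testable; os.Getenv is the natural choice in real use.
func resolveUploader(args []string, getenv func(string) string) (string, error) {
	fs := flag.NewFlagSet("complainer-sketch", flag.ContinueOnError)
	uploader := fs.String("uploader", getenv("COMPLAINER_UPLOADER"), "log upload service")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *uploader, nil
}

func main() {
	name, err := resolveUploader(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Println("bad arguments:", err)
		return
	}
	fmt.Printf("uploader: %q\n", name)
}
```

Run with COMPLAINER_UPLOADER=s3aws in the environment and -uploader=noop on the command line, this sketch resolves to noop.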

The minimum AWS policy for complainer is s3:PutObject:

S3 Compatible APIs

Uploader name: s3goamz

This uploader uses the goamz package and supports S3 compatible APIs that use v2 style signatures. This includes Ceph Rados Gateway.

Stdout and stderr logs are uploaded to S3, and signed URLs are provided to reporters. Logs are uploaded into the following directory structure by default:

You can set the value of any command line flag via an environment variable. Example:

Flags override env variables if both are supplied.

Reporting services

Reporting services are specified by the reporters command line flag or, alternatively, by the COMPLAINER_REPORTERS environment variable. Several services can be specified, separated by commas.

Sentry

Command line flags:

Labels:

If a label is unspecified, the command line flag value is used.

Hipchat

Command line flags:

Labels:

If a label is unspecified, the command line flag value is used.

Templates are based on text/template. The following fields are available:

Slack

Command line flags:

Labels:

If a label is unspecified, the command line flag value is used.

For more details see Slack API docs.

Templates are based on text/template. The following fields are available:

Jira

Command line flags:

Example jira.fields:

Project:COMPLAINER;Issue Type:Bug;Summary:Task {{ .failure.Name }} died with status {{ .failure.State }};Description:[stdout|{{ .stdoutURL }}], [stderr|{{ .stderrURL }}], ID={{ .failure.ID }}
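The example reads as a list of name:value pairs separated by semicolons. The sketch below shows one way such a string could be split. It is an assumption about the format, not Complainer's parser: `field` and `parseFields` are hypothetical names, and the rule of splitting each pair at its first colon is a guess based on the example alone.

```go
package main

import (
	"fmt"
	"strings"
)

// field is one name/value pair from a jira.fields style string.
type field struct {
	Name, Value string
}

// parseFields splits spec on ";" into pairs, then splits each pair at
// its first ":". Values are kept verbatim so that template actions
// such as {{ .failure.Name }} survive for later rendering.
func parseFields(spec string) ([]field, error) {
	var fields []field
	for _, part := range strings.Split(spec, ";") {
		name, value, ok := strings.Cut(part, ":")
		if !ok {
			return nil, fmt.Errorf("field %q has no name:value separator", part)
		}
		fields = append(fields, field{Name: strings.TrimSpace(name), Value: value})
	}
	return fields, nil
}

func main() {
	spec := "Project:COMPLAINER;Issue Type:Bug;Summary:Task {{ .failure.Name }} died with status {{ .failure.State }}"
	fields, err := parseFields(spec)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, f := range fields {
		fmt.Printf("%s = %s\n", f.Name, f.Value)
	}
}
```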

Templates are based on text/template. The following fields are available:

File

Command line flags:

Templates are based on text/template. The following fields are available:

Label configuration

Basics

To support a flexible notification system, Complainer uses Mesos task labels. Marathon task labels are copied to Mesos labels, so the two are equivalent.

The minimal set of labels needed is an empty set. You can configure default values in Complainer's command line flags and get all notifications with these settings. In practice, you might want to have different reporters for different apps.

The full format of a Complainer label name looks like this:

Example (dsn set for default Sentry of default Complainer):

This is long and complex, so default parts can be skipped:

Advanced labels

The long form of the label name exists to add flexibility. Imagine you want to report app failures to the internal Sentry, two internal Hipchat rooms (default and project-specific), and the external Sentry.

The set of labels would look like this:

Internal and external complainers can have different upload services.

Implicit instances differ depending on how you run Complainer.

The latter is useful for opt-in monitoring, including monitoring of Complainer itself (also known as dogfooding).

Templating

Templates are based on text/template. The following fields are available:

With config you can use labels in templates. For example, the following template for the Slack reporter:

Task {{ .failure.Name }} ({{ .failure.ID }}) died | {{ config "mentions" }}{{ .nl }}

With the label complainer_slack_mentions=@devs, this evaluates to:

Task foo.bar (bar.foo.123) died | @devs
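The evaluation above can be reproduced with Go's text/template package. The sketch below is not Complainer's implementation: the `Failure` struct, the `render` helper, and the way `config` looks up a short key in a plain map are assumptions for illustration. The real reporter resolves the value from the task's labels.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Failure is a hypothetical stand-in for the data behind .failure.
type Failure struct {
	Name, ID, State string
}

// render executes tmplText with .failure and .nl available as data and
// a config function that returns the label value for a short key.
func render(tmplText string, failure Failure, labels map[string]string) (string, error) {
	funcs := template.FuncMap{
		"config": func(key string) string { return labels[key] },
	}
	tmpl, err := template.New("message").Funcs(funcs).Parse(tmplText)
	if err != nil {
		return "", err
	}
	data := map[string]interface{}{"failure": failure, "nl": "\n"}
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	text := `Task {{ .failure.Name }} ({{ .failure.ID }}) died | {{ config "mentions" }}{{ .nl }}`
	// Stands in for the label complainer_slack_mentions=@devs.
	labels := map[string]string{"mentions": "@devs"}
	msg, err := render(text, Failure{Name: "foo.bar", ID: "bar.foo.123"}, labels)
	if err != nil {
		fmt.Println("template error:", err)
		return
	}
	fmt.Print(msg) // prints: Task foo.bar (bar.foo.123) died | @devs
}
```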

Dogfooding

To report errors for complainer itself you need to run two instances:

You'll need the following labels for the default instance:

labels:
  complainer_dogfood_sentry_instances: default
  complainer_dogfood_hipchat_instances: default

For the dogfood instance you'll need to:

Since the dogfood Complainer ignores apps that have no instances configured for it, it ignores every failure except those of the default instance.

If the dogfood instance fails, default reports it just like any other task.

If both instances fail at the same time, you get nothing.

Copyright

License

MIT