Informatievlaanderen / registry-documentation

All documentation related to the base registries.
https://informatievlaanderen.github.io/registry-documentation/

Datadog Log Management #5

Open CumpsD opened 3 years ago

CumpsD commented 3 years ago

Logging

Context

We use Datadog for our monitoring requirements, one of which is log management. Our philosophy during development has been to send as much information as possible to Datadog and deal with it there. This means we forward our CloudWatch Fargate logs to Datadog, as well as our Lambda logs and the logs of all available AWS services (API Gateway, S3, ...).

Since the Datadog logging UI has a really good search, it has served us very well during development for troubleshooting issues. Over time, we started using more of Datadog's features to keep the logging under control. Most of this configuration is concentrated around Pipelines and Indexes.

Pipelines preprocess incoming logs and reshape or remap them. For example, we map the WARN or ERROR status of known functional errors (which are not technical errors) to an OK state, and we map fields to standard field names so the Datadog UI can show richer information. This is fairly cheap at $0.10 per GB of ingested logs.
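To illustrate the kind of remapping described above, here is a minimal sketch in plain Python. It is not Datadog's pipeline configuration, only an imitation of its effect on a single log event. The error codes and the source field names (`RequestPath`, `StatusCode`) are hypothetical; `http.url` and `http.status_code` are used here as examples of standard attribute names.

```python
# Illustrative only: imitates a pipeline remapping step on one log event.
# The error codes and source field names below are hypothetical.
KNOWN_FUNCTIONAL_ERRORS = {"AddressNotFound", "ValidationFailed"}

# Hypothetical application field names mapped to standard attribute names.
FIELD_ALIASES = {
    "RequestPath": "http.url",
    "StatusCode": "http.status_code",
}


def remap_event(event: dict) -> dict:
    """Return a remapped copy of a log event; the input is left untouched."""
    # Rename known fields, keep every other field under its original name.
    remapped = {FIELD_ALIASES.get(key, key): value for key, value in event.items()}

    # Downgrade WARN/ERROR to OK when the error is a known functional one.
    is_noisy_status = remapped.get("status") in {"WARN", "ERROR"}
    is_functional = remapped.get("error_code") in KNOWN_FUNCTIONAL_ERRORS
    if is_noisy_status and is_functional:
        remapped["status"] = "OK"
    return remapped
```

A technical error, meaning one whose code is not in the known set, keeps its original status, so it still stands out in the search UI.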

Indexes, on the other hand, determine which logs are made available in the search and kept for 15 days. This is the more expensive part at $1.70 per million log events.

To keep the indexes from storing all available logs, we use exclusion filters to get rid of logs that are of no interest to us. This is an ongoing task: we regularly examine the search UI to determine whether more can be excluded.
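As a rough sketch of what one such exclusion rule could look like, the helper below builds a single exclusion filter entry as a Python dictionary. The field names (`name`, `is_enabled`, `filter.query`, `filter.sample_rate`) follow the shape of the Datadog Logs Indexes API as I understand it and should be checked against the current API reference. The filter name and query are hypothetical, and nothing here is sent to Datadog.

```python
import json


def build_exclusion_filter(name: str, query: str, sample_rate: float = 1.0) -> dict:
    """Build one exclusion filter entry for an index definition.

    sample_rate is the fraction of matching logs to exclude from the index;
    1.0 is assumed to mean "exclude every matching log".
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    return {
        "name": name,
        "is_enabled": True,
        "filter": {"query": query, "sample_rate": sample_rate},
    }


# Hypothetical rule: drop all successful request logs of one service.
rule = build_exclusion_filter(
    name="drop-successful-requests",
    query="service:example-api status:info",
)
print(json.dumps(rule, indent=2))
```

Keeping each rule as a small named entry makes it easier to review which logs are being dropped and to disable a single rule if it turns out to hide something useful.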

We follow this approach because logging is by definition unpredictable. It is of no use to exclude everything and then include only what interests us, because we do not always know in advance what that is. Doing so would cause us to miss log events that are interesting but that we did not think of.

Problem

Over the last month, usage of our product has taken off and the number of requests has risen rapidly. Because we log all incoming requests, this has caused an influx of log events. Some of our clients have made 12 million requests in just a few days, causing 90 million log events and making our Datadog bill rise sharply.
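The figures above can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the numbers quoted in this issue: the indexing price of $1.70 per million log events, 12 million requests, and 90 million log events. Ingestion cost is left out because the volume in GB is not known.

```python
# Back-of-the-envelope estimate using only figures quoted in this issue.
INDEX_PRICE_PER_MILLION_EVENTS = 1.70  # USD, indexing price from the Context section


def indexing_cost(log_events: int) -> float:
    """Indexing cost in USD for a given number of log events."""
    return log_events / 1_000_000 * INDEX_PRICE_PER_MILLION_EVENTS


requests = 12_000_000
log_events = 90_000_000

events_per_request = log_events / requests  # 7.5 log events per request
cost = indexing_cost(log_events)            # roughly 153 USD

print(f"{events_per_request:.1f} log events per request")
print(f"about ${cost:.2f} to index {log_events:,} events")
```

Since the cost scales linearly with the number of indexed events, every log line excluded per request reduces the indexing bill proportionally.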

As an emergency measure to save costs, indexing has been disabled. The next step is to define additional exclusion rules that filter many more log events out of the index and keep costs under control.

Additionally, we will use Log Archives on S3 to store logs in case they are needed for troubleshooting.

Progress