[Transform] Provide visibility into missed documents due to a low value for sync delay

hmdhk commented 3 years ago

Currently the data transform allows sync.time.delay to be configured to account for document ingest delay. However, the ingest delay can vary depending on load and deployment configuration. If the sync delay is set too low, the data transform might miss documents that would otherwise be included in the source query.

At the moment there is no way to see whether any documents were missed because of the delay configuration. Having this visibility would help with managing capacity as well as with setting the sync delay correctly.

cc @hendrikmuhs @benwtrent

elasticmachine commented 3 years ago

Pinging @elastic/ml-core (Team:ML)

hendrikmuhs commented 3 years ago

There are 2 ways to think about this:

A (retrospective) API to detect if data has been missed

Given the stored checkpoints, it is possible to build something like this by extracting the stored sequence ID and timestamp information and crafting queries that are expected to return 0 results. However, this requires one separate query per shard.
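To illustrate what such per-shard queries might look like, here is a minimal sketch in Python. It is not an existing API. It assumes the checkpoint gives you an upper time bound (epoch milliseconds) and one sequence number per shard, that the time field name is yours to supply, and that each request is pinned to a shard with the search `preference` parameter. The function name and the shape of the returned dictionaries are hypothetical.

```python
from typing import Any, Dict, List


def build_missed_doc_queries(
    time_field: str,
    checkpoint_upper_bound_ms: int,
    shard_seq_nos: Dict[int, int],
) -> List[Dict[str, Any]]:
    """Build one search request per shard that should match 0 documents.

    Assumption: a document whose timestamp lies before the checkpoint's
    upper time bound, but whose _seq_no is higher than the sequence number
    recorded for its shard, arrived after the checkpoint was taken.
    """
    requests = []
    for shard, seq_no in sorted(shard_seq_nos.items()):
        requests.append(
            {
                # Pin the search to a single shard.
                "preference": f"_shards:{shard}",
                "body": {
                    "size": 0,
                    "track_total_hits": True,
                    "query": {
                        "bool": {
                            "filter": [
                                {
                                    "range": {
                                        time_field: {
                                            "lt": checkpoint_upper_bound_ms,
                                            "format": "epoch_millis",
                                        }
                                    }
                                },
                                {"range": {"_seq_no": {"gt": seq_no}}},
                            ]
                        }
                    },
                },
            }
        )
    return requests
```

A non-zero hit count for any shard would then hint at late-arriving documents. Updates to existing documents also raise the sequence number, so a real implementation would need to account for that.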

Avoid missing documents

> However, the ingest delay might vary depending on load and deployment configuration. And if the sync delay has value that is too low, the data transform might miss some documents that would otherwise be included in the source query.

In most cases this is a setup problem. We encourage using an ingest timestamp, which reduces the potential problem to the Lucene level. With an ingest timestamp added as the last step of an ingest pipeline, sync.time.delay can be set to a value slightly higher than the index refresh interval. It's unlikely that Lucene breaches the SLA, and if it does, the cluster is in trouble. Note that a search operation triggers a refresh if necessary: the search blocks until the refresh has happened and then operates on the fresh data.
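As a rough sketch of that setup, the following Python builds the two request bodies involved: an ingest pipeline whose last processor sets an ingest timestamp, and a transform `sync` section that uses that field. The field name `event.ingested`, the one-second margin, and both function names are assumptions chosen for illustration. Only the `set` processor, the `_ingest.timestamp` metadata value, and the `sync.time.field` / `sync.time.delay` settings come from Elasticsearch itself.

```python
from typing import Any, Dict


def ingest_timestamp_pipeline(field: str = "event.ingested") -> Dict[str, Any]:
    """Pipeline body whose last processor stamps the ingest time."""
    return {
        "description": "Add an ingest timestamp as the final step",
        "processors": [
            # Any other processors would go before this one.
            {"set": {"field": field, "value": "{{{_ingest.timestamp}}}"}},
        ],
    }


def transform_sync(
    field: str, refresh_interval_s: int, margin_s: int = 1
) -> Dict[str, Any]:
    """Transform sync section with a delay slightly above the refresh interval."""
    return {"time": {"field": field, "delay": f"{refresh_interval_s + margin_s}s"}}


pipeline_body = ingest_timestamp_pipeline()
sync_body = transform_sync("event.ingested", refresh_interval_s=1)
```

How large the margin should be depends on the cluster. The value here is only a placeholder.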

Ingest timestamps avoid issues with clock skew, queuing problems, etc.

If you use an ingest timestamp with a date_histogram as the only grouping in the pivot, we encourage using at least version 7.11 (see #63315). Otherwise there is no limitation on using an ingest timestamp.

Last but not least: an ingest timestamp allows you to lower sync.time.delay and therefore makes new data available faster.

Related: #1242

elastic / elasticsearch

[Transform] Provide visibility into missed documents due to a low value for sync delay #70563
