Open hmdhk opened 3 years ago
Pinging @elastic/ml-core (Team:ML)
There are 2 ways to think about this:
Given the stored checkpoints, it is possible to build something like this by extracting the stored sequence ID and timestamp information and crafting queries that are expected to return 0 results. However, this requires one separate query per shard.
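To make the per-shard idea concrete, here is a rough sketch of how such "expected to return 0 results" queries might be built. It assumes the checkpoint exposes one sequence number per shard and a time upper bound, that `_seq_no` can be used in a range query, and that a search can be pinned to one shard with the `_shards:<n>` preference. The function name, the input shape, and the field name `event.ingested` are hypothetical and not taken from the transform code.

```python
def missed_docs_queries(checkpoint_seq_nos, time_field, time_upper_bound):
    """Build one search request per shard that should match nothing.

    checkpoint_seq_nos: dict of shard number -> sequence number recorded
    in the checkpoint (hypothetical shape).
    A hit means an operation arrived after the checkpoint was taken
    (higher sequence number) with a timestamp below the checkpoint's
    upper bound, i.e. a candidate for a missed document.
    """
    requests = []
    for shard, seq_no in sorted(checkpoint_seq_nos.items()):
        body = {
            "size": 0,
            "track_total_hits": True,
            "query": {
                "bool": {
                    "filter": [
                        {"range": {time_field: {"lt": time_upper_bound}}},
                        {"range": {"_seq_no": {"gt": seq_no}}},
                    ]
                }
            },
        }
        # Sequence numbers are per shard, so each query targets one shard.
        requests.append({"preference": f"_shards:{shard}", "body": body})
    return requests


if __name__ == "__main__":
    for req in missed_docs_queries(
        {0: 1200, 1: 1187}, "event.ingested", "2021-01-01T00:00:00Z"
    ):
        print(req["preference"], req["body"]["query"])
```

Updates and deletes also advance sequence numbers, so a non-zero count from such a query would be a hint to investigate rather than proof of a missed document.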
However, the ingest delay might vary depending on load and deployment configuration. And if the sync delay has a value that is too low, the data transform might miss some documents that would otherwise be included in the source query.
In most cases this is a setup problem. We encourage using an ingest timestamp, which reduces the potential problem to the Lucene level. With an ingest timestamp added as the last step of an ingest pipeline, sync.time.delay can be set to a value slightly higher than the index refresh interval. It's unlikely that Lucene breaches the SLA, and if it does, the cluster is in trouble. Note that a search operation triggers a refresh if necessary: the search blocks until the refresh has happened and then operates on the fresh data.
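As an illustration of that setup, the sketch below builds the two request bodies involved: an ingest pipeline whose final processor stamps the document with `_ingest.timestamp`, and a transform `sync` section that uses that field with a delay just above the refresh interval. The field name `event.ingested`, the helper names, and the one-second margin are assumptions made for the example, not recommendations from this thread.

```python
import json


def ingest_timestamp_pipeline(field="event.ingested"):
    """Pipeline body whose last processor records the ingest time."""
    return {
        "description": "Add an ingest timestamp as the final step",
        "processors": [
            # Any other processors would go before this one, so the
            # timestamp reflects the end of pipeline processing.
            {"set": {"field": field, "value": "{{{_ingest.timestamp}}}"}},
        ],
    }


def transform_sync(field="event.ingested", refresh_interval_s=1, margin_s=1):
    """Transform `sync` section with a delay slightly above the refresh interval."""
    delay_s = refresh_interval_s + margin_s
    return {"time": {"field": field, "delay": f"{delay_s}s"}}


if __name__ == "__main__":
    print(json.dumps(ingest_timestamp_pipeline(), indent=2))
    print(json.dumps({"sync": transform_sync()}, indent=2))
```

With a refresh interval of 1s this yields a delay of "2s"; the right margin depends on the cluster and should be chosen with the actual refresh settings in mind.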
Ingest timestamps avoid issues with clock skew, queuing problems, etc.
When using an ingest timestamp with a date_histogram as the only way to pivot, we encourage using at least version 7.11 (see #63315); otherwise there is no limitation on using an ingest timestamp.
Last but not least: an ingest timestamp allows you to lower sync.time.delay and therefore makes new data available faster.
Related: #1242
Currently the data transform allows the configuration of sync.time.delay to account for the document ingest delay. However, the ingest delay might vary depending on load and deployment configuration. And if the sync delay has a value that is too low, the data transform might miss some documents that would otherwise be included in the source query. At the moment there is no visibility into whether any documents were missed due to the delay configuration. Having this visibility would help with managing capacity as well as with setting the sync delay correctly.
cc @hendrikmuhs @benwtrent