elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Transform] `latest` transform skipping some source documents #106363

Open przemekwitek opened 6 months ago

przemekwitek commented 6 months ago

Elasticsearch Version

8.13

Installed Plugins

No response

Java Version

bundled

OS Version

MacOS

Problem Description

A `latest` transform was reported to skip some source documents.

I identified 2 potential issues:

  1. When there are multiple source documents with the same @timestamp value, the latest transform only picks one of them.
  2. sync.time.delay field does not seem to influence the filter range queries issued by the latest transform.

Ad 1.: This is how we build the range query in the code:

        // We are only interested in documents that were created in the timeline of the current checkpoint.
        // Older documents cannot influence the transform results as we require the sort field values to change monotonically over time.
        return QueryBuilders.rangeQuery(synchronizationField)
            .gte(lastCheckpoint.getTimeUpperBound())
            .lt(nextCheckpoint.getTimeUpperBound())
            .format("epoch_millis");

I think that, because of the exclusive lt upper bound, documents that have the same timestamp as a document already included in a checkpoint may never get processed. The sync.time.delay setting should take care of this, but apparently it does not work in this case (see Ad 2.).
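
To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the half-open checkpoint window. It is a simplified model, not the transform implementation: the class `CheckpointWindowSketch`, the `Doc` type, the `searchableAt` field, and the assumption that each checkpoint's upper bound is "run time minus delay" are all hypothetical and exist only for illustration. It also shows only the intended effect of the delay, which per Ad 2. does not seem to be what actually happens.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointWindowSketch {

    /** Hypothetical document: its event timestamp plus the time it became searchable. */
    public static final class Doc {
        final String id;
        final long timestamp;
        final long searchableAt;

        public Doc(String id, long timestamp, long searchableAt) {
            this.id = id;
            this.timestamp = timestamp;
            this.searchableAt = searchableAt;
        }
    }

    /** Mirrors the gte/lt range from the snippet: lower <= timestamp < upper. */
    public static boolean inWindow(long timestamp, long lowerInclusive, long upperExclusive) {
        return timestamp >= lowerInclusive && timestamp < upperExclusive;
    }

    /**
     * Simulates checkpoints running at the given times. ASSUMPTION: each checkpoint's
     * upper bound is (runTime - delay) and its lower bound is the previous upper bound.
     * Returns the ids of documents that matched some checkpoint's range query.
     */
    public static List<String> processedIds(List<Doc> docs, long[] runTimes, long delay) {
        List<String> processed = new ArrayList<>();
        long lower = 0;
        for (long now : runTimes) {
            long upper = now - delay;
            for (Doc d : docs) {
                // A document can only match if it is already searchable when the query runs.
                if (d.searchableAt <= now && inWindow(d.timestamp, lower, upper)) {
                    processed.add(d.id);
                }
            }
            lower = upper;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("a", 100, 100),
            new Doc("b", 100, 160)); // same timestamp as "a", but searchable later
        long[] runTimes = {150, 250};

        // No delay: "b" becomes searchable after its window [0, 150) was already queried.
        System.out.println(processedIds(docs, runTimes, 0));
        // Delay of 60: timestamp 100 falls into the second window [90, 190).
        System.out.println(processedIds(docs, runTimes, 60));
    }
}
```

In this model, document "b" is never picked up without a delay, because the next window starts at the previous upper bound and no longer covers timestamp 100. With a sufficiently large delay both documents fall into the same later window, which is the behavior one would expect sync.time.delay to provide.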

Steps to Reproduce

This has been reproduced by the Kibana team (https://github.com/elastic/security-team/issues/8893). Now I'm working on reproducing it locally.

Logs (if relevant)

No response

elasticsearchmachine commented 6 months ago

Pinging @elastic/ml-core (Team:ML)

syepes commented 3 months ago

Are there currently any workarounds or settings that could be adjusted? In our use case, records between checkpoints / executions must never be skipped.

syepes commented 4 weeks ago

Any news or version ETA on this issue?