elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.63k stars 24.64k forks source link

[Transform] Reduce indexes to query based on checkpoints #77329

Open hendrikmuhs opened 3 years ago

hendrikmuhs commented 3 years ago

75839 introduced an optimization to reduce the index calls to only necessary indexes. However the approach has a flaw with respect to the delay configuration. See

https://github.com/elastic/elasticsearch/pull/77204#issuecomment-914066744

The change has been "soft"-reverted, meaning the optimization is turned off in TransformIndexer.buildQueryToFindChanges:

-TransformCheckpoint.getChangedIndices(getLastCheckpoint(), getNextCheckpoint()).toArray(new String[0])
+TransformCheckpoint.getChangedIndices(TransformCheckpoint.EMPTY, getNextCheckpoint()).toArray(new String[0])

The extra optimization is latest got reverted, too. See LatestChangeCollector.

To re-enable the optimization TransformCheckpoint.getChangedIndices must take the timeUpperBoundMillis and delay into account. We probably need not only the last checkpoint, but data from previous checkpoints.

It might make sense to this together with https://github.com/elastic/elasticsearch/pull/75839#issuecomment-905542882.

However the underlying root cause is the disconnection between sequence id's and the timestamp field for synchronizing source and dest. If we can base change detection fully on sequence id's, we can re-enable the optimization and do not require extra code. Therefore it might be better to prioritize the switch to sequence id's instead of trying to make this optimization work with the current hybrid.

elasticmachine commented 3 years ago

Pinging @elastic/ml-core (Team:ML)