Closed by artem-shelkovnikov 1 week ago
Additional point: we need to check how to work around the ingest pipeline failures and mark documents as "dirty".
@seanstory told me about problems when the connector framework is used on a small cluster with ELSER and the ingest pipeline is starved: we ingest documents, but they are actually "incomplete".
We need to find out what to do with these "incomplete" documents and how we can effectively re-ingest them again.
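One possible way to find such "incomplete" documents is to query for documents where the pipeline never wrote its output. This is a hedged sketch, not the framework's API: the field name `ml.inference` is an assumption about what the ELSER pipeline would have added.

```python
# Hypothetical sketch: build an Elasticsearch query matching documents that
# were ingested but are missing the pipeline's output field, so they can be
# re-ingested. The field name "ml.inference" is an assumption.

def incomplete_docs_query(inference_field: str = "ml.inference") -> dict:
    """Query for docs where the ingest pipeline never wrote its output."""
    return {
        "query": {
            "bool": {
                "must_not": [{"exists": {"field": inference_field}}]
            }
        }
    }
```

The resulting dict could be passed to a search or `update_by_query`-style call to collect the documents that need another pass through the pipeline.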
Problem Description
We've recently merged a PR to "Remove timestamp optimization for full syncs". This PR slows down some connectors quite a lot.
About optimisation:
Before the PR was merged, the connector service did not ingest documents into ES that had not changed since the previous sync. A bit of pseudocode to demonstrate:
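A hedged reconstruction of the removed optimisation (illustrative, not the actual framework code): a document is skipped when its timestamp is not newer than the copy already in the index.

```python
# Rough pseudocode of the timestamp optimisation: only new or updated
# documents are sent to Elasticsearch; unchanged ones are skipped.

def should_ingest(doc: dict, existing_timestamps: dict) -> bool:
    """Return True if the doc is new or changed since the last sync."""
    seen = existing_timestamps.get(doc["_id"])
    return seen is None or doc["_timestamp"] > seen

def sync(docs, existing_timestamps, ingest):
    for doc in docs:
        if should_ingest(doc, existing_timestamps):
            ingest(doc)  # only new/updated docs hit the index
```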
This optimisation was good because it avoided re-ingesting unchanged documents: some subsequent syncs took up to 10 times less time to finish - or even more. Now we don't have it.
Why was optimisation removed?
A couple of issues:
Proposed Solution
There are multiple ways to do this; I'll call out a few:
Feature flags
We can make all connectors work with this optimisation behind a feature flag: if the connector definition states `supports_timestamp_optimisation`, then the connector checks whether its feature flag is enabled (`connector.features.timestamp.optimisation == True`) and executes the optimisation only if it is.

New sync job type
As suggested by @timgrein, we can separate this logic into a "shallow sync" job type, allow it for each connector, and give control over it to the customer.
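A minimal sketch of what a separate job type could look like; the enum and its values are assumptions for illustration, not the framework's actual job types.

```python
from enum import Enum

# Sketch: "shallow sync" as its own job type. A full sync re-ingests
# everything; a shallow sync applies the timestamp optimisation and skips
# documents that have not changed since the last sync.

class JobType(Enum):
    FULL = "full"
    SHALLOW = "shallow"

def run_sync(job_type: JobType, docs, existing_timestamps, ingest):
    for doc in docs:
        if job_type is JobType.SHALLOW:
            seen = existing_timestamps.get(doc["_id"])
            if seen is not None and doc["_timestamp"] <= seen:
                continue  # unchanged since last sync, skip
        ingest(doc)
```

Because the job type is explicit, the customer (or a scheduler) decides per run whether to pay the cost of a full re-ingest.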
Re-thinking the hierarchy of connectors and allowing optimisation to be pluggable
We can change the framework to make such traits a "plugin" rather than "metadata" of the connector.
For example (really really raw thoughts in my head):
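A hedged sketch of what composition could look like; every name here is illustrative, not the framework's API.

```python
# Very rough sketch of "optimisation as a plugin": the connector is composed
# from small behaviour blocks instead of inheriting them.

class TimestampOptimisation:
    """Pluggable trait: decide whether a doc needs re-ingesting."""

    def __init__(self, existing_timestamps):
        self.existing = existing_timestamps

    def should_ingest(self, doc):
        seen = self.existing.get(doc["_id"])
        return seen is None or doc["_timestamp"] > seen

class Connector:
    """A connector assembled from a document source and optional plugins."""

    def __init__(self, source, plugins=()):
        self.source = source      # iterable of documents
        self.plugins = plugins    # optional behaviour blocks

    def sync(self, ingest):
        for doc in self.source:
            if all(p.should_ingest(doc) for p in self.plugins):
                ingest(doc)
```

A connector that doesn't want the optimisation simply isn't given the plugin, so no per-connector subclassing or metadata flags are needed.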
This idea is very raw and I'm still thinking about it, but I feel that extension via composition - building a connector from small blocks - rather than inheritance will move us forward.
Alternatives
Do nothing :)