elastic / connectors-ruby

Official Connector Clients for Elastic Elasticsearch, Enterprise Search, App Search and Workplace Search
https://www.elastic.co/guide/en/enterprise-search/master/index.html

Add checkpoints to the connectors protocol #506

Closed artem-shelkovnikov closed 1 year ago

artem-shelkovnikov commented 1 year ago

Closes https://github.com/elastic/enterprise-search-team/issues/3569

The aim of this PR is to discuss and finalise the protocol for job resilience: checkpoints.

A checkpoint is a point in a sync from which a connector can resume the job after a crash. Each connector can define its own checkpoint format based on its internal implementation.

For example, if a connector uses paging to fetch data, the checkpoint could look like this:

{
  "table_A": { "is_finished": true },
  "table_B": { "is_finished": false, "skip": 5000 }
}
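As a rough sketch of how a paging connector might resume from such a checkpoint, consider the following. The `sync_table` helper, the `fetch_page` callable and the `page_size` parameter are hypothetical names used for illustration only; they are not part of the connectors protocol.

```ruby
# Hypothetical sketch: resume a paged sync from a stored checkpoint.
# checkpoint: hash such as { 'table_B' => { 'is_finished' => false, 'skip' => 5000 } }
# fetch_page: callable taking (table, skip, limit) and returning an array of rows
#             (an assumed interface, not a real connector API)
def sync_table(table, checkpoint, fetch_page, page_size: 1000)
  state = (checkpoint[table] ||= { 'is_finished' => false, 'skip' => 0 })
  return [] if state['is_finished'] # nothing left to do for this table

  rows = []
  loop do
    skip = state['skip'] || 0
    page = fetch_page.call(table, skip, page_size)
    rows.concat(page)
    # Advance the checkpoint after every page so a crash loses at most one page.
    state['skip'] = skip + page.size
    if page.size < page_size
      state['is_finished'] = true
      break
    end
  end
  rows
end
```

In a real connector the checkpoint would be persisted to the job document after each page rather than only kept in memory.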

We also need to store the number of attempts made to continue from a particular checkpoint, so that we can terminate a job that keeps retrying but never reaches a new checkpoint.
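One possible way to track this, sketched under assumptions: the `checkpoint` and `checkpoint_attempts` field names, the helper names and the limit of 3 are all hypothetical and were not agreed in this PR.

```ruby
# Hypothetical sketch: count resume attempts per checkpoint.

# Store a new checkpoint; reaching a different checkpoint resets the counter.
def record_checkpoint(job, checkpoint)
  return job if job['checkpoint'] == checkpoint

  job['checkpoint'] = checkpoint
  job['checkpoint_attempts'] = 0
  job
end

# Returns true and bumps the counter if another resume is allowed,
# false once the job has retried the same checkpoint too many times.
def try_resume(job, max_attempts: 3)
  attempts = job.fetch('checkpoint_attempts', 0)
  return false if attempts >= max_attempts

  job['checkpoint_attempts'] = attempts + 1
  true
end
```

A caller would terminate the job when `try_resume` returns false.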

⚠️ Additional note about the suspended status: its description was incorrect, so I rewrote it based on my understanding, but I'm not 100% sure that it is correct.

sphilipse commented 1 year ago

This is done to avoid mapping conflicts between different service types. Consider two connectors with the following checkpoint formats: ConnectorA: { from => 1000 }; ConnectorB: { from => 2022-10-12 01:00:00+0}. In the first case from is a numeric field; in the second it is a date. To avoid mapping problems, we can nest these data structures per service type.

This creates a secondary problem when we have multiple types of checkpoints for a single service type ;)

I'd suggest not using the service type as a key and instead using something like 'checkpoint_data' as a key, defining the mapping of that checkpoint data in the jobs index as an object so Elasticsearch doesn't try to auto-detect types of properties.

Although we should probably just turn off automatic mapping on these indices if they aren't already turned off.
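To illustrate the suggestion, here is a minimal sketch of what such a mapping fragment might look like as a Ruby hash. The `checkpoint_data` field name comes from the comment above; using `enabled: false` on the object is my assumption about one way to keep Elasticsearch from parsing the nested properties, and the `jobs_checkpoint_mapping` method name is hypothetical.

```ruby
require 'json'

# Hypothetical mapping fragment for the jobs index.
# With enabled: false, Elasticsearch keeps the object in _source but does not
# parse or index its contents, so { from => 1000 } and { from => <date> } cannot conflict.
def jobs_checkpoint_mapping
  {
    properties: {
      checkpoint_data: { type: 'object', enabled: false }
    }
  }
end

puts JSON.pretty_generate(jobs_checkpoint_mapping)
```

The trade-off is that nothing inside `checkpoint_data` would be searchable, which is probably acceptable for data that is only read back by the connector itself.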

artem-shelkovnikov commented 1 year ago

> This creates a secondary problem when we have multiple types of checkpoints for a single service type ;)
>
> I'd suggest not using the service type as a key and instead using something like 'checkpoint_data' as a key, defining the mapping of that checkpoint data in the jobs index as an object so Elasticsearch doesn't try to auto-detect types of properties.
>
> Although we should probably just turn off automatic mapping on these indices if they aren't already turned off.

That's indeed the case - our system indices (jobs and connectors) allow for dynamic mapping, which is the problem that I'm trying to work around.

I can indeed fix it by turning off dynamic mapping (looking at https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic.html I just put dynamic: false)
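A minimal sketch of that change, assuming the index mappings are built as Ruby hashes. The `with_dynamic_disabled` helper and the `status` property are hypothetical; the `dynamic: false` setting is the one from the linked documentation.

```ruby
require 'json'

# Hypothetical helper: return a copy of the mappings with dynamic mapping turned off.
# With dynamic: false, fields that are not explicitly mapped are kept in _source
# but are not added to the mapping, so conflicting checkpoint types cannot clash.
def with_dynamic_disabled(mappings)
  mappings.merge(dynamic: false)
end

# Assumed, heavily simplified jobs mapping for illustration only.
jobs_mappings = { properties: { status: { type: 'keyword' } } }
puts JSON.pretty_generate(with_dynamic_disabled(jobs_mappings))
```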

artem-shelkovnikov commented 1 year ago

Closing this PR as these changes no longer make sense; see the changes in epic https://github.com/elastic/enterprise-search-team/issues/3328