elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

Support for multiple schemas in sync #2605

Open mbertani opened 1 month ago

mbertani commented 1 month ago

Problem Description

We need to synchronize data from multiple schemas into one Elasticsearch index to provide search to our users. For historical reasons, the data is spread across several schemas, and the current implementation of the connector only syncs data from one schema.

Currently, the connector just synchronizes data from one schema in PostgreSQL.

Proposed Solution

If the connector's Advanced Rules could support complex queries spanning multiple schemas, this would solve our problem without refactoring the database.
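For illustration only (this capability does not exist today), and assuming the same general shape as the connector's existing advanced sync rules (a list of entries with tables plus a query), a hypothetical multi-schema rule might look roughly like the sketch below. All schema, table, and column names are invented.

```python
# Hypothetical sketch only -- the connector does not support this today.
# The "tables"/"query" keys mirror the general shape of the PostgreSQL
# connector's advanced sync rules; the schema-qualified names and the query
# itself are invented for illustration.
proposed_rule = [
    {
        "tables": ["workflow.task", "projects.project"],
        "query": (
            "SELECT t.id, t.status, p.name AS project_name "
            "FROM workflow.task AS t "
            "JOIN projects.project AS p ON p.id = t.project_id"
        ),
    }
]
```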

Alternatives

No alternatives have been identified.

Additional Context

seanstory commented 1 month ago

Hi @mbertani, thanks for filing.

Can you help me understand your use case a bit more? What type of data is stored in separate schemas in Postgres that you don't want spread across separate indices in Elasticsearch? How do you intend to query/aggregate the data? And why is related data separated by schema, rather than just being different tables in the same schema?

Typically, our suggestion would be to configure one connector per schema, so that you end up with one destination index per schema. Then build your search experience/tool to query over multiple indices (see the sketch below), or, if you're using App Search, create a Meta Engine over multiple Engines. Helping us understand why this won't work for you will help us prioritize appropriately.
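As a minimal sketch of the multi-index approach, using the official Python client; the index names and the query are illustrative, not from this thread:

```python
# A minimal sketch of querying several per-schema indices at once with the
# official Python client. Index names and the query are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="search-schema-a,search-schema-b",  # comma-separated list of indices
    query={"match": {"status": "open"}},
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_index"], hit["_id"], hit["_score"])
```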

mbertani commented 1 month ago

Hi @seanstory, thanks for getting back to me so quickly.

This is a legacy system that has grown over several years. For historical reasons, the application's data is spread over several schemas. There is a global search in the application that is the central point for users to pick tasks and see their status. The domain is a bit complex, and the global search requires over 40 fields spread across 4 schemas and 10+ tables to denormalise the data correctly. Each row of the resulting query needs to map to exactly one Elasticsearch document for the search to work properly.
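As a rough sketch of that "one denormalised row per document" shape, run outside the connector; every schema, table, column, and index name here is invented for illustration:

```python
# Rough sketch, outside the connector: a cross-schema join where each result
# row becomes exactly one Elasticsearch document. All names are invented.
import psycopg2
from elasticsearch import Elasticsearch, helpers

QUERY = """
SELECT t.id, t.status, p.name AS project_name, c.display_name AS customer_name
FROM   workflow.task    AS t
JOIN   projects.project AS p ON p.id = t.project_id
JOIN   crm.customer     AS c ON c.id = p.customer_id
"""

es = Elasticsearch("http://localhost:9200")

def actions():
    with psycopg2.connect("dbname=app user=search") as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            columns = [col.name for col in cur.description]
            for row in cur:
                doc = dict(zip(columns, row))
                # one joined row maps to one document in the search index
                yield {"_index": "global-search", "_id": doc["id"], **doc}

helpers.bulk(es, actions())
```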

I have thought about the idea of splitting this query per schema, syncing each to Elasticsearch, and then using an ingest pipeline to join the data in the final index. But that seems like a fragile and cumbersome approach.

The database powers several products that face the same challenges. If we are to successfully move our search to Elastic, we need to solve the sync issue in near real time. Currently we have Logstash pipelines running with the JDBC input plugin (which can be scheduled with a cron-like setup). This brings delays that cause problems for our users.

seanstory commented 1 month ago

I have thought about the idea of splitting this query per schema, syncing each to Elasticsearch, and then using an ingest pipeline to join the data in the final index. But that seems like a fragile and cumbersome approach.

That's an interesting idea. You're talking about using a reroute processor? I'm not aware of any users of connectors who use a reroute processor in their pipelines. I expect it would prevent you from being able to do incremental syncs, since the connector's expected destination index would always be empty. It might be more stable to use a re-index operation to copy source data from several raw indices into one aggregate index, though you'd still end up with distinct docs from different schemas, whereas it sounds like you need data that's split across schemas to be denormalized into a single Elasticsearch document?
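For reference, the re-index idea could look roughly like this with the Python client (index names are illustrative); note that it copies documents as-is rather than joining rows from different schemas into one document:

```python
# Illustrative only: copy documents from several per-schema indices into one
# aggregate index with the reindex API. This does not join or denormalise;
# each source document stays a separate document in the destination.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.reindex(
    source={"index": ["search-schema-a", "search-schema-b"]},
    dest={"index": "global-search"},
    wait_for_completion=True,
)
```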

we need to solve the sync issue in near real time. Currently we have Logstash pipelines running with the JDBC input plugin (which can be scheduled with a cron-like setup)

Before you invest too much in this, I want to make sure you understand that connectors are not typically considered "near real time". They do not have "streaming" or "changelog" capabilities. They are a batch processing (maybe microbatching, with incremental syncs) tool. SQL syncs can be pretty speedy, and might be able to keep your data up to date in the minutes-to-hours range, but I wouldn't expect "near real time" of seconds or subseconds. I wouldn't expect connectors to be substantially different from the Logstash JDBC input in terms of speed or architecture.

If we are to successfully move our search to Elastic [...]

I'll get this issue in front of our product managers for prioritization. However, if you have a sales or support contact at Elastic, you may want to also send this issue link to them, to make sure we're tracking this enhancement request alongside our customer record for you. Elastic is a business, so customers with paid licenses do tend to have their requests prioritized first. :)