airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Destination Elasticsearch: full_refresh - overwrite generate duplicates #16182

Open marcosmarxm opened 2 years ago

marcosmarxm commented 2 years ago

This GitHub issue is synchronized with Zendesk:

Ticket ID: #2084
Priority: normal
Group: Community Assistance Engineer
Assignee: Marcos Marx

Original ticket description:

I’m experimenting with the different syncModes. I expected that if I ingested 100 records into Elasticsearch with the sync set to full_refresh - overwrite, running the sync again would delete the index and ingest the same 100 records. However, I get 200 records in the index.

I’m running version 0.1.3.

Taking a quick look through the code (maybe around here: airbyte/ElasticsearchAirbyteMessageConsumerFactory.java at 09aa685aadbb6dc45f66b2c87d4c13af9b88e6b2 in airbytehq/airbyte on GitHub), I didn’t see it delete the index based on syncMode.

I wanted to check with people who might know more, because the docs say that the full_refresh syncMode is supported.
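
The 100-versus-200 record count described in this ticket can be reproduced in a small, self-contained simulation. This is only a sketch: `FakeIndex` and `sync` are hypothetical names, not Airbyte or Elasticsearch code, and it assumes every write receives a fresh document ID, which would be consistent with the duplicate count reported here.

```python
import uuid


class FakeIndex:
    """Hypothetical in-memory stand-in for an Elasticsearch index."""

    def __init__(self):
        self.docs = {}

    def write(self, record):
        # Assumption: every write gets a new random ID, so nothing is deduplicated.
        self.docs[uuid.uuid4().hex] = record

    def clear(self):
        self.docs = {}


def sync(index, records, overwrite):
    """Write records; clear the index first when overwrite semantics are requested."""
    if overwrite:
        index.clear()
    for record in records:
        index.write(record)
    return len(index.docs)


records = [{"n": i} for i in range(100)]

# Without overwrite semantics, the second run adds another 100 documents.
appended = FakeIndex()
sync(appended, records, overwrite=False)
append_count = sync(appended, records, overwrite=False)

# With overwrite semantics, the second run replaces the first.
overwritten = FakeIndex()
sync(overwritten, records, overwrite=True)
overwrite_count = sync(overwritten, records, overwrite=True)
```

Here `append_count` corresponds to the observed behaviour and `overwrite_count` to the behaviour the reporter expected.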

[Discourse post]

marcosmarxm commented 2 years ago

Comment made from Zendesk by Marcos Marx on 2022-08-31 at 18:39:

Thanks for reporting this, Ryan. I opened the issue https://github.com/airbytehq/airbyte/issues/16182 to get this fixed.

wirelessrpm commented 2 years ago

I've written a fix for this on my fork, but I'm not sure where to aim the PR when submitting it.

marcosmarxm commented 2 years ago

@wirelessrpm please read Airbyte's docs about contributing to the project: https://docs.airbyte.com/contributing-to-airbyte/

joelluijmes commented 1 year ago

FYI: I also just ran into this. I dug into the code a bit, and the OVERWRITE mode only works when the Upsert Records checkbox is deselected (it is on by default). When it is off, the sync writes to a temporary index, which then replaces the original index. I suppose this is meant to provide some form of atomic replacement and so avoid downtime.

Not sure if this is the most obvious solution, but it does seem to work 👍

TLDR: can be fixed by disabling the Upsert Records config.
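
The temporary-index replacement described above can be sketched as follows. This is a minimal in-memory model, not the connector's actual code: `FakeCluster`, `overwrite_sync`, and the `tmp_` prefix are hypothetical, and the upsert branch assumes records carry no ID that would let a second write replace the first.

```python
class FakeCluster:
    """Hypothetical in-memory stand-in for a cluster: index name -> list of documents."""

    def __init__(self):
        self.indices = {}

    def bulk_write(self, name, records):
        self.indices.setdefault(name, []).extend(records)

    def replace(self, target, source):
        # Swap in one step: the target only ever holds the old or the new data.
        self.indices[target] = self.indices.pop(source)


def overwrite_sync(cluster, index_name, records, upsert):
    if upsert:
        # Behaviour described in the thread: records go straight into the
        # existing index, so nothing from the previous run is removed.
        cluster.bulk_write(index_name, records)
    else:
        tmp_name = "tmp_" + index_name  # hypothetical naming scheme
        cluster.bulk_write(tmp_name, records)
        cluster.replace(index_name, tmp_name)
    return len(cluster.indices[index_name])


records = [{"n": i} for i in range(100)]

with_upsert = FakeCluster()
overwrite_sync(with_upsert, "users", records, upsert=True)
upsert_count = overwrite_sync(with_upsert, "users", records, upsert=True)

without_upsert = FakeCluster()
overwrite_sync(without_upsert, "users", records, upsert=False)
swap_count = overwrite_sync(without_upsert, "users", records, upsert=False)
```

In this model, two runs with upsert enabled leave twice the records in the index, while two runs with it disabled leave a single copy and no leftover temporary index.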

khiem20tc commented 1 year ago

Hi, how do I disable Upsert Records so that I can use Overwrite mode with Elasticsearch? Is it configured in the Airbyte source code or in the Elasticsearch config?

joelluijmes commented 1 year ago

'Upsert Records' is just a configuration setting in Airbyte when setting up the connection. You don't need to change the code.
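
As a rough illustration, a configuration with the setting turned off might look like the sketch below. The key names (`endpoint`, `upsert`) and the URL are assumptions made for this example; only the 'Upsert Records' label comes from this thread, so check the connector's specification for the exact names.

```python
import json

# Hypothetical Elasticsearch destination configuration.
# The key names are assumptions; only the "Upsert Records" label is from this thread.
destination_config = {
    "endpoint": "http://localhost:9200",  # placeholder URL
    "upsert": False,  # assumed key behind the "Upsert Records" checkbox
}

# Serialize the configuration, e.g. to compare it with what the UI shows.
config_json = json.dumps(destination_config, indent=2)
```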

khiem20tc commented 1 year ago

Hi, I don't see that setting when I set up my connections. Could you please show me where I can find it? Many thanks!