airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Source Salesforce: creates duplicates after update #20471

Closed paul-chrlt closed 1 year ago

paul-chrlt commented 1 year ago

Environment

Current Behavior

Duplicate rows have been created since the Salesforce connector was updated. Sync mode is set to Full Refresh | Overwrite.

Expected Behavior

There were no duplicates with Salesforce 1.0.2. The only difference is the Salesforce connector update.

Steps to Reproduce

  1. Set up a replication for a Salesforce object containing enough entities (30k), with Sync mode set to Full Refresh | Overwrite
  2. Set Salesforce to version 1.0.2 and sync --> no duplicates
  3. Set Salesforce to version 1.0.27 and sync --> duplicates (around 30)
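
To compare the two syncs in steps 2 and 3, one option is to count how often each record identifier appears in the synced output. The snippet below is a minimal sketch, not part of the connector: the helper name `duplicated_ids` is hypothetical, and it assumes the records are available as dictionaries keyed by the Salesforce `Id` field.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicated_ids(records: Iterable[dict], key: str = "Id") -> Dict[str, int]:
    """Return every identifier that appears more than once, with its count."""
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample data standing in for the rows of a synced table.
sample = [{"Id": "001A"}, {"Id": "001B"}, {"Id": "001A"}]
print(duplicated_ids(sample))  # {'001A': 2}
```

An empty result would correspond to the "no duplicates" outcome of step 2.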

SophieLohezic commented 1 year ago

Hello. We still have this issue. Have you had a chance to look at it yet? We are stuck on the Salesforce 1.0.2 connector, as all later connector versions create duplicate identifiers in one of our tables (recently tested on 2.0.1, 2.0.5 and 2.0.6). Many thanks for your help.

poolmaster commented 1 year ago

I ran into the same issue: full_refresh always creates duplicates. It seems to me the bug is on this line: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py#L476

We should not use >= there, because it re-includes rows from the previous batch.
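
The failure mode described here can be illustrated with a small, self-contained sketch. It is not the connector's code (the linked line is not reproduced in this thread); `fetch_all`, the integer keys and the page size are all hypothetical. It assumes each batch is selected by comparing a sorted, unique key against the last key of the previous batch, and that paging stops when a batch comes back shorter than the page size.

```python
from typing import List


def fetch_all(keys: List[int], page_size: int, inclusive: bool) -> List[int]:
    """Page through sorted, unique keys, resuming from the last key seen.

    inclusive=True mimics a ">=" filter on the last key of the previous
    batch; inclusive=False mimics a strict ">" filter.
    """
    out: List[int] = []
    last = None
    while True:
        if last is None:
            matches = keys
        elif inclusive:
            matches = [k for k in keys if k >= last]  # re-reads the boundary row
        else:
            matches = [k for k in keys if k > last]   # starts after the boundary row
        page = matches[:page_size]
        out.extend(page)
        if len(page) < page_size:  # a short batch means there is nothing left
            break
        last = page[-1]
    return out


keys = list(range(1, 11))
print(fetch_all(keys, page_size=4, inclusive=True))
print(fetch_all(keys, page_size=4, inclusive=False))
```

In this sketch the inclusive variant returns the boundary row of every full batch twice, while the strict variant returns each key exactly once. Under these assumptions the number of duplicates grows with the number of batches, which would fit the observation that only the largest streams are affected.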

arsenlosenko commented 1 year ago

@paul-chrlt Hi, can you please elaborate on which streams you are experiencing this problem with? In my own testing I was not able to reproduce it exactly. There is a PR that should add checkpointing to bulk streams (https://github.com/airbytehq/airbyte/pull/24888), which might resolve the issue you are having, but I need to know whether you still have this problem on the latest version of the Salesforce connector. I also created a PR with the fix that @poolmaster proposed (https://github.com/airbytehq/airbyte/pull/24779), but I need to confirm that it actually covers the streams you have problems with.

paul-chrlt commented 1 year ago

Hi @arsenlosenko, we are experiencing this issue on the streams with the highest number of records:

The other streams have fewer than 10k records and we have never had duplicates on them. We will try the latest version of the Salesforce connector, and I will keep you informed.

arsenlosenko commented 1 year ago

@paul-chrlt Hi, thanks for the clarification. We will let you know when the changes that should resolve this issue (https://github.com/airbytehq/airbyte/pull/24888) are merged, so you can try syncing the streams in question again.

roman-yermilov-gl commented 1 year ago

Fixed: https://github.com/airbytehq/airbyte/pull/24888