airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
14.9k stars 3.83k forks source link

Source MSSQL: CDC commit logs are not fetched properly #25237

Open uuboyscy opened 1 year ago

uuboyscy commented 1 year ago

Environment

Current Behavior

I integrate data from SQL Server 2017 with CDC enable to BigQuery, and configure "Incremental Deduped+History" on Airbyte. Some of data are not integrated correctly. These data would be modified multiple times in SQL Server in almost the same time, and the wrong order of commit would be kept in target destination. As shown in following BigQuery photo, I rerun the SQLs generated when Airbyte syncing, found that the order would lead to wrong row of data kept.

Expected Behavior

Keep correct data in target destination. As Bigquery screenshot shown below, the correct data is the row with TC027 = 'Y', but the one with TC027 = 'U' is kept now. Further, maybe Airbyte does not use column __$command_id in MSSQL CDC log screenshot to order the latest commit, so that the older commit is update to target destination.

Logs

Steps to Reproduce

As shown in above Bigquery screenshot, I rerun the SQLs generated when Airbyte syncing, found that the order would lead to wrong row of data kept (The data with TC027 = 'U' is kept, but TC027 = 'Y' is correct row).

HonzaPavel commented 1 year ago

I have a very similar problem. I have a connection between MS SQL Server and BigQuery. I have set up CDC on MS SQL Server. I am using Incremental Sync - Deduped History. SCD historization is not working correctly because CDC on MS SQL Server uses local time zone instead of UTC. In BigQuery the difference is 2 hours. Isn't that the reason for your problems? Is it not possible to set the time zone change to UTC in Airbyte?

image

uuboyscy commented 1 year ago

Hi @HonzaPavel

It is one of problems. This also cause that timezone of the first update would be your local timezone, but later incremental update become UTC. I'm not sure why it happen, I think it is a bug of Debezium.

To solve this problem for short-term, as you can see my reply in this post, using k8tz to simply set global timezone for all newly created pods can solve the timezone issue if you establish Airbyte on K8s. But the key problem is source-mssql-connector. Add timezone configuration to this Dockerfile, and execute gradle command here to build custom source-connector image, then you can solve the key problem.

For another problem you can refer to this PR. Source-mssql-connector missed change_lsn, which you can find in Debezium document, so that the basic-normalization fail to sort the latest and correct CDC commit. If you also have this problem, try changes in the PR and re-build a custom source-mssql-connector image via gradle command mentioned above.

Even if you fix the problems mentioned above, destination data may still incorrect because of basic-normalization not using change_lsn to sort data. Then all you need to do is modify stream_processor.py so that the basic-normalization process can generate correct DBT model.

HonzaPavel commented 1 year ago

Hi @uuboyscy, It is more complicated than I thought. Changing the time zone in the Dockerfile makes sense. Thank you. I will try it. Shouldn't the other problems be fixed in Airbyte? I think they're bugs. I'd like to help, but I don't have enough experience contributing to open source projects. Have you tried adding these changes to the Airbyte code?

visma-henna-pekkala commented 8 months ago

Hey! I just found this issue and I think it might be the same that has caused trouble for us, which I have described here! Has the fix you described above worked properly for the time being? Unfortunately, the docker link cannot be found.

plenti-jacob-roe commented 5 months ago

Hi @HonzaPavel

It is one of problems. This also cause that timezone of the first update would be your local timezone, but later incremental update become UTC. I'm not sure why it happen, I think it is a bug of Debezium.

To solve this problem for short-term, as you can see my reply in this post, using k8tz to simply set global timezone for all newly created pods can solve the timezone issue if you establish Airbyte on K8s. But the key problem is source-mssql-connector. Add timezone configuration to this Dockerfile, and execute gradle command here to build custom source-connector image, then you can solve the key problem.

For another problem you can refer to this PR. Source-mssql-connector missed change_lsn, which you can find in Debezium document, so that the basic-normalization fail to sort the latest and correct CDC commit. If you also have this problem, try changes in the PR and re-build a custom source-mssql-connector image via gradle command mentioned above.

Even if you fix the problems mentioned above, destination data may still incorrect because of basic-normalization not using change_lsn to sort data. Then all you need to do is modify stream_processor.py so that the basic-normalization process can generate correct DBT model.

in case people are still having this issue we had a similar issue not including change_lsn is the cause of the problem since multiple records share the same commit_lsn and same timestamp. https://github.com/airbytehq/airbyte/pull/24206 had the fix for the problem and we are running this code in a custom connector and it solves the problem. Unfortunately it was closed and no fix has been implemented so if you have a transaction in sql server with multiple changes to the same row you can get incorrect data in the destination. @prateekmukhedkar is there a planned fix for this, your last comment on the pr seems to leave it unanswered.