airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
16.27k stars 4.15k forks source link

🐛 Destination Snowflake: Some rows end up having all fields as null in destination #9925

Closed mclate closed 2 years ago

mclate commented 2 years ago

Environment

Current Behavior

We have been running sync jobs between mysql (AWS RDS) and snowflake for quite some time now. One very strange thing we've noticed is that in the destination dataset, rows for some months have all fields resulting in null. To give you brief example, here is the query that was executed in rds and in snowflake after a full sync: select count(distinct d.id) from data d where page is null. RDS result is 0, while Snowflake gave us 442134 (among total of ~25M records).

Below I pasted examples of couple rows as they are in RDS and in Snowflake. First two lines are incorrectly imported, while last two are there as an example of correctly imported ones:

RDS:

id browser crawler_id created_at geo_location_id ip_lookup_ip is_mobile lead_id os page page_type_calculated_status project_id referrer session_id user_agent display_page ope ope_status updated_at page_id
3576583 Chrome 94.0.4606.52 NULL 2021-10-08 18:37:57 1272435 NULL 1 1174769 iPhone OS https://stix.golf/collections/all 0 9 1787199 Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/94.0.4606.52 Mobile/15E148 Safari/604.1 /collections/all 0 1 2021-12-06 18:55:46 10
3326765 Safari 15E148 NULL 2021-09-21 13:18:49 1160735 NULL 1 1082862 iPhone OS https://stix.golf/products/the-stix-stand-golf-bag 0 9 1651116 Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Instagram 205.0.0.20.115 (iPhone11, 2; iOS 14_7_1; en_US; en; scale=3.00; 1125x2436; 317250287) /products/the-stix-stand-golf-bag 0 1 2021-12-06 20:12:02 69
5323937 Android 16.0 NULL 2022-01-30 02:35:09 1806615 NULL 1 1761595 Android https://stix.golf/collections/sets?gclid=Cj0KCQiA6NOPBhCPARIsAHAy2zAAQuclWI_LkzxNKbMmvzNqcXJPPmAE1B273qMl087cQOcP39O_ZRkaAkDdEALw_wcB 0 9 2661189 Mozilla/5.0 (Linux; Android 11; SAMSUNG SM-G986U) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/16.0 Chrome/92.0.4515.166 Mobile Safari/537.36 /collections/sets 0 1 2022-01-30 03:33:04 14
5323930 Chrome 97.0.4692.99 NULL 2022-01-30 02:34:07 1807578 NULL 0 1763610 Windows https://www.usgboral.com/en_au/whats-new/news-and-promotions/compounds-dry-out.html 0 1 https://www.google.com/ 2661134 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 /en_au/whats-new/news-and-promotions/compounds-dry-out.html 30 2 2022-01-30 03:32:42 1700

Snowflake:

ID OS OPE PAGE BROWSER LEAD_ID PAGE_ID REFERRER IS_MOBILE CRAWLER_ID CREATED_AT OPE_STATUS PROJECT_ID SESSION_ID UPDATED_AT USER_AGENT DISPLAY_PAGE IP_LOOKUP_IP GEO_LOCATION_ID PAGE_TYPE_CALCULATED_STATUS _AIRBYTE_AB_ID _AIRBYTE_EMITTED_AT _AIRBYTE_NORMALIZED_AT _AIRBYTE_DATA_HASHID _AIRBYTE_UNIQUE_KEY
3576583 2022-01-16T21:36:18Z 2022-01-16T21:36:18Z c581acde-47ae-4b4a-9cbb-f83618af47b7 2022-01-27 16:26:34.824 -0800 2022-01-28 16:29:11.030 +0000 163332a2555b7c028b165991fcb56795 7d8b006b601eedd4565d273bb4437083
3326765 2021-12-19T21:20:52Z 2021-12-19T21:20:52Z 5cf4e70e-653f-4790-b2bd-eda8a9cda07d 2022-01-27 16:26:34.824 -0800 2022-01-28 16:29:11.030 +0000 d2ae69c5df95132c9b4fe1ec0a9b679e 7d910947fc18a436e2b1067876916659
5323937 Android 0 https://stix.golf/collections/sets?gclid=Cj0KCQiA6NOPBhCPARIsAHAy2zAAQuclWI_LkzxNKbMmvzNqcXJPPmAE1B273qMl087cQOcP39O_ZRkaAkDdEALw_wcB Android 16.0 1761595 14 TRUE 2022-01-30T02:35:09Z 1 9 2661189 2022-01-30T03:33:04Z Mozilla/5.0 (Linux; Android 11; SAMSUNG SM-G986U) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/16.0 Chrome/92.0.4515.166 Mobile Safari/537.36 /collections/sets 1806615 0 29638d78-32ad-4f78-817a-a6e21b2ff02b 2022-01-30 15:24:15.162 -0800 2022-01-31 10:33:23.919 +0000 1218582a611e70d46615c798efbe785a 81be4e73eae871f6c54f9a847cf302d0 2661189 1761595 2819 9 2022-01-30T03:05:42Z 0 2022-01-30T02:35:09Z 2022-01-30T02:35:09Z 1806615 5323937 1f03b829-fe82-49c0-8639-275357379465 2022-01-30 15:24:15.162 -0800 2022-01-31 10:33:22.832 +0000 dfdcc65e91c44818c5ae6cbefb49f97b 1b7b5811c36993c8c97d7940f25b2321
5323930 Windows 30 https://www.usgboral.com/en_au/whats-new/news-and-promotions/compounds-dry-out.html Chrome 97.0.4692.99 1763610 1700 https://www.google.com/ FALSE 2022-01-30T02:34:07Z 2 1 2661134 2022-01-30T03:32:42Z Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 /en_au/whats-new/news-and-promotions/compounds-dry-out.html 1807578 0 9098ca4b-2cec-41af-a5ad-4d75f16177fe 2022-01-30 15:24:15.162 -0800 2022-01-31 10:33:23.919 +0000 9b25fe5f4a67d712ac420658c0b6e6ea a55bbe9ab540bd9fba96ecd2befda4a7 2661134 1763610 2 1 2022-01-30T03:32:50Z 30 2022-01-30T02:34:07Z 2022-01-30T02:34:07Z 1807578 5323930 5091fd34-a593-4c2c-ab57-71047e529685 2022-01-30 15:24:15.162 -0800 2022-01-31 10:33:22.832 +0000 14e43a67ac909a27ce7ac890aad44d91 ca3a26ab707005af47f5a6459f1892eb

The most confusing part is that this happens only for couple months of data - all other months are ok (the data itself is quite consistent in the source)

In Snowflake configuration we were initially using S3 COPY method, however, last full sync was done without it, and it still contains incorrect records

Expected Behavior

All field to be exported correctly

Logs

At this point I'm not sure what logs would be relevant for this issue. Let me know what additional details would be helpful.

cc @mewis

alafanechere commented 2 years ago

Hi @mclate, could you please check if in your raw data table the page field is also null for some records? I'd like to understand if it's a normalization-related problem or a replication problem.

mewis commented 2 years ago

Hi @alafanechere . I work with @mclate so can hopefully answer that. I just checked but seems there are no results in the raw table that match the _AIRBYTE_AB_ID field.

I tried

select * from _AIRBYTE_RAW_DATA
where _AIRBYTE_AB_ID = 'c581acde-47ae-4b4a-9cbb-f83618af47b7'

and got 0 results.

when running

select * from data 
where _AIRBYTE_AB_ID = 'c581acde-47ae-4b4a-9cbb-f83618af47b7'

I do get a result, although it is one of the broken results.

alafanechere commented 2 years ago

This discrepancy between raw data and normalized data is not a good sign, do you mind resetting the data for this connection and running a full refresh again?

mewis commented 2 years ago

@alafanechere I shall set that off now, will take a while.

What would you like me to check once it's done?

alafanechere commented 2 years ago

I'd like to check if you still face the same problem, your raw table should not miss records that are present in normalized tables.

mewis commented 2 years ago

Hi @alafanechere it looks sorted. The strange thing is myself and @mclate did this several times the other day, each time had the same result.

One thing I did differently although don't know if this would effect anything is I reset the data, had a look in snowflake and saw that there was a schema airbyte_{schema-name} as well as the schema without airbyte in front of it.. I dropped this schema and the newly created one, hit refresh again and saw only the target schema and not the airbyte_ one. I then did a sync and it seems to have worked.