airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

JSONB columns break Snowflake destination #3283

Closed sherifnada closed 3 years ago

sherifnada commented 3 years ago

Description

We don't make guarantees about properly encoding binary data in sources or destinations. In the case of #3251, syncing a JSONB column from Postgres to Snowflake broke the destination. This is likely happening for other database sources/destinations as well, but I haven't verified it.

We should come up with a way for sources that could output binary data to do so in a JSON Schema-compliant way (potentially base64 encoding). The design should also allow destinations and normalization to write this data to the destination in a way that makes sense. For example, destinations that support JSONB should store it appropriately; otherwise we should keep the base64 string encoding. Ideally this can be built into the abstract JDBC abstraction to reduce wasted work, but we'll have to apply judgment.
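
To make the base64 idea concrete, here is a purely illustrative Postgres query (not the actual connector change, which would live in the JDBC abstraction); encode() with the 'base64' format is a standard Postgres function, and the table/column names here are hypothetical:

-- Illustration only: emit binary data as base64 text and JSONB as plain text
-- so every value the source produces is a JSON Schema-compliant string.
-- "example_table", "binary_col" and "jsonb_col" are hypothetical names.
SELECT id,
       encode(binary_col, 'base64') AS binary_col_base64,
       jsonb_col::text              AS jsonb_col_text
FROM public.example_table;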

Logs

An example issue can be found in #3251. Logs: airbyte_issue_2021-05-05.log

Steps to Reproduce

  1. Set up replication from Postgres to Snowflake or Postgres, with normalization turned on and a JSONB column in the source (a minimal source-side sketch follows this list)
  2. Replicate the JSONB column
  3. 💣
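
A minimal source-side setup for step 1 could look like the following; the table, column, and sample value are only examples:

-- Example only: a small Postgres table with a JSONB column to replicate.
CREATE TABLE public.jsonb_repro (
    id   integer PRIMARY KEY,
    data jsonb
);

INSERT INTO public.jsonb_repro VALUES (1, '{"message": "Hello", "tags": ["a", "b"]}');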

Severity of the bug for you

Medium

royt-via commented 3 years ago

Hi @sherifnada, any estimate of when this issue will be fixed? Is it still scheduled for 2021-05-14?

sherifnada commented 3 years ago

@davinchia any input on ETA?

davinchia commented 3 years ago

I believe I can have something working by Monday.

@royt-via if possible, do you mind sharing one or two of the JSONB entries that are causing the issue for you? I want to make sure this fix nails it. Thanks.

royt-via commented 3 years ago

Hey @davinchia, I'm trying to full-sync a table with some JSONB columns, so I don't know which specific JSONB object is causing this error. Most of them are pretty standard JSON objects with a few flat fields. For example:

{
    "department_id": "c1c3d3f9-958b-45be-a6cf-ed7dc59ec73c",
    "message": "Hello"
}
davinchia commented 3 years ago

@momer do you by any chance have the offending json object? I want to make sure this fix gets it.

davinchia commented 3 years ago

@royt-via @momer I've not been able to reproduce this locally so far.

I've set up a Postgres table with a JSONB column and successfully synced it to a Snowflake database. These are the JSON forms I'm currently syncing:

INSERT INTO public.jsonb_table VALUES (1, '{"name": "Paint house", "tags": ["Improvements", "Office"], "finished": true}');
INSERT INTO public.jsonb_table VALUES (2, '{"name": "Wash dishes", "tags": ["Clean", "Kitchen"], "finished": false}');
INSERT INTO public.jsonb_table VALUES (3, '{"name": "Cook lunch", "tags": ["Cook", "Kitchen", "Tacos"], "ingredients": ["Tortillas", "Guacamole"], "finished": false}');
INSERT INTO public.jsonb_table VALUES (4, '{"name": "Vacuum", "tags": ["Clean", "Bedroom", "Office"], "finished": false}');
INSERT INTO public.jsonb_table VALUES (5, '{"name": "Hang paintings", "tags": ["Improvements", "Office"], "finished": false}');
INSERT INTO public.jsonb_table VALUES (6, '{
  "streams": [
    {
      "name": "accounts",
      "json_schema": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
          "ownerId": {
            "type": "integer"
          },
          "name": {
            "type": "string"
          },
          "domain": {
            "type": "string"
          },
          "accountId": {
            "type": "string"
          },
          "customProperties": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "label": {
                  "type": "string"
                },
                "name": {
                  "type": "string"
                },
                "value": {},
                "type": {
                  "type": "string"
                }
              }
            }
          },
          "deleted": {
            "type": "boolean"
          },
          "createDateTime": {
            "type": "integer"
          },
          "updateDateTime": {
            "type": "integer"
          },
          "targeted": {
            "type": "boolean"
          }
        }
      }
    },
    {
      "name": "conversations",
      "json_schema": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
          "id": {
            "type": "integer"
          },
          "participants": {
            "type": "array",
            "items": {
              "type": "integer"
            }
          },
          "status": {
            "type": "string",
            "enum": ["open", "closed", "pending", "bulk_sent"]
          },
          "contactId": {
            "type": "integer"
          },
          "inboxId": {
            "type": "integer"
          },
          "createdAt": {
            "type": "integer"
          },
          "updatedAt": {
            "type": "integer"
          },
          "relatedPlaybookId": {
            "type": ["null", "string"]
          },
          "conversationTags": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "color": {
                  "type": "string",
                  "description": "HEX value"
                },
                "name": {
                  "type": "string"
                }
              }
            }
          }
        }
      }
    },
    {
      "name": "users",
      "json_schema": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
          "id": {
            "type": "integer"
          },
          "orgId": {
            "type": "integer"
          },
          "name": {
            "type": "string"
          },
          "alias": {
            "type": "string"
          },
          "email": {
            "type": "string"
          },
          "phone": {
            "type": "string"
          },
          "locale": {
            "type": "string"
          },
          "availability": {
            "type": "string"
          },
          "role": {
            "type": "string"
          },
          "timeZone": {
            "type": "string"
          },
          "avatarUrl": {
            "type": "string"
          },
          "verified": {
            "type": "boolean"
          },
          "bot": {
            "type": "boolean"
          },
          "createdAt": {
            "type": "integer"
          },
          "updatedAt": {
            "type": "integer"
          }
        }
      }
    }
  ]
}
');

INSERT INTO public.jsonb_table VALUES (7, '[ {"type": "mobile", "phone": "001001"} , {"type": "fix", "phone": "002002"} ]');
INSERT INTO public.jsonb_table VALUES (8, '[ {"type": "mobile", "phone": "001001"} , {"type": "fix", "phone": "00200\\asd\\qwe\\asdf/asdfblahblah2"} ]');
INSERT INTO public.jsonb_table VALUES (9, '{"type": "mobile", "phone": "001001   "}');
INSERT INTO public.jsonb_table VALUES (10, '{"type": "mobile", "phone": "001001   ", "object": {"type": "mobile", "phone": "001001   "} }');
INSERT INTO public.jsonb_table VALUES (11, '{"type": "mobile", "phone": "\\n\\r \r\n \n \r \\036" }');
INSERT INTO public.jsonb_table VALUES (12, '{"type": "mobile", "phone": "\\n\\r \r\n \n \r \\000a036"}');

These cover a variety of JSON shapes to make sure the fix works. I've also deliberately tried to insert invalid JSON (e.g. hidden characters) into the jsonb table, but Postgres rejects it, rightfully so.

Before I push any fix, I want to make sure I'm solving the right problem. Since I'm unable to reproduce this locally, can I trouble you to run an image I've prepared that has additional logging around JSON objects, plus specific exception catching around the points where I think this error is happening? The code I've added can be seen in this PR. To do so, please update your source-postgres and destination-snowflake images to use the debug-logging tag.

Since this logging is verbose, you might need to add disk space to the machines the job is being run on.

@royt-via from the provided logs, it looks like your error is originating from a record at the end of the table. I believe running the job on the last 20,000 records in the table will catch our offender. This might save you some time.
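
If it helps, one hypothetical way to do that is to point the connection at a view over the tail of the table; this assumes a monotonically increasing column such as id exists to order by, and all names below are placeholders:

-- Sketch only: expose just the last 20,000 rows to the sync via a view.
-- Assumes an ordering column ("id") exists; all names are placeholders.
CREATE VIEW public.my_table_last_20k AS
SELECT *
FROM public.my_table
ORDER BY id DESC
LIMIT 20000;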

Thanks guys!

royt-via commented 3 years ago

@davinchia, sure! I've just run it, and I'm confused: it succeeded (7.44 GB | 24,003,992 records | 1h 6m 4s | Sync), yet I don't see any tables/records in the destination schema. I double-checked all settings. Here are the logs of this execution.

davinchia commented 3 years ago

Thanks @royt-via! From your logs it does seem like the sync succeeded. All the JSONB stuff looks good, so at least that's working well.

I do see a Snowflake error right at the end - perhaps that affected things? So select * is not showing the new tables?

Snowflake has a panel that lets you see previously run queries. Can you navigate there and check which queries were run? There should be a COPY into a tmp table, an INSERT into the final table, a few normalisation queries, and a DELETE for the tmp table.
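
If it's easier than the UI, the same information can usually be pulled with Snowflake's INFORMATION_SCHEMA.QUERY_HISTORY table function (assuming your role can see the relevant history), for example:

-- Sketch: list recent queries and their status/error, newest first.
SELECT start_time, query_text, execution_status, error_message
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
ORDER BY start_time DESC
LIMIT 50;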

royt-via commented 3 years ago

Thanks @davinchia! I was able to locate the data somewhere else. Opened another ticket for it.

davinchia commented 3 years ago

Sounds good. I'm going to close this for now since it seems that the underlying issue is solved. The work to properly load Postgres JSONB columns will be tracked as part of #1006.

Feel free to reopen if you see this again!