airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Destination Redshift: normalization result too long for type varchar #14441

Closed: validumitru closed this issue 9 months ago

validumitru commented 2 years ago

Environment

Current Behavior

Normalization of tickets sync into Redshift is failing with the error:

Database Error in model tickets_scd (models/generated/airbyte_incremental/scd/zendesk/tickets_scd.sql)
2022-07-06 08:52:11 normalization > 08:52:07.635374 [error] [MainThread]:   Invalid input
2022-07-06 08:52:11 normalization > 08:52:07.635591 [error] [MainThread]:   DETAIL:  
2022-07-06 08:52:11 normalization > 08:52:07.635809 [error] [MainThread]:     -----------------------------------------------
2022-07-06 08:52:11 normalization > 08:52:07.636024 [error] [MainThread]:     error:  Invalid input
2022-07-06 08:52:11 normalization > 08:52:07.636236 [error] [MainThread]:     code:      8001
2022-07-06 08:52:11 normalization > 08:52:07.636448 [error] [MainThread]:     context:   CONCAT() result too long for type varchar(65535)
2022-07-06 08:52:11 normalization > 08:52:07.636660 [error] [MainThread]:     query:     962421
2022-07-06 08:52:11 normalization > 08:52:07.636863 [error] [MainThread]:     location:  string_ops.cpp:110
2022-07-06 08:52:11 normalization > 08:52:07.637019 [error] [MainThread]:     process:   query0_127_962421 [pid=27739]

It looks like the description field in the ticket is probably longer than 65k bytes.

Expected Behavior

Could the description field be a SUPER type?

alafanechere commented 2 years ago

Hey @validumitru, in the Zendesk ticket stream schema the description field is declared as a string. The normalization process maps the string type to a VARCHAR type on Redshift. I think normalization currently supports the SUPER type for Redshift, but only for object fields in the schema, not strings. We could try to make the normalization smarter so that it defines the SUPER type for long strings, but I'm not sure to what extent that's feasible. @grubberr I saw you contributed to the SUPER support in normalization (https://github.com/airbytehq/airbyte/pull/12064). Do you have an opinion on this topic?
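
For illustration, a minimal sketch of what that mapping produces on the Redshift side (the table and column names here are made up, not the actual generated DDL): string fields land in VARCHAR columns capped at 65535 bytes, while object fields get SUPER.

CREATE TABLE zendesk.tickets_example (
    id          BIGINT,
    description VARCHAR(65535), -- JSON schema "string" -> VARCHAR, limited to 65535 bytes
    fields      SUPER           -- JSON schema "object" -> SUPER
);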

validumitru commented 2 years ago

Thanks for looking at this. I managed to find some of the tickets that have long descriptions in Zendesk, and it seems like the descriptions are actually 65535 bytes (Zendesk might cut everything over 65k). So I'm wondering why this error occurs; as you can see, the error says something about CONCAT(), so I'm wondering if the normalization attempts a CONCAT and then the result is too big?

A setting that would cut the strings at 65k would also work for us in this case; just something that would allow us to do the sync and wouldn't fail the whole sync because of a few records.

grubberr commented 2 years ago

@alafanechere yes, it works as you described: the SUPER type is used only for objects. In theory I think it's possible to use the SUPER type for strings longer than 65k.

Whether that's a good idea in general, I don't know.

alafanechere commented 2 years ago

CONCAT and then the result is too big ?

I think CONCAT is used to concatenate the values of all fields and compute a hash of them to use in the _hashid column. @validumitru could you please share your full sync logs? We might find more details about the context in which this error is raised.
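
Roughly, the generated expression looks like the sketch below (column and table names are illustrative, not the actual generated SQL): every field is cast to text and concatenated before hashing, so it is the intermediate CONCAT result, not any single column, that has to fit in VARCHAR(65535).

select
    md5(cast(
        coalesce(cast(id as text), '') || '-' ||
        coalesce(cast(description as text), '') || '-' ||
        coalesce(cast(updated_at as text), '')
    as text)) as _airbyte_tickets_hashid,
    tmp.*
from some_staging_table tmp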

alafanechere commented 2 years ago

@validumitru I found a duplicate issue (https://github.com/airbytehq/airbyte/issues/13734); downgrading the Redshift connector to 0.3.28 could be a workaround for you.

validumitru commented 2 years ago

@alafanechere Thanks again for looking into this. I'm attaching the logs with the error. Good to know downgrading would be a workaround, even though it sounds a bit scary to be stuck with that version... logs-338.txt

ingo-m commented 2 years ago

FYI, I encounter the same problem (CONCAT() result too long for type varchar(65535)) with MySQL to Redshift, after upgrading to Redshift destination connector version 0.3.46 (previously using 0.3.25 without this problem).

validumitru commented 2 years ago

Can confirm this is also happening with a BigQuery -> Redshift connection. Is there an ETA on this one?

natalyjazzviolin commented 2 years ago

Hey, @validumitru! @alafanechere has switched to a different team and is no longer monitoring this issue. I'll work on getting you an ETA early next week!

cybermaxs commented 2 years ago

Hey, I have the same issue with the Marketo source. Please find the logs: logs-557.txt. As said before, I noticed that it's not possible to query the marketo_leads_stg table because of the unique id column. I suppose this technique has trouble when the stream has a lot of long fields.

danielmay-whoop commented 2 years ago

I'm having this issue with Salesforce source connected to Redshift.

@natalyjazzviolin did you ever come up with an ETA?

natalyjazzviolin commented 2 years ago

@danielmay-whoop no ETA yet, but the connector team is aware of the issue!

osalloum commented 2 years ago

If the Redshift connector allowed an init statement in the connection configuration, one could just enable string truncation to work around the issue:

SET json_parse_truncate_strings=ON;

jan-benisek commented 2 years ago

Also encountered the issue:

2022-08-31 12:59:47 normalization > -- depends_on: __dbt__cte__invoicesdb_ab2
2022-08-31 12:59:47 normalization > select
2022-08-31 12:59:47 normalization >     md5(cast(coalesce(cast(ip as text), '') || '-' || ... <LOT OF COLUMNS> .. || '-' ||  coalesce(cast(country as text), '') as text)) as _airbyte_invoicesdb_hashid,
2022-08-31 12:59:47 normalization >     tmp.*
2022-08-31 12:59:47 normalization > from __dbt__cte__invoicesdb_ab2 tmp

Basically, during normalisation in dbt, Airbyte adds a column _airbyte_invoicesdb_hashid which is a concat of all the fields. And if you have too many fields, you will eventually hit the limit. (Airbyte 0.39.35, from MySQL to Redshift)

osalloum commented 2 years ago

json_parse_truncate_strings would still fix the problem; the hash would be inaccurate, but at least it would work.

you can test it out using

SELECT JSON_SERIALIZE(JSON_PARSE('[10001,10002,"a' || repeat('a', 70000) || 'bc"]'));

vs

SET json_parse_truncate_strings=ON;
SELECT JSON_SERIALIZE(JSON_PARSE('[10001,10002,"a' || repeat('a', 70000) || 'bc"]'));

osalloum commented 2 years ago

In any case, it would be super helpful to be able to set other init statements like datestyle, extra_float_digits, etc.: https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_ConfigurationRef.html#t_Modifying_the_default_settings
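
For context, the kind of session-level init statements being requested would look something like this (the values are only examples):

SET datestyle TO 'ISO, MDY';
SET extra_float_digits TO 2;
SET json_parse_truncate_strings=ON;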

validumitru commented 2 years ago

This is such a showstopper for us: 3 of our connectors (Zendesk, Hubspot and Intercom) are all affected by this error and there is no way to proceed.

ingo-m commented 2 years ago

This is such a showstopper for us: 3 of our connectors (Zendesk, Hubspot and Intercom) are all affected by this error and there is no way to proceed.

Same here. We gave up on trying to work around it and have switched to a different solution (not involving Airbyte) for now. But I would like to switch back to Airbyte if it becomes more stable.

natalyjazzviolin commented 2 years ago

@danielmay-whoop @ingo-m @validumitru @jan-benisek @osalloum Sorry to hear that this has impacted so many workspaces! There are no immediate plans to fix this issue as the team is currently focused on database connectors. This is something we can consider as part of planning Q4.

If you are able to submit a PR, you are of course welcome to.

mattyweb commented 2 years ago

hi @natalyjazzviolin, any updates here? It sounded like the Airbyte team was going to work on this in Q4? My company is blocked due to this issue. Getting the Redshift destination fully functional would be at the top of our wish list.

In the meantime, what is the official guidance? To downgrade the Redshift connector? I see someone saying version 0.2.28 and someone else saying 0.3.28. Which should we use? And is there guidance on how to do a connector downgrade?

Thx.

mattyweb commented 2 years ago

As @jan-benisek says, the problem seems to be in the hashid generation on the _stg table during normalization. I have data partially loaded there, and if I query the table without the _hashid column it works fine. Apparently, it needs to concat all columns in a row together as a VARCHAR before hashing, and Redshift caps VARCHAR at 65535 bytes.

SELECT MD5(CAST(((((((((((((((((((((((((((((((((((COALESCE(CAST("tmp"."id" AS VARCHAR), CAST('' AS VARCHAR)) || CAST('-' AS VARCHAR)) || COALESCE("tmp"."title", CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CAST("tmp"."user_id" AS VARCHAR), CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CAST("tmp"."version" AS VARCHAR), CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE("tmp"."location", CAST('' AS VARCHAR))) || 
...lots of other stuff...
 || COALESCE(CAST("tmp"."created_at" AS VARCHAR), CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CASE WHEN "tmp"."is_deleted" THEN CAST('true' AS VARCHAR) ELSE CAST('false' AS VARCHAR) END, CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CASE WHEN "tmp"."is_private" THEN CAST('true' AS VARCHAR) ELSE CAST('false' AS VARCHAR) END, CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CAST("tmp"."updated_at" AS VARCHAR), CAST('' AS VARCHAR))) AS VARCHAR)) AS "_airbyte_mylog_hashid", ...more columns

As it happens, I don't need all my source columns. Something that would make this less of a problem would be the ability to choose which columns to sync (https://github.com/airbytehq/airbyte/issues/9771), but until that's an option I can at least create a view in the source DB that trims my table down to just the columns I need and sync from that.

But this doesn't help people who are syncing from sources they don't control (e.g. Zendesk). A potential workaround I can see for that case is manually editing the _hashid logic on the _stg table after it's created to remove some columns from the hash, as in the sketch below. But that's pretty hacky.
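
As a rough sketch of that hacky approach (assuming you maintain a custom dbt normalization project; the table and column names are placeholders), the _hashid expression in the _stg model could simply leave the oversized columns out of the hash:

select
    md5(cast(
        coalesce(cast(id as text), '') || '-' ||
        coalesce(cast(updated_at as text), '')
        -- long text columns (e.g. description) deliberately left out of the hash
        -- so that the concatenated value stays under 65535 bytes
    as text)) as _airbyte_mylog_hashid,
    tmp.*
from __dbt__cte__mylog_ab2 tmp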

Anyway, I think Airbyte needs to figure out the solution. Allowing only a subset of columns to be selected from a source would definitely help. Otherwise, maybe it could detect this hash problem and just trim? Or allow users to decide which columns to include in the hash?

Would love to know if there are other workarounds I'm not thinking of too.

etsybaev commented 1 year ago

New onCall issue seems to be related to this issue https://github.com/airbytehq/alpha-beta-issues/issues/711

RobinPicard commented 1 year ago

New onCall issue seems to be related to this issue https://github.com/airbytehq/alpha-beta-issues/issues/711

The link is broken. Any news about this issue? It makes Airbyte completely unusable for us.

willycerezo commented 1 year ago

Same here: we have a problem pulling from MySQL that fails when trying to save into a SUPER field of a Redshift table. Do you have any updates on this? Thanks in advance, and regards.

edgao commented 1 year ago

I think this issue looks similar (possibly identical) to https://github.com/airbytehq/airbyte/issues/16412? There's been some discussion there, though we're still figuring out how to actually make the change in a way that doesn't break existing users' data.

Also, column selection is something we're actively working on, which sounds like it would be a workaround for some of you.

willycerezo commented 1 year ago

Thanks for the info @edgao. Do you have an ETA for the column selection? No rush, and thanks again. Regards

edgao commented 1 year ago

ETA about the column selection

we don't have one right now, no

erichartono commented 1 year ago

Column selection would be a great feature as a workaround to this issue. Hope we can have this feature soon in Q1 2023.

validumitru commented 1 year ago

Still no workaround here?

osalloum commented 1 year ago

The fix is probably going to come from AWS itself, by increasing the VARCHAR max size, sooner than a fix is released here.

For now, I am synchronizing to AWS S3 and using dbt on Glue to build the deduplicated datasets via Hudi tables.

Auric-Manteo commented 1 year ago

I am running into the same issues with multiple databases. I would love to help with this.

To better understand the problem space: what is the hashid used for in the first place? Maybe it is not necessary at all?

tobiascadee commented 1 year ago

Also experiencing this problem with the Intercom source connector -> Redshift. It would be great if one of the solutions mentioned above were implemented soon!

williamlinckdocket commented 1 year ago

Hi everyone. Just got this error in a Zendesk <> Redshift connection. Any expectations about fixing it?

FredericoCoelhoNunes commented 1 year ago

Just got this error when using a Metabase -> Redshift connection. The error message appears to be truncated both in the Airbyte logs and on Redshift, so I'm not sure how to diagnose/fix the issue. Any ideas?

Error:

JSON_PARSE() error: String value exceeds the max size of 65535 bytes: (119205 bytes skipped) ":null}}}}]","collection (136 bytes skipped)

ingo-m commented 1 year ago

@FredericoCoelhoNunes also have a look at this: https://github.com/airbytehq/airbyte/issues/16412

This is an old issue and so far there's no indication that it will be resolved. Personally, I would recommend not using Airbyte if you're working with a Redshift destination. There are workarounds, but if you don't have control over your source data, you're probably out of luck: you will run into unexpected and unpredictable errors. Connections that work at first can break when new data exceeding the character limit is added to the source.

Auric-Manteo commented 1 year ago

@FredericoCoelhoNunes With the new version of Airbyte you can at least filter out columns that are too long before they get sent to Redshift. So if you don't need all the columns, that would probably be the best solution. The other option is to maintain your own dbt normalization, which is very tedious.

FredericoCoelhoNunes commented 1 year ago

@Auric-Manteo @ingo-m Thanks for sharing your experience! I first need to figure out which column is the problem, as I was not able to find it in the logs; but it's possible that we might be able to cut it from the replication.
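
One possible way to narrow down the offending column, assuming the staging data is at least partially loaded (as in @mattyweb's case above) and that you swap in your own table and column names, is to compare the byte lengths of the candidate text columns directly:

select
    max(octet_length(description)) as description_bytes,
    max(octet_length(body))        as body_bytes,
    max(octet_length(notes))       as notes_bytes
from my_stream_stg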

rohaldb commented 12 months ago

Having the same issue with Shopify; going to use Fivetran until this is fixed.

evantahler commented 9 months ago

This will be handled soon by https://github.com/airbytehq/airbyte/issues/28541

edgao commented 9 months ago

And for those of you who were having issues with a concat result being too long, rather than any single column value being too long: try out Redshift DV2! It rewrites normalization to avoid the concat entirely.

meyerovb commented 8 months ago

@edgao @evantahler will #28541 fix the issue that a specific SUPER field cannot exceed 65k characters? I'm having issues with Klaviyo email templates (the HTML template string is over 65k characters).

edgao commented 8 months ago

@meyerovb yep! that's the exact problem it's solving.

meyerovb commented 8 months ago

@meyerovb yep! that's the exact problem it's solving.

Super… ETA? I had to make a Lambda job for now that reprocesses the S3 target files before I read them in Spectrum.

edgao commented 8 months ago

No exact ETA as yet, but we're actively working on it; most likely in the next couple of months. There are a couple of dependent things we need to build first (destination-side state management for better migrations, airbyte_meta in raw tables for raw data lineage / error tracking).