Hey @validumitru, in the Zendesk ticket stream schema the description field is declared as a string. The normalization process maps the string type to a VARCHAR type on Redshift. I think normalization currently supports the SUPER type for Redshift, but only for object fields in the schema, not strings.
I think we could try to make the normalization smarter so that it defines the SUPER type for long strings, but I'm not sure to what extent that's feasible. @grubberr I saw you contributed the SUPER support in normalization (https://github.com/airbytehq/airbyte/pull/12064). Do you have an opinion on this topic?
Thanks for looking at this. I've managed to find some of the tickets that have long descriptions in Zendesk, and it seems like the descriptions are actually 65535 bytes (Zendesk might actually cut everything over 65k). So I'm wondering why this error occurs. As you can see, the error says something about CONCAT(), so I'm wondering if the normalization attempts to do a CONCAT and the result is then too big?
A setting that would cut the strings at 65k would also work for us in this case. Just something that would allow us to do the sync and wouldn't fail the whole sync because of a few records..
@alafanechere yes, it works as you described - the SUPER type is used only for objects.
In theory I think it's possible - I mean, to use the SUPER type for strings longer than 65k. Whether it's a good idea in general, I don't know.
Regarding "a CONCAT and then the result is too big?" - I think CONCAT is used to concatenate the value of each field and compute a hash of those fields to use in the hashid column. @validumitru could you please share your full sync logs? We might find more details about the context in which this error is raised.
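Roughly, the SQL that normalization generates for the hashid looks like this (a simplified sketch with illustrative table and column names, not the exact dbt output):
-- every field is cast to VARCHAR, concatenated, then hashed; the intermediate
-- concatenation must itself fit in VARCHAR(65535), which is where the
-- "result too long" error would come from
select
    md5(cast(
        coalesce(cast(id as varchar), '') || '-' ||
        coalesce(cast(subject as varchar), '') || '-' ||
        coalesce(cast(description as varchar), '')
    as varchar)) as _airbyte_tickets_hashid,
    tmp.*
from _airbyte_zendesk.tickets_ab2 tmp;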
@validumitru I found a duplicate issue (https://github.com/airbytehq/airbyte/issues/13734); downgrading the Redshift connector to 0.3.28 could be a workaround for you.
@alafanechere Thanks again for looking into this. I'm attaching the logs with the error. Good to know downgrading would be a workaround, even though it sounds a bit scary to be stuck with that version... logs-338.txt
FYI I encountered the same problem (CONCAT() result too long for type varchar(65535)) with MySQL to Redshift, after upgrading to Redshift destination connector version 0.3.46 (previously using 0.3.25 without this problem).
Can confirm this is also happening with a BigQuery -> Redshift connection. Is there an ETA on this one, guys?
Hey, @validumitru! @alafanechere has switched to a different team and is no longer monitoring this issue. I'll work on getting you an ETA early next week!
Hey, I have the same issue with the Marketo source. Please find the logs: logs-557.txt. As said before, I noticed that it's not possible to query the marketo_leads_stg table because of the unique id column. I suppose this technique has trouble when the stream has a lot of long fields.
I'm having this issue with Salesforce source connected to Redshift.
@natalyjazzviolin did you ever come up with an ETA?
@danielmay-whoop no ETA yet, but the connector team is aware of the issue!
If the Redshift connector enabled an init connection configuration, one could just set string truncation to solve the issue:
SET json_parse_truncate_strings=ON;
Also encountered the issue:
2022-08-31 12:59:47 normalization > -- depends_on: __dbt__cte__invoicesdb_ab2
2022-08-31 12:59:47 normalization > select
2022-08-31 12:59:47 normalization > md5(cast(coalesce(cast(ip as text), '') || '-' || ... <LOT OF COLUMNS> .. || '-' || coalesce(cast(country as text), '') as text)) as _airbyte_invoicesdb_hashid,
2022-08-31 12:59:47 normalization > tmp.*
2022-08-31 12:59:47 normalization > from __dbt__cte__invoicesdb_ab2 tmp
Basically, during normalization in dbt, Airbyte is adding a column _airbyte_invoicesdb_hashid which is a concat of all the fields. And if you have too many fields, you will eventually hit the limit. (Airbyte 0.39.35, from MySQL to Redshift.)
json_parse_truncate_strings would still fix the problem; the hash would be inaccurate, but at least it would work. You can test it out using:
SELECT JSON_SERIALIZE(JSON_PARSE('[10001,10002,"a' || repeat('a', 70000) || 'bc"]'));
vs
SET json_parse_truncate_strings=ON;
SELECT JSON_SERIALIZE(JSON_PARSE('[10001,10002,"a' || repeat('a', 70000) || 'bc"]'));
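(With default settings I'd expect the first statement to fail with a "String value exceeds the max size of 65535 bytes" error, while with the setting on the parsed string is truncated and the query succeeds.)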
In any case, it would be super helpful to be able to set other init statements like datestyle, extra_float_digits, etc.: https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_ConfigurationRef.html#t_Modifying_the_default_settings
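For example (just a sketch; the values below are only illustrative, and the parameter names come from that configuration reference):
-- session-level settings one might want to run at connection init
SET json_parse_truncate_strings = ON;
SET datestyle TO 'ISO, MDY';
SET extra_float_digits = 2;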
This is such a show stopper for us, 3 of our connectors (Zendesk, Hubspot and Intercom) are all affected by this error and there is no way to proceed..
Same here, we gave up on trying to work around it and have switched to a different solution (not involving Airbyte) for now. But I would like to revert back to Airbyte if it becomes more stable.
@danielmay-whoop @ingo-m @validumitru @jan-benisek @osalloum Sorry to hear that this has impacted so many workspaces! There are no immediate plans to fix this issue as the team is currently focused on database connectors. This is something we can consider as part of planning Q4.
If you are able to submit a PR, you are of course welcome to.
hi @natalyjazzviolin, any updates here? It sounded like the Airbyte team was going to work on this in Q4? My company is blocked due to this issue. Getting the Redshift destination fully functional would be at the top of our wish list.
In the meantime, what is the official guidance? To downgrade the Redshift connector? I see someone saying version 0.2.28 and someone else saying 0.3.28. Which should we use? And is there guidance on how to do a connector downgrade?
Thx.
As @jan-benisek says, the problem seems to be in the hashid generation on the _stg table during normalization. I have data partially loaded there and if I query the table without the _hashid column it works fine. Apparently, it needs to concat all columns in a row together as a VARCHAR before hashing and Redshift doesn't like long VARCHARs.
SELECT MD5(CAST(((((((((((((((((((((((((((((((((((COALESCE(CAST("tmp"."id" AS VARCHAR), CAST('' AS VARCHAR)) || CAST('-' AS VARCHAR)) || COALESCE("tmp"."title", CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CAST("tmp"."user_id" AS VARCHAR), CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CAST("tmp"."version" AS VARCHAR), CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE("tmp"."location", CAST('' AS VARCHAR))) ||
...lots of other stuff...
|| COALESCE(CAST("tmp"."created_at" AS VARCHAR), CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CASE WHEN "tmp"."is_deleted" THEN CAST('true' AS VARCHAR) ELSE CAST('false' AS VARCHAR) END, CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CASE WHEN "tmp"."is_private" THEN CAST('true' AS VARCHAR) ELSE CAST('false' AS VARCHAR) END, CAST('' AS VARCHAR))) || CAST('-' AS VARCHAR)) || COALESCE(CAST("tmp"."updated_at" AS VARCHAR), CAST('' AS VARCHAR))) AS VARCHAR)) AS "_airbyte_mylog_hashid", ...more columns
As it happens, I don't need all my source columns. Something that would make this less of a problem would be the ability to choose which columns to sync (https://github.com/airbytehq/airbyte/issues/9771) but until that's an option I can at least create a view in the source DB that trims my table down just to the columns I need and sync from that.
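In practice that looks something like this on the source side (a sketch; the view name is made up and the column list is just the short columns from my table):
-- expose only the short columns to Airbyte so the generated hash input stays
-- well under the 65535-byte limit
CREATE VIEW mylog_for_airbyte AS
SELECT id, title, user_id, version, location, created_at, updated_at
FROM mylog;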
But this doesn't help people who are syncing from other sources they don't control (e.g. Zendesk). A potential workaround for this case I can see is manually editing the _hashid logic on the _stg table after it's created to remove some columns from the hash. But that's pretty hacky.
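For the record, that hacky edit would amount to replacing the generated expression with something shorter, e.g. (untested sketch, staging table name made up):
-- hash only a couple of short, identifying columns so the concatenated input
-- stays far below the 65535-byte VARCHAR limit
SELECT
    MD5(CAST(
        COALESCE(CAST("tmp"."id" AS VARCHAR), '') || '-' ||
        COALESCE(CAST("tmp"."updated_at" AS VARCHAR), '')
    AS VARCHAR)) AS "_airbyte_mylog_hashid",
    tmp.*
FROM mylog_stg AS tmp;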
Anyway, I think Airbyte needs to figure out the solution. Allowing only a subset of columns to be selected from a source would definitely help. Otherwise, maybe it could detect this hash problem and just trim? Or allow users to decide which columns to include in the hash?
Would love to know if there are other workarounds I'm not thinking of too.
New onCall issue seems to be related to this issue https://github.com/airbytehq/alpha-beta-issues/issues/711
The link is broken. Any news about this issue? It makes Airbyte completely unusable for us
Same here: we have a problem pulling from a MySQL source that fails when trying to save into a SUPER field of a Redshift table. Do you have any updates on this? Thanks in advance and regards.
I think this issue looks similar (possibly identical) to https://github.com/airbytehq/airbyte/issues/16412? There's been some discussion there, though we're still figuring out how to actually make the change in a way that doesn't break existing users' data.
Though also: column selection is something we're actively working on, which sounds like it would be a workaround for some of you.
Thanks for the info @edgao. Do you have an ETA for the column selection? No rush, and thanks for the info. Regards.
Regarding an "ETA about the column selection" - we don't have one right now, no.
Column selection would be a great feature as a workaround to this issue. Hope we can have this feature soon in Q1 2023.
Still no workaround here? The fix is probably going to come from AWS itself, by increasing the VARCHAR max size, sooner than a fix is released here.
For now, I am synchronizing to AWS S3 and using dbt-glue to build the deduplicated datasets via Hudi tables.
I am running into the same issues with multiple databases. I would love to help with this.
To better understand the problem space: what is the hashid used for in the first place? Maybe it is not necessary at all?
Also experiencing this problem with the Intercom source connector -> Redshift. Would be great if one of the solutions mentioned above would be implemented soon!
Hi everyone. Just got this error in a Zendesk <> Redshift connection. Any expectations on when this will be fixed?
Just got this error when using a Metabase -> Redshift connection.
The error message appears to be truncated both in the Airbyte logs and on Redshift, so I'm not sure how to diagnose/fix the issue. Any ideas?
Error:
JSON_PARSE() error: String value exceeds the max size of 65535 bytes: (119205 bytes skipped) ":null}}}}]","collection (136 bytes skipped)
@FredericoCoelhoNunes also have a look at this: https://github.com/airbytehq/airbyte/issues/16412
This is an old issue and so far there's no indication that it will be resolved. Personally, I would recommend not to use airbyte if you're working with a redshift destination. There are workarounds, but if you don't have control over your source data, you're probably out of luck; you will have unexpected and unpredictable errors. Connections that work at first can break when new data exceeding the character limit is added to the source.
@FredericoCoelhoNunes With the new version of Airbyte you can at least filter out columns that are too long before they get sent to Redshift. So if you don't need all columns that would probably be the best solution. The other solution is to maintain your own dbt normalization which is very tedious.
@Auric-Manteo @ingo-m Thanks for sharing your experience! I need to first figure out which is the problem column, as I was not able to discover it in the logs; but it's possible that we might be able to cut it from the replication.
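I'm thinking something along these lines might help spot the offending field once the records are queryable somewhere (the table and column names below are made up):
-- check which fields are close to the 65535-byte limit
SELECT
    MAX(OCTET_LENGTH(description)) AS max_description_bytes,
    MAX(OCTET_LENGTH(result_metadata)) AS max_result_metadata_bytes
FROM metabase_cards_stg;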
Having the same issue with Shopify - going to use Fivetran until this is fixed.
This will be handled soon by https://github.com/airbytehq/airbyte/issues/28541.
And for those of you who were having issues with a concat result being too long, rather than any single column value being too long: try out Redshift DV2! It rewrites normalization to avoid the concat entirely.
@edgao @evantahler will #28541 fix the issue that a specific SUPER field cannot exceed 65k characters? I'm having issues with Klaviyo email templates (the HTML template string is over 65k characters).
@meyerovb yep! that's the exact problem it's solving.
Super… ETA? I had to make a Lambda job for now that reprocesses the S3 target files before I read them in Spectrum.
No exact ETA as yet, but we're actively working on it - most likely in the next couple of months. There are a couple of dependent things we need to build first (destination-side state management for better migrations, and airbyte_meta in raw tables for raw data lineage / error tracking).
Environment
Current Behavior
Normalization of tickets sync into Redshift is failing with the error:
It looks like it's probably the description field in the ticket that might be longer than 65k bytes.
Expected Behavior
Could the description field be a SUPER type?