airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

normalization: handle records > 1MB for redshift SUPER type #14573

Closed alafanechere closed 1 year ago

alafanechere commented 2 years ago

Tell us about the problem you're trying to solve

Normalization for Redshift generates SUPER objects whose size exceeds the 1MB limit:

2022-07-06 06:17:18 normalization > 06:17:18      error:  Invalid input
2022-07-06 06:17:18 normalization > 06:17:18      code:      8001
2022-07-06 06:17:18 normalization > 06:17:18      context:   SUPER value exceeds export size.
2022-07-06 06:17:18 normalization > 06:17:18      query:     7075389
2022-07-06 06:17:18 normalization > 06:17:18      location:  partiql_export.cpp:9
2022-07-06 06:17:18 normalization > 06:17:18      process:   query0_91_7075389 [pid=32005]

Describe the solution you’d like

Normalization should either explicitly drop records larger than 1MB or restructure them so that they fit under 1MB.
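To illustrate the "drop" option, here is a minimal sketch, not Airbyte's actual normalization code. It assumes a record's size can be approximated by the UTF-8 byte length of its compact JSON serialization, which may not match how Redshift measures a SUPER value. The names `SUPER_LIMIT_BYTES`, `serialized_size`, and `split_by_size` are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple

# 1MB limit as quoted in this issue; Redshift's own accounting may differ (assumption).
SUPER_LIMIT_BYTES = 1024 * 1024


def serialized_size(record: Dict[str, Any]) -> int:
    """Return the UTF-8 byte length of the record serialized as compact JSON."""
    text = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def split_by_size(
    records: Iterable[Dict[str, Any]], limit: int = SUPER_LIMIT_BYTES
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Partition records into (kept, dropped) by their serialized size."""
    kept: List[Dict[str, Any]] = []
    dropped: List[Dict[str, Any]] = []
    for record in records:
        if serialized_size(record) <= limit:
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


# Hypothetical example records: one small, one with a ~2MB string field.
small = {"id": 1, "body": "ok"}
large = {"id": 2, "body": "x" * (2 * 1024 * 1024)}
kept, dropped = split_by_size([small, large])
```

Returning the dropped records rather than silently discarding them would let the sync log or quarantine them, so that the data loss is at least explicit.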

Seeing a similar issue with on-call airbytehq/alpha-beta-issues#697

SUPER type from Redshift docs

Related forum topic

marcosmarxm commented 2 years ago

Zendesk ticket #1473 has been linked to this issue.

marcosmarxm commented 2 years ago

Comment made from Zendesk by Augustin on 2022-07-11 at 13:03:

I created an [issue](https://github.com/airbytehq/airbyte/issues/14573) on our repo for this error. Please subscribe to receive updates.

validumitru commented 2 years ago

This error is also happening while trying to sync engagements in the Hubspot connector.

jan-benisek commented 1 year ago

I encountered the same error today (Airbyte 0.40.18, connector version 0.2.3). Any idea when this will be fixed? 🙏

jena-binay commented 1 year ago

I'm on Airbyte 0.4.27 getting the same error on Jira connector (0.3.3)

validumitru commented 1 year ago

This error just started breaking the Hubspot sync for us today :(

cidraljunior commented 1 year ago

I am getting the same error. Any fix?

pranasziaukas commented 1 year ago

Running into this while syncing HubSpot Companies and Contacts into Redshift.

josephbrownskilljar commented 1 year ago

This is a limitation in Redshift, and a solution is now in pre-release: https://docs.aws.amazon.com/redshift/latest/dg/limitations-super.html

alexandrafetterman commented 1 year ago

I am getting this error while syncing Jira Issues into Redshift.

evantahler commented 1 year ago

Closing this issue as normalization is going away: https://github.com/airbytehq/airbyte/issues/26028

pranasziaukas commented 1 year ago

> normalization is going away

Could you expand a bit by any chance @evantahler?

For example, we had issues with HubSpot (source) records that were flowing to Redshift (destination), and because those records were large JSON objects they'd exceed Redshift's SUPER limit (as far as I understand).

What does the end of normalization imply for the above?

evantahler commented 1 year ago

The problem with large source records which can't fit in the destination still remains, regardless of normalization. We'll need to fix it more generally. We are discussing what to do about it here - https://github.com/airbytehq/airbyte/issues/28541