airbytehq / airbyte

[destination-redshift] Statement is too large exception #35950

Open vamsikurre opened 5 months ago

vamsikurre commented 5 months ago

Connector Name

destination-redshift

Connector Version

2.1.7

What step the error happened?

During the sync

Relevant information

When a table has many columns, even the 10k-record partitioned insert hits Redshift's query size limit. The full insert statement was also printed to the log multiple times, and the log grew to 600 MB.
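(For scale: the failing statement in the log below is 25,914,038 bytes. Assuming it carried a full 10,000-record chunk, each serialized record averages roughly 2.6 KB, while the 16,777,216-byte cap allows only about 1.6 KB per record at that fixed chunk size.)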

Relevant log output

2024-03-06 09:31:29 destination > org.jooq.exception.DataAccessException: SQL [insert into "airbyte_internal"."datajump_schema_raw__stream_gl_journallinepvo" (_airbyte_raw_id, _airby....
2024-03-06 09:31:41 destination >  at org.jooq_3.16.23.POSTGRES.debug(Unknown Source) ~[?:?]
2024-03-06 09:31:41 destination >  at org.jooq.impl.Tools.translate(Tools.java:3130) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  at org.jooq.impl.DefaultExecuteContext.sqlException(DefaultExecuteContext.java:727) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:343) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  at org.jooq.impl.AbstractDelegatingQuery.execute(AbstractDelegatingQuery.java:115) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  at io.airbyte.integrations.destination.redshift.operations.RedshiftSqlOperations.lambda$insertRecordsInternalV2$0(RedshiftSqlOperations.java:154) ~[io.airbyte.airbyte-integrations.connectors-destination-redshift-0.50.48.jar:?]
2024-03-06 09:31:41 destination >  at io.airbyte.cdk.db.jdbc.DefaultJdbcDatabase.execute(DefaultJdbcDatabase.java:47) ~[airbyte-cdk-core-0.20.0.jar:?]
2024-03-06 09:31:41 destination >  at io.airbyte.integrations.destination.redshift.operations.RedshiftSqlOperations.insertRecordsInternalV2(RedshiftSqlOperations.java:107) ~[io.airbyte.airbyte-integrations.connectors-destination-redshift-0.50.48.jar:?]
2024-03-06 09:31:41 destination >  at io.airbyte.cdk.integrations.destination.jdbc.JdbcSqlOperations.insertRecords(JdbcSqlOperations.java:206) ~[airbyte-cdk-db-destinations-0.20.0.jar:?]
2024-03-06 09:31:41 destination >  at io.airbyte.cdk.integrations.destination.jdbc.JdbcBufferedConsumerFactory.lambda$recordWriterFunction$2(JdbcBufferedConsumerFactory.java:209) ~[airbyte-cdk-db-destinations-0.20.0.jar:?]
2024-03-06 09:31:41 destination >  at io.airbyte.cdk.integrations.destination.jdbc.JdbcInsertFlushFunction.flush(JdbcInsertFlushFunction.java:25) ~[airbyte-cdk-db-destinations-0.20.0.jar:?]
2024-03-06 09:31:41 destination >  at io.airbyte.cdk.integrations.destination_async.FlushWorkers.lambda$flush$1(FlushWorkers.java:172) ~[airbyte-cdk-core-0.20.0.jar:?]
2024-03-06 09:31:41 destination >  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
2024-03-06 09:31:41 destination >  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
2024-03-06 09:31:41 destination >  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
2024-03-06 09:31:41 destination >  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
2024-03-06 09:31:41 destination >  at java.base/java.lang.Thread.run(Thread.java:833) [?:?]
2024-03-06 09:31:41 destination > Caused by: com.amazon.redshift.util.RedshiftException: ERROR: Statement is too large. Statement Size: 25914038 bytes. Maximum Allowed: 16777216 bytes
2024-03-06 09:31:41 destination >  at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2613) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2281) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1878) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:375) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeInternal(RedshiftStatementImpl.java:515) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.jdbc.RedshiftStatementImpl.execute(RedshiftStatementImpl.java:436) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeWithFlags(RedshiftStatementImpl.java:377) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeCachedSql(RedshiftStatementImpl.java:363) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeWithFlags(RedshiftStatementImpl.java:340) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.amazon.redshift.jdbc.RedshiftStatementImpl.execute(RedshiftStatementImpl.java:330) ~[redshift-jdbc42-2.1.0.23.jar:?]
2024-03-06 09:31:41 destination >  at com.zaxxer.hikari.pool.ProxyStatement.execute(ProxyStatement.java:94) ~[HikariCP-5.1.0.jar:?]
2024-03-06 09:31:41 destination >  at com.zaxxer.hikari.pool.HikariProxyStatement.execute(HikariProxyStatement.java) ~[HikariCP-5.1.0.jar:?]
2024-03-06 09:31:41 destination >  at org.jooq.tools.jdbc.DefaultStatement.execute(DefaultStatement.java:102) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  at org.jooq.impl.SettingsEnabledPreparedStatement.execute(SettingsEnabledPreparedStatement.java:227) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:428) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  at org.jooq.impl.AbstractDMLQuery.execute(AbstractDMLQuery.java:961) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:329) ~[jooq-3.16.23.jar:?]
2024-03-06 09:31:41 destination >  ... 13 more

marcosmarxm commented 5 months ago

I added the issue to the destination backlog. @evantahler, if I'm not mistaken this is actually being worked on; can you confirm?

evantahler commented 5 months ago

No. This is a Redshift limitation on the size of the SQL statement it allows (not on the size of the record content). The only solution here would be to de-select some columns, or to split the sync in two, producing two tables.

vamsikurre commented 5 months ago

As mentioned in the comment at the line below, I think this should be destination-aware partitioning rather than fixed 10k-record chunks:

https://github.com/airbytehq/airbyte/blob/3339166dbdb898bf903fff221c0272e77101e029/airbyte-integrations/connectors/destination-redshift/src/main/java/io/airbyte/integrations/destination/redshift/operations/RedshiftSqlOperations.java#L115

This works when I reduce the chunk size to a lower value. I feel this has to be smarter, based on the underlying database's limits.
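A minimal sketch of what such destination-aware chunking could look like: batch by serialized byte size instead of a fixed record count. The class, constants, and safety margin below are illustrative assumptions, not the connector's actual code; the real change would live around the chunking in the RedshiftSqlOperations line linked above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: chunk serialized VALUES tuples by byte size rather than
 * a fixed 10k-record count, staying under Redshift's 16 MiB statement limit.
 */
public class SizeAwareChunker {

  // Redshift's maximum SQL statement size: 16,777,216 bytes (16 MiB).
  private static final int MAX_STATEMENT_BYTES = 16 * 1024 * 1024;

  // Headroom for the INSERT prefix, column list, commas, and quoting.
  private static final int SAFETY_MARGIN_BYTES = 1024 * 1024;

  public static List<List<String>> chunkBySize(final List<String> valueTuples) {
    final int limit = MAX_STATEMENT_BYTES - SAFETY_MARGIN_BYTES;
    final List<List<String>> chunks = new ArrayList<>();
    List<String> current = new ArrayList<>();
    long currentBytes = 0;

    for (final String tuple : valueTuples) {
      final int tupleBytes = tuple.getBytes(StandardCharsets.UTF_8).length;
      // Close the current chunk before it would push the statement over the limit.
      if (!current.isEmpty() && currentBytes + tupleBytes > limit) {
        chunks.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(tuple);
      currentBytes += tupleBytes;
    }
    if (!current.isEmpty()) {
      chunks.add(current);
    }
    return chunks;
  }
}
```

Sized chunks like these would keep each generated INSERT under the 16,777,216-byte ceiling reported in the error, however wide the rows are.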

evantahler commented 5 months ago

Are you using S3 staging files to load data into Redshift?

vamsikurre commented 5 months ago

The uploading method is Standard.

evantahler commented 5 months ago

Can you switch to S3? That is likely to work better for larger datasets.
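(Note: assuming the connector's usual staging behavior, S3 staging writes each batch to files in S3 and loads them with Redshift's COPY command, so record data never travels inside an INSERT statement and the 16 MiB statement limit does not apply. The exact configuration fields depend on the connector version.)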