Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Running MERGE INTO with more than one WHEN condition fails if the number of columns in the target table is > 321 #10294

Open andreaschiappacasse opened 1 month ago

andreaschiappacasse commented 1 month ago

Apache Iceberg version

None

Query engine

Athena (engine v3)

Please describe the bug 🐞

Hello everyone, today my team ran into a very strange bug using Iceberg via Athena. I'll describe the steps we used to reproduce the error below:

1. We create an Iceberg table with an "id" column and 321 other randomly named string columns - in the example below we use awswrangler to create the table, but the same happens when the table is created using Athena directly.

import awswrangler as wr
import pandas as pd
import random, string

NUM_COLS=322

def get_random_string(length):
    letters = string.ascii_lowercase
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str

columns = ['id']+[get_random_string(5) for i in range(NUM_COLS-1) ]
data = pd.DataFrame(data=[columns], columns=columns)

wr.athena.to_iceberg(
    data,
    workgroup="my-workgroup",
    database="my_database",
    table="iceberg_limits_322",
    table_location="s3://my_bucket/iceberg_limits",
)

2. We then run the following query in Athena to insert a value:

MERGE INTO my_database.iceberg_limits_322 AS existing
USING (
    SELECT 'something' AS id
) AS new ON existing.id = new.id
WHEN NOT MATCHED
THEN INSERT (id) VALUES (new.id)
WHEN MATCHED THEN DELETE

3. This results in the error:

[ErrorCode: INTERNAL_ERROR_QUERY_ENGINE] Amazon Athena experienced an internal error while executing this query. Please contact AWS support for further assistance. You will not be charged for this query. We apologize for the inconvenience.

Notice that the error only occurs when multiple WHEN clauses are used in the MERGE INTO query! If only one WHEN clause is used (just to insert or just to delete records), everything works fine and the table can be used normally.
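For contrast, the single-WHEN variant of the same query does succeed on the 322-column table. A small sketch that builds it as a string (identifiers are the placeholders from the repro above):

```python
TABLE = "my_database.iceberg_limits_322"

# Single-WHEN variant of the repro query; this form did not trigger the error.
single_when_merge = f"""MERGE INTO {TABLE} AS existing
USING (SELECT 'something' AS id) AS new
ON existing.id = new.id
WHEN NOT MATCHED THEN INSERT (id) VALUES (new.id)"""

print(single_when_merge.count("WHEN"))  # 1
```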

We can replicate this behaviour on multiple AWS accounts and with different tables/databases/s3 locations.

After trying with different numbers of columns, we consistently found that 321 is the maximum number of table columns for which the query succeeds. Everything works fine at or below this threshold.
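The threshold can also be found mechanically rather than by trial and error. A hedged sketch of that search, assuming a merge_succeeds(num_cols) predicate that would create a table with that many columns and run the two-WHEN MERGE against Athena (stubbed here with the observed limit of 321):

```python
def find_max_columns(merge_succeeds, lo=1, hi=1000):
    """Binary-search the largest column count for which the MERGE still works,
    assuming successes form a contiguous range starting at lo."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if merge_succeeds(mid):
            lo = mid        # mid columns still work; look higher
        else:
            hi = mid - 1    # mid columns fail; look lower
    return lo

# Stub predicate standing in for the real repro against Athena.
observed = find_max_columns(lambda n: n <= 321)
print(observed)  # 321
```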

andreaschiappacasse commented 1 month ago

Possibly something similar to https://github.com/trinodb/trino/issues/15848?

andreaschiappacasse commented 1 month ago

Update: it seems that even a MERGE INTO with a single WHEN NOT MATCHED THEN INSERT clause fails, provided the table is wide enough (in our case 633 columns).

krishan711 commented 4 days ago

I have the same issue. I was hoping that deleting and then inserting in separate statements would work (just to test it), but even this fails with too many columns:

MERGE INTO <table> target
USING <temp_table> source
ON target.source_id = source.source_id
WHEN MATCHED THEN DELETE
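One untested workaround worth trying is to avoid MERGE entirely and express the two branches as a plain DELETE followed by a plain INSERT. A sketch that just builds those SQL strings (the table name, source subquery, and join key are placeholders taken from the examples above):

```python
TARGET = "my_database.iceberg_limits_322"
SOURCE = "(SELECT 'something' AS id)"  # or a staging table
KEY = "id"

# Rough equivalent of WHEN MATCHED THEN DELETE
delete_sql = (
    f"DELETE FROM {TARGET} "
    f"WHERE {KEY} IN (SELECT {KEY} FROM {SOURCE} AS new)"
)

# Rough equivalent of WHEN NOT MATCHED THEN INSERT (id);
# the NOT IN guard is redundant right after the DELETE, but kept for safety
insert_sql = (
    f"INSERT INTO {TARGET} ({KEY}) "
    f"SELECT {KEY} FROM {SOURCE} AS new "
    f"WHERE {KEY} NOT IN (SELECT {KEY} FROM {TARGET})"
)

print(delete_sql)
print(insert_sql)
```

Whether this sidesteps the column-count limit is unverified; it simply avoids the multi-WHEN MERGE code path that triggers the error.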