Open koderka2020 opened 3 weeks ago
Your "retry" table property has to be equivalent to the highest parallelism you will see. A REST catalog shouldn't have this issue though since it could queue up changes on the server rather than failing and retrying on the client.
Now, if you aren't going to use a REST catalog, you either have to keep bumping up the retry count (probably up to the level of parallelism) or start batching your writes on the client side. I would probably use a streaming engine for that; see the sketch below.
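Here is a rough sketch of the client-side batching idea with Spark Structured Streaming: many producers push to a single queue (Kafka here, purely as an example), and one streaming job commits to the Iceberg table once per trigger instead of thousands of jobs committing concurrently. The broker address, topic, checkpoint path, and table name are all placeholders, and the job assumes the Iceberg catalog is configured in the Spark conf:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg catalog (e.g. a Nessie or Glue catalog) is configured
# via spark conf; this builder is intentionally minimal.
spark = (
    SparkSession.builder
    .appName("iceberg-funnel-writer")
    .getOrCreate()
)

# Read the incoming events from a queue instead of having each producer
# commit to Iceberg directly.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

# One Iceberg commit per trigger interval, regardless of how many producers
# are writing to the queue.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
    .toTable("catalog.db.events")
)

query.awaitTermination()
```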
Hi Iceberg team, I've been searching for some time for information on the maximum insert rate per second or per minute on an Iceberg table. We've been ingesting large amounts of data (in tandem with Trino and Nessie) by running AWS Glue jobs concurrently. These jobs are failing at a pretty high rate ("SystemExit: ERROR: An error occurred while calling o213.append.") even with increased "retry" table property settings (25 retries, min 1000 ms wait, max 1500 ms wait). If the parallelism is too high (1000-2500 concurrently running jobs writing a total of about 100k rows / 500 MB to Iceberg within 30 minutes), would you recommend some way around it? I was thinking of creating a staging table in Postgres, or creating multiple staging tables in Iceberg to distribute the load, and then, after migrating the data to the main Iceberg table at the end, just dropping the staging tables. What are your thoughts on that?
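For concreteness, a rough sketch of the multiple-staging-tables idea described above, assuming `spark` is a SparkSession with the Iceberg catalog configured. Each Glue job appends to its own staging table (so there is no commit contention), and a single follow-up job folds the staging tables into the main table and drops them. The catalog, table, and job-id names are hypothetical:

```python
# In each concurrent Glue job: write only to a per-job staging table,
# so jobs never compete for commits on the main table.
def write_to_staging(df, job_id):
    staging = f"catalog.db.events_staging_{job_id}"
    df.writeTo(staging).createOrReplace()

# In one serial consolidation job, run after all Glue jobs have finished:
# move each staging table's rows into the main table, then drop it.
def consolidate(spark, job_ids):
    for job_id in job_ids:
        staging = f"catalog.db.events_staging_{job_id}"
        spark.sql(f"INSERT INTO catalog.db.events SELECT * FROM {staging}")
        spark.sql(f"DROP TABLE {staging}")
```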