meatheadmike opened this issue 2 months ago
I guess the huge resource cost comes from rewriting the parquet files: since each update batch touches almost all of the file groups, it effectively triggers a whole-table rewrite.
A range partitioning strategy that limits the scope of updates to part of the table, instead of the whole table, would help.
I can certainly attempt partitioning again, but doesn't that just exacerbate the file group problem? My last attempt at partitioning made the batches take far too long.
If writing to many file groups truly is the problem, then why does the process go OOM instead of simply taking longer? It seems like something that could be parallelized and sped up, but in my trial increasing cores did not help. Is there a way I can configure this to not use as much RAM?
Is the OOM happening in the driver or the executors? The driver can use a large amount of memory for the file system view / file index; one way to ease that is to enable the RocksDB-based file system view. The executors can OOM when the parquet and delta log records are held in memory, which depends on the parallelism of the executors.
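If it helps, a minimal sketch of enabling the RocksDB-based file system view might look like this; the option names are taken from the config reference as I recall them, so please verify them for your version:

```python
# Sketch only (not from this thread): enabling the RocksDB-backed (embedded key-value
# store) file system view on the writer. Key names assumed from the Hudi config
# reference -- verify them for 1.0.0-beta2.
rocksdb_fsview_options = {
    "hoodie.filesystem.view.type": "EMBEDDED_KV_STORE",                # RocksDB-backed view
    "hoodie.filesystem.view.rocksdb.base.path": "/tmp/hoodie/fsview",  # local scratch dir
}

(df.write.format("hudi")             # df and base_path are placeholders for your job
   .options(**rocksdb_fsview_options)
   .mode("append")
   .save(base_path))
```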
The OOM is happening on the executors. What would you recommend for parallelism? It's already set at 500.
@meatheadmike Can you please share the Spark event logs?
Here you go. I captured the most recent run. I stopped it before the driver completely quit, but you can see where the executors start OOMing.
From the UI perspective it's happening here:
Job Group: SparkUpsertPreppedDeltaCommitActionExecutor
Building workload profile: hudi_sink_metadata
Stage 77 (attempt 1): start at NativeMethodAccessorImpl.java:0
RDDs: UnionRDD (id=162), UnionRDD (id=180)
Update: bumping the driver and executors to 16 GB of memory allowed me to get to 2.5 billion rows. So the OOM can be overcome with more memory, although I'd certainly prefer to run with less memory and more executors.
@meatheadmike Can you check the size of the record index files? How many file groups do you see in the record index?
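Something along these lines could report the total size and file count; this is a rough sketch that assumes the record index lives under `.hoodie/metadata/record_index/` in the table base path, so treat the path as an assumption:

```python
# Rough sketch: total size and file count of the record index partition of the
# metadata table, assuming the standard layout <base_path>/.hoodie/metadata/record_index.
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()

ri_path = jvm.org.apache.hadoop.fs.Path(f"{base_path}/.hoodie/metadata/record_index")
fs = ri_path.getFileSystem(hadoop_conf)

summary = fs.getContentSummary(ri_path)  # recursive size / count from the Hadoop FS API
print(f"record index: {summary.getLength() / 1024**3:.2f} GiB "
      f"across {summary.getFileCount()} files")
```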
@meatheadmike Were you able to check on the same? Do you still face the issue?
I won't be able to check on this for a bit as I'm in the process of rebuilding our lakehouse environment. I'll reply as soon as I get a chance.
Thanks for the update @meatheadmike .
Describe the problem you faced
Hi Folks. I'm trying to get some advice here on how to better deal with a large upsert dataset.
The data has a very wide key space and no great/obvious partitionable columns. What I'm aiming for is a record key with roughly 1.5 billion unique values, and any one of those records could be updated at any time. The pipeline so far is a PySpark-based streaming application. The eventual source will be a Kafka topic, but at the moment it's using a data generator set to emit 10,000 rows per second (specifically, the dbldatagen Python module). The sink is an S3 bucket with a MOR Hudi table. The Spark cluster is hosted on Kubernetes on AWS. I'm using the latest Hudi beta, the latest production Spark (3.5.2), Hive Metastore 3.x, and ZooKeeper for locking (latest production version).

The executors are currently set to 10, with 8 GB of RAM and 1 core each. While I could feasibly increase the RAM on the executors, I previously did that (from 4 GB to 8 GB) and it didn't buy me much. I seem to be topped out at roughly 500 million records (and 500 million distinct values on the record key).
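Roughly, the sink side looks like the sketch below; this is an illustrative reconstruction rather than the actual job, and every name in it is a placeholder:

```python
# Illustrative reconstruction of the pipeline shape: a streaming upsert into a MOR table.
# All names (record_key, precombine_ts, bucket names, paths) are placeholders.
hudi_options = {
    "hoodie.table.name": "hudi_sink",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "record_key",
    "hoodie.datasource.write.precombine.field": "precombine_ts",
}

(source_df.writeStream
    .format("hudi")
    .options(**hudi_options)
    .outputMode("append")
    .option("checkpointLocation", "s3://bucket/checkpoints/hudi_sink")
    .start("s3://bucket/tables/hudi_sink"))
```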
The issue is that while I can get the pipeline to run, it eventually gets to a point where it simply OOMs on the executors during a union in the upsert. I've tried numerous memory / parallelism tuning options to no avail.
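The tuning attempts were along these lines; this is only a sketch, the values are illustrative, and the `hoodie.memory.*` key name should be checked against the config reference for this Hudi version:

```python
# Sketch of the kind of knobs tried; values are illustrative and the hoodie.memory.*
# key name should be double-checked against the config reference.
tuning_options = {
    "hoodie.upsert.shuffle.parallelism": "500",  # the parallelism of 500 mentioned above
    "hoodie.memory.merge.fraction": "0.6",       # share of executor heap for the merge spillable map
}
hudi_options.update(tuning_options)              # merged into the writer options shown earlier
```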
Is there a way to make this work without infinitely scaling memory? Is a wide key space like this a dead end? I did try partitioning by applying an MD5 to the record key and using the first two characters; that gave me 256 unique, more-or-less evenly distributed partitions, but it led to drastically increased batch times.
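The MD5 bucketing was roughly the following (a sketch; column names are placeholders):

```python
from pyspark.sql import functions as F

# Sketch of the MD5-prefix bucketing described above: a two-hex-character bucket
# (256 roughly even partitions) derived from the record key. Column names are placeholders.
bucketed_df = source_df.withColumn(
    "part_bucket", F.md5(F.col("record_key").cast("string")).substr(1, 2)
)

partition_options = {
    "hoodie.datasource.write.partitionpath.field": "part_bucket",
}
```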
Record-level indexing is enabled. That seems to help on the write path. On the read path I've noticed that queries actually take significantly longer when data skipping is enabled. So perhaps my index has grown unmanageable? Maybe that is partially responsible for the OOMs on the write side.
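For reference, the index-related options were along these lines; this is a sketch based on my reading of the config reference rather than the exact config, so the key names are worth double-checking:

```python
# Sketch of the record-level index and data skipping options as I understand the key
# names from the Hudi config reference -- worth double-checking for 1.0.0-beta2.
record_index_write_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",   # record-level index in the metadata table
    "hoodie.index.type": "RECORD_INDEX",
}

record_index_read_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.enable.data.skipping": "true",           # the data skipping mentioned above
}

read_df = (spark.read.format("hudi")
             .options(**record_index_read_options)
             .load("s3://bucket/tables/hudi_sink"))
```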
To Reproduce
Steps to reproduce the behavior:
hudi config:
spark config:
Expected behavior
Pipeline should run without OOM
Environment Description
Hudi version : 1.0.0-beta2
Spark version : 3.5.2
Hive version : 3.1.3
Hadoop version : 3.3.4
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : yes
Additional context
Add any other context about the problem here.
Stacktrace
Any help / tips would be greatly appreciated!