apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table #7800

Open phani482 opened 1 year ago

phani482 commented 1 year ago

Hello Team,

We are running a Glue streaming job which reads from Kinesis and writes to a Hudi COW table (S3) registered in the Glue catalog. The job has been running for ~1 year without issues. However, we recently started seeing OOM errors as below, without much insight from the logging.

a. I tried moving [.commits.archive] files out of the .hoodie folder to reduce its size (e.g. s3:///prefix/.hoodie/.commits.archive.1763_1-0-1). This helped for a while, but the issue started to surface again.
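For illustration only, a minimal boto3 sketch of that workaround; the bucket, prefixes, and filename filter below are placeholders rather than values from the actual job, and the archived commit files are copied to a side prefix before being deleted:

```python
import boto3

BUCKET = "my-bucket"                              # placeholder
HOODIE_PREFIX = "prefix/.hoodie/"                 # placeholder table metadata prefix
BACKUP_PREFIX = "prefix/.hoodie-archive-backup/"  # placeholder side location

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=HOODIE_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Only touch archived-commit files, e.g. ".commits_.archive.1763_1-0-1".
        name = key.rsplit("/", 1)[-1]
        if ".commits" in name and ".archive" in name:
            s3.copy_object(Bucket=BUCKET,
                           CopySource={"Bucket": BUCKET, "Key": key},
                           Key=BACKUP_PREFIX + name)
            s3.delete_object(Bucket=BUCKET, Key=key)
```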

b. Here are the write options we are using for Apache Hudi Connector 0.9.0:

```python
"hoodie.datasource.write.operation": "insert",
"hoodie.insert.shuffle.parallelism": 10,
"hoodie.bulkinsert.shuffle.parallelism": 10,
"hoodie.upsert.shuffle.parallelism": 10,
"hoodie.delete.shuffle.parallelism": 10,
"hoodie.parquet.small.file.limit": 8 * 1000 * 1000,    # 8 MB
"hoodie.parquet.max.file.size": 10 * 1000 * 1000,      # 10 MB
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.enable": "false",
"hoodie.datasource.hive_sync.database": "database_name",
"hoodie.datasource.hive_sync.table": "raw_table_name",
"hoodie.datasource.hive_sync.partition_fields": "entity_name",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.keep.min.commits": 1450,
"hoodie.keep.max.commits": 1500,
"hoodie.cleaner.commits.retained": 1449,
```

Error:

```
INFO:py4j.java_gateway:Received command on object id
INFO:py4j.java_gateway:Closing down callback connection
INFO:py4j.java_gateway:Callback Connection ready to receive messages
INFO:py4j.java_gateway:Received command c on object id p0
INFO:root:Batch ID: 160325 has 110 records
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 7"...
```

================

Q: We noticed that ".commits_.archive" files are not being cleaned up by Hudi by default. Are there any settings we need to enable for this to happen?

Q: Our .hoodie folder was ~1.5 GB in size before we started moving archive files out of it. Is this a huge size for a .hoodie folder? What are the best practices for maintaining the .hoodie folder in terms of size and object count?
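As a reference point for the size question, a small boto3 sketch (bucket and prefix are placeholders) that reports the total size and object count under a .hoodie prefix:

```python
import boto3

BUCKET = "my-bucket"               # placeholder
HOODIE_PREFIX = "prefix/.hoodie/"  # placeholder table metadata prefix

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_bytes = 0
object_count = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=HOODIE_PREFIX):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        object_count += 1

print(f".hoodie size: {total_bytes / 1e9:.2f} GB across {object_count} objects")
```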

Q: The error logs don't give more details, but even using 20 G.1X DPUs on Glue does not seem to help (executor memory: 10 GB, driver memory: 10 GB, executor cores: 8). Our workload is not huge: we get a few thousand events every hour, and on average our job processes about 1 million records a day. The payload size is not more than ~300 KB.

Please let me know if you need any further details

Thanks

[Screenshot attached: Screen Shot 2023-01-30 at 9 00 36 PM]

danny0405 commented 1 year ago

Thanks for the feedback @phani482. Sorry to tell you that cleaning of archival files is not supported right now; I have created a JIRA issue to track this: https://issues.apache.org/jira/browse/HUDI-5659

I also noticed that you use the INSERT operation, so which Spark stage did you perceive as slow?

phani482 commented 1 year ago

It is not slowness; our jobs are failing with the above error during the Hudi write. Is it an issue if we remove archive files from the .hoodie folder?

  1. Does Hudi ignore archive files in the .hoodie folder? Will it read archive files into the timeline server?
  2. For a long-running streaming job, what are the best practices to manage the metadata folder (.hoodie) to avoid out-of-memory errors?
  3. Are there any Spark heap settings that need to be tuned? The Hudi documentation is not clear enough on this.
  4. We reduced the following settings by 100 commits each to overcome the issue for the time being (see the sketch just below), but we are unable to understand the actual correlation between these settings and why we got the java.lang.OutOfMemoryError: Requested array size exceeds VM limit error. Can someone please shed some light here?
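For concreteness, a sketch of that "reduce by 100" adjustment, assuming each of the three retention settings from the write options above is simply lowered by 100; the values are derived from this thread, and the assertion reflects Hudi's guideline that cleaner retention should stay below the archival bounds:

```python
# Hypothetical overrides derived from "reduced by 100 commits each";
# the original values were 1449 / 1450 / 1500.
retention_overrides = {
    "hoodie.cleaner.commits.retained": 1349,
    "hoodie.keep.min.commits": 1350,
    "hoodie.keep.max.commits": 1400,
}

# Sanity check: retained < keep.min < keep.max, so that archival never removes
# commits the cleaner still needs.
assert (retention_overrides["hoodie.cleaner.commits.retained"]
        < retention_overrides["hoodie.keep.min.commits"]
        < retention_overrides["hoodie.keep.max.commits"])
```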

We keep reducing these commit settings as a short-term fix, but since our downstream jobs consume time-series data with this Hudi table as the source, we cannot reduce these numbers further. As of now we are trying to maintain two days' worth of commits so that our downstream job (glue-job2) has enough time to process the commits in case of any unexpected slowness or issues.
The flow looks like this: Kinesis > (glue-job1) > Hudi (insert) > (glue-job2) > Hudi (upsert).

Looking forward to your insights. Thanks!

nsivabalan commented 1 year ago

Hey @phani482, sorry for the late turnaround. Have you enabled meta sync by any chance? We recently found an issue where meta sync is loading the archival timeline unnecessarily.

https://github.com/apache/hudi/pull/7561

If you can try with 0.13.0 and let us know what you see, that would be nice. Or you can cherry-pick this commit into your internal fork if you have one.

nsivabalan commented 1 year ago

As for trimming down the number of files, we don't have any automatic support as of now, but we will be working on it. If you are interested in working on it, let us know and we can guide you.

phani482 commented 1 year ago

Thanks! @nsivabalan

We will try it out and see if this fixes our issue, although it could take some time for us to implement in prod. We will post here whenever we do the upgrade. Thanks again!