Open phani482 opened 1 year ago
Thanks for the feedback @phani482. Sorry to tell you that cleaning of archived files is not supported right now; I have created a JIRA issue to track this: https://issues.apache.org/jira/browse/HUDI-5659
I also noticed that you use the INSERT operation, so which Spark stage did you perceive as slow?
It's not slowness; our jobs are failing with the above error during the Hudi write. Is it an issue if we remove archive files from the .hoodie folder?
We keep reducing these commit-retention numbers as a short-term fix, but since we consume time-series data in our downstream jobs with this Hudi table as the source, we cannot reduce them further. As of now we are trying to maintain 2 days' worth of commits so that our downstream jobs (glue-job2) have enough time to process the commits in case of any unexpected slowness, issues, etc.
{Flow looks like this: kinesis > (glue-job1) > hudi (insert) > (glue-job2) > hudi (upsert)}
query = (
    my_stream.writeStream
    .foreachBatch(batch_write_hudi)
    .option("checkpointLocation", f"{args['checkpoint_location']}/")
    .trigger(processingTime="120 seconds")
    .start()
)
query.awaitTermination()
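The `batch_write_hudi` callback referenced above isn't shown in the issue; below is a minimal sketch of what such a foreachBatch writer typically looks like. `TABLE_PATH`, the record-key field, and the exact option set are illustrative assumptions, not taken from the report (the table and partition names echo the options listed later in the thread):

```python
# Hypothetical sketch of the batch_write_hudi callback used above.
# TABLE_PATH and HUDI_OPTIONS are illustrative assumptions, not the
# reporter's actual configuration.
TABLE_PATH = "s3://bucket/prefix"

HUDI_OPTIONS = {
    "hoodie.table.name": "raw_table_name",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.recordkey.field": "id",  # assumed key field
    "hoodie.datasource.write.partitionpath.field": "entity_name",
}

def batch_write_hudi(batch_df, batch_id):
    # Each micro-batch DataFrame is appended to the COW table as an insert.
    (batch_df.write.format("hudi")
        .options(**HUDI_OPTIONS)
        .mode("append")
        .save(TABLE_PATH))
```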
=====
Looking forward to your insights. Thanks!
hey @phani482, sorry for the late turnaround. Have you enabled meta sync by any chance? Recently we found an issue where meta sync is loading the archived timeline unnecessarily.
https://github.com/apache/hudi/pull/7561
If you can try with 0.13.0 and let us know what you see, that would be nice. Or you can cherry-pick this commit into your internal fork if you have one.
As far as trimming down the number of archive files, we don't have any automatic support as of now, but we will be working on it. If you are interested in working on it, let us know and we can guide you.
Thanks! @nsivabalan
Will try it out and see if this fixes our issue, although it could take some time for us to roll out in prod. Will post here whenever we do the upgrade. Thanks again!
Hello Team,
We are running a Glue streaming job which reads from Kinesis and writes to a Hudi COW table (S3) registered in the Glue catalog. The job has been running for ~1 year without issues. However, lately we started seeing OOM errors as below, without much insight from the logging.
a. I tried moving the .commits_.archive files out of the .hoodie folder to reduce its size. This helped for a while, but the issue started to surface again. (e.g. s3:///prefix/.hoodie/.commits_.archive.1763_1-0-1)
b. Here are the write options we are using with Apache Hudi Connector 0.9.0:

"hoodie.datasource.write.operation": "insert",
"hoodie.insert.shuffle.parallelism": 10,
"hoodie.bulkinsert.shuffle.parallelism": 10,
"hoodie.upsert.shuffle.parallelism": 10,
"hoodie.delete.shuffle.parallelism": 10,
"hoodie.parquet.small.file.limit": 8 * 1000 * 1000,  # 8 MB
"hoodie.parquet.max.file.size": 10 * 1000 * 1000,  # 10 MB
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.enable": "false",
"hoodie.datasource.hive_sync.database": "database_name",
"hoodie.datasource.hive_sync.table": "raw_table_name",
"hoodie.datasource.hive_sync.partition_fields": "entity_name",
"hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.keep.min.commits": 1450,
"hoodie.keep.max.commits": 1500,
"hoodie.cleaner.commits.retained": 1449,
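As a sanity check on the retention numbers above: with the 120-second trigger shown earlier, a 2-day window corresponds to roughly 1440 commits, which lines up with the keep.min/max settings. A quick back-of-the-envelope calculation (assuming one Hudi commit per micro-batch, which may not hold if some batches are empty):

```python
# Rough sizing of commit retention for a 2-day window,
# assuming one Hudi commit per 120-second micro-batch.
TRIGGER_SECONDS = 120
RETENTION_DAYS = 2

commits_per_day = 24 * 3600 // TRIGGER_SECONDS        # 720 commits/day
commits_to_retain = RETENTION_DAYS * commits_per_day  # 1440 commits

print(commits_per_day, commits_to_retain)
```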
Error:
INFO:py4j.java_gateway:Received command on object id
INFO:py4j.java_gateway:Closing down callback connection
INFO:py4j.java_gateway:Callback Connection ready to receive messages
INFO:py4j.java_gateway:Received command c on object id p0
INFO:root:Batch ID: 160325 has 110 records
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 7"...
================
Q: We noticed that .commits_.archive files are not being cleaned up by Hudi by default. Are there any settings we need to enable for this to happen?
Q: Our .hoodie folder was ~1.5 GB in size before we started moving archive files out of the folder. Is this a huge size for a .hoodie folder? What are the best practices for maintaining the .hoodie folder in terms of size and object count?
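For auditing the .hoodie folder size, one option is to sync it down locally (e.g. with `aws s3 sync`) and tally the archived-commit files. A small sketch; the helper name is ours, not a Hudi utility:

```python
import os

def archive_footprint(hoodie_dir):
    """Count and sum the sizes of archived-commit files under a local
    copy of the .hoodie folder (pulled down with e.g. `aws s3 sync`).

    Returns (file_count, total_bytes). The helper name and approach are
    illustrative assumptions, not part of Hudi itself."""
    total, count = 0, 0
    for root, _dirs, files in os.walk(hoodie_dir):
        for name in files:
            if ".commits_.archive" in name:
                total += os.path.getsize(os.path.join(root, name))
                count += 1
    return count, total
```

Comparing the archive footprint against the rest of .hoodie shows how much of the ~1.5 GB is the archived timeline versus active metadata.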
Q: The error logs don't give more details, but even using 20 G.1X DPUs on Glue doesn't seem to help (executor memory: 10 GB, driver memory: 10 GB, executor cores: 8). Our workload is not huge: we get a few thousand events every hour, on average ~1 million records a day, and the payload size is not more than ~300 KB.
Please let me know if you need any further details
Thanks