Closed ehurheap closed 1 year ago
Maybe your commits are archived before their committed data files are cleaned?
You can check the earliest commit in your hoodie active timeline.
Prior to the earliest deltacommit
, we have 23 timestamps with cleans inflight: for example:
20221119132417991.clean
20221119132417991.clean.inflight
20221119132417991.clean.requested
Then the earliest deltacommit
is:
20221208164706388.deltacommit
20221208164706388.deltacommit.inflight
20221208164706388.deltacommit.requested
20221208164706388.savepoint
20221208164706388.savepoint.inflight
Following that filegroup there are a few replacecommit
s:
20221208174357204.replacecommit
20221208174357204.replacecommit.inflight
20221208174357204.replacecommit.requested
etc. Followed by more deltacommits
.
(This is from looking at the files in the S3 console - trying to use the hudi-cli commands is very slow, as it can take an hour to list commits for this table.)
I have attempted to run the cleaner as a separate step from the ingestion. The ingestion is now configured with
hoodie.clean.automatic -> false
hoodie.archive.automatic -> false
Using the hudi-cli I submitted this clean command:
cleans run --sparkMaster local[8] --sparkMemory 60G --hoodieConfigs "hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS hoodie.cleaner.hours.retained=1920 hoodie.cleaner.parallelism=400"
After an hour and a half, it failed with
...INFO S3NativeFileSystem: Opening 's3://path-to-table/users_changes-v1/.hoodie/20221208164706388.savepoint' for reading
#udi:users_changes->
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 1614"...
Failed to clean hoodie dataset
I tried tweaking the hudi-cli.sh script to specify -Xmx=90G
in the java command, but that did not help.
As an alternative I attempted the spark-submit
version of the cleans command like so:
spark-submit --deploy-mode cluster \
--conf spark.executor.instances=30 \
--conf spark.executor.cores=2 \
--conf spark.executor.memory=20G \
--conf spark.driver.memory=40G \
--conf spark.app.name=HoodieCleaner_users_changes-v1 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--class org.apache.hudi.utilities.HoodieCleaner /usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
--target-base-path s3://path-to-table/users_changes-v1 \
--hoodie-conf hoodie.metadata.enable=false \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
--hoodie-conf hoodie.cleaner.hours.retained=2000 \
--hoodie-conf hoodie.cleaner.parallelism=400 \
--hoodie-conf hoodie.clean.allow.multiple=false \
--hoodie-conf hoodie.archive.async=false \
--hoodie-conf hoodie.archive.automatic=false
After about an hour the application attempt dies and the driver logs show this:
23/03/22 00:43:50 ERROR Javalin: Exception occurred while servicing http-request
java.lang.OutOfMemoryError: null
at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161) ~[?:1.8.0_362]
at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155) ~[?:1.8.0_362]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125) ~[?:1.8.0_362]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596) ~[?:1.8.0_362]
at java.lang.StringBuilder.append(StringBuilder.java:195) ~[?:1.8.0_362]
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:460) ~[jackson-core-2.13.3.jar:2.13.3]
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:85) ~[jackson-core-2.13.3.jar:2.13.3]
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3827) ~[jackson-databind-2.13.3.jar:2.13.3]
at org.apache.hudi.timeline.service.RequestHandler.jsonifyResult(RequestHandler.java:198) ~[__app__.jar:0.11.1-amzn-0]
at org.apache.hudi.timeline.service.RequestHandler.writeValueAsStringSync(RequestHandler.java:209) ~[__app__.jar:0.11.1-amzn-0]
at org.apache.hudi.timeline.service.RequestHandler.writeValueAsString(RequestHandler.java:176) ~[__app__.jar:0.11.1-amzn-0]
at org.apache.hudi.timeline.service.RequestHandler.lambda$registerFileSlicesAPI$18(RequestHandler.java:384) ~[__app__.jar:0.11.1-amzn-0]
at org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:501) ~[__app__.jar:0.11.1-amzn-0]
at io.javalin.core.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:23) ~[__app__.jar:0.11.1-amzn-0]
at io.javalin.http.JavalinServlet$addHandler$protectedHandler$1.handle(JavalinServlet.kt:128) ~[__app__.jar:0.11.1-amzn-0]
at io.javalin.http.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:45) ~[__app__.jar:0.11.1-amzn-0]
at io.javalin.http.JavalinServlet$service$2$1.invoke(JavalinServlet.kt:24) ~[__app__.jar:0.11.1-amzn-0]
at io.javalin.http.JavalinServlet$service$1.invoke(JavalinServlet.kt:136) ~[__app__.jar:0.11.1-amzn-0]
at io.javalin.http.JavalinServlet$service$2.invoke(JavalinServlet.kt:40) ~[__app__.jar:0.11.1-amzn-0]
at io.javalin.http.JavalinServlet.service(JavalinServlet.kt:81) ~[__app__.jar:0.11.1-amzn-0]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) ~[javax.servlet-api-3.1.0.jar:3.1.0]
at io.javalin.websocket.JavalinWsServlet.service(JavalinWsServlet.kt:51) ~[__app__.jar:0.11.1-amzn-0]
I tried several attempts tweaking these parameters for each run:
--hoodie-conf hoodie.cleaner.hours.retained
--conf spark.executor.instances
--conf spark.executor.cores
--conf spark.executor.memory
--conf spark.driver.memory
Each time I got some error, for example both the following are errors that occurred in different runs:
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
and
23/03/21 20:39:21 ERROR RequestHandler: Got runtime exception servicing request partition=env_id%3D2907378054%2Fweek%3D20221121&basepath=s3%3A%2F%2Fpath-to-table%2Fusers_changes-v1&lastinstantts=20230321200037228&timelinehash=c015e055fa5d5d3f14376d8c4aee8b41e5be8cd928f0c72068646cb95f9365c5
java.lang.NegativeArraySizeException: null
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68) ~[?:1.8.0_362]
at java.lang.StringBuilder.<init>(StringBuilder.java:106) ~[?:1.8.0_362]
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:455) ~[jackson-core-2.13.3.jar:2.13.3]
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:85) ~[jackson-core-2.13.3.jar:2.13.3]
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3827) ~[jackson-databind-2.13.3.jar:2.13.3]
at org.apache.hudi.timeline.service.RequestHandler.jsonifyResult(RequestHandler.java:198) ~[__app__.jar:0.11.1-amzn-0]
at org.apache.hudi.timeline.service.RequestHandler.writeValueAsStringSync(RequestHandler.java:209) ~[__app__.jar:0.11.1-amzn-0]
at org.apache.hudi.timeline.service.RequestHandler.writeValueAsString(RequestHandler.java:176) ~[__app__.jar:0.11.1-amzn-0]
at org.apache.hudi.timeline.service.RequestHandler.lambda$registerFileSlicesAPI$18(RequestHandler.java:384) ~[__app__.jar:0.11.1-amzn-0]
at org.apache.hudi.timeline.service.RequestHandler$ViewHandler.handle(RequestHandler.java:501) ~[__app__.jar:0.11.1-amzn-0]
at io.javalin.core.security.SecurityUtil.noopAccessManager(SecurityUtil.kt:23) ~[__app__.jar:0.11.1-amzn-0]
Is there some way to reduce the size of the clean operation so that we don't run into these errors?
This ticket seems to describe similar symptoms
likely you have too huge no of partitions and hence jsonifying the results might result in OOM
can you try disabling embedded timeline server for couple of commits and re-enable cleaner. hoodie.embed.timeline.server=false.
hopefully this should make cleaner get past the issue.
Hi @nsivabalan - confirming that the cleaner worked with this config setting - THANKS!
Can you comment on why the timeline server does not have this problem during ingestion, only during cleaning?
cool, we identified some perf hit w/ timeline server when dealing w/ too many files. https://github.com/apache/hudi/pull/8480 we fixed it in latest master and could go into 0.13.1.
So, I suggested to disable timeline server to get past few cleaner where replaced file groups are cleaned up. once the clean up is done, may be number of file groups to deal with are less and hence the timeline server should be able to cope up w/ it (w/o the above fix)
Describe the problem you faced
We are running spark streaming ingestion with the following cleaner configs:
Ingestion commits happen about every 20-30 minutes. However using the hudi-cli I can see that the cleans occur far less frequently, and at some point about 3 weeks ago cleans stopped happening altogether. When the ingestion was restarted, it stalled on
Generating list of file slices to be cleaned:
, and eventually the executors ran out of memory and the job failed.To allow ingestion to proceed we redeployed with automatic cleaner disabled.
Questions:
To Reproduce
Steps to reproduce the behavior:
cleans show
Expected behavior The cleaner table service is invoked immediately after each commit.
Environment Description
Hudi version : 0.12.1
Spark version : 3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
additional write configs include