apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
https://amoro.apache.org/
Apache License 2.0
747 stars 260 forks

[Improvement]: Eliminate AMS Full GC impact deriving from local terminal clean spark context #2969 #2973

Closed nicochen closed 3 days ago

nicochen commented 1 week ago

This PR is related to issue #2969 and has been manually tested locally.

Why are the changes needed?

Close #2969.

Brief change log

-

How was this patch tested?

Documentation

baiyangtx commented 1 week ago

I think it might be more reasonable to adjust the value of spark.cleaner.periodicGC.interval.

From the Spark source code, the parameter spark.cleaner.referenceTracking controls whether the context cleaner is created.


Inside the ContextCleaner, two threads are managed.


I think the role of the cleanThread is quite important, especially when we run a resident Spark Context, because it releases disk resources in time. On the other hand, the GC triggered in AMS is not a big problem: the parameter spark.cleaner.periodicGC.interval can be set to a very large value to avoid actively triggering GC.
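
The two options under discussion can be written down as plain Spark configuration entries. The following is only a minimal sketch, not code from Amoro or from this PR: the helper name cleaner_conf, the strategy names, and the 3650d interval are hypothetical and chosen for illustration. Only the two configuration keys come from Spark.

```python
# Hypothetical helper: returns Spark conf entries for the two approaches
# discussed in this thread. Only the two config keys are Spark's own;
# the function name, strategy names, and default interval are illustrative.

def cleaner_conf(strategy: str, gc_interval: str = "3650d") -> dict:
    if strategy == "large-interval":
        # Keep the ContextCleaner and its cleanup work, but push the
        # periodic GC trigger far into the future.
        return {
            "spark.cleaner.referenceTracking": "true",
            "spark.cleaner.periodicGC.interval": gc_interval,
        }
    if strategy == "no-cleaner":
        # Do not create the ContextCleaner at all, so neither of its
        # threads is started.
        return {"spark.cleaner.referenceTracking": "false"}
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    for name in ("large-interval", "no-cleaner"):
        print(name, cleaner_conf(name))
```

The first strategy keeps resource cleanup and only suppresses the periodic GC. The second gives up the cleaner entirely, which is the trade-off being debated here.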

nicochen commented 1 week ago

@baiyangtx I also considered the spark.cleaner.periodicGC.interval config key before, but as you said, we can only set it to an extremely large number rather than disable it. In our production case, the full GC really matters. JDK 1.8 uses parallel GC as the full GC strategy for G1. As a result, a full GC takes more than 30 seconds, which triggers a ZooKeeper timeout and an AMS failover, and that is unacceptable. Also, I believe the local terminal is designed for lightweight and infrequent SQL tasks, so it should not produce much RDD or shuffle garbage. Heavy tasks should go to Kyuubi or Spark. Thus, I chose overall stability over enlarging the GC interval.
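
The failure mode described here is a timing argument: a JVM frozen by a long stop-the-world pause cannot send heartbeats, so its coordination session can expire. The sketch below is a deliberately simplified model, not ZooKeeper's actual expiry logic. The function name and the 30-second session timeout are assumptions for illustration; only the "more than 30 seconds" pause comes from the comment.

```python
# Simplified model, not ZooKeeper's real expiry logic: if the process is
# frozen for at least the whole session timeout, it cannot send a heartbeat
# in that window, so the session is treated as expired.

def pause_expires_session(pause_s: float, session_timeout_s: float) -> bool:
    return pause_s >= session_timeout_s


# Assumed value for illustration only; the thread does not state the timeout.
ASSUMED_SESSION_TIMEOUT_S = 30.0

if __name__ == "__main__":
    # A full GC pause longer than the assumed timeout.
    print(pause_expires_session(35.0, ASSUMED_SESSION_TIMEOUT_S))
    # A short pause well inside the timeout.
    print(pause_expires_session(0.2, ASSUMED_SESSION_TIMEOUT_S))
```

Under this model, any pause at or above the session timeout is enough to lose the session, regardless of how rarely the GC is triggered.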

zhoujinsong commented 6 days ago

@baiyangtx What do you think about the PR now?