Closed nicochen closed 3 days ago
I think it might be more reasonable to adjust the value of spark.cleaner.periodicGC.interval.
From the Spark code, the parameter spark.cleaner.referenceTracking controls whether the context cleaner is created. Inside the ContextCleaner, two threads are managed: a cleaning thread that releases unused RDD, shuffle, and broadcast state, and a periodic GC thread that calls System.gc() every spark.cleaner.periodicGC.interval period.
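The periodic GC thread described above can be sketched roughly as follows. This is a simplified illustration, not Spark's actual implementation: the class and method names are hypothetical, and only the scheduling shape (a single-threaded daemon executor invoking System.gc() at a fixed interval) mirrors what the ContextCleaner does.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicGcSketch {
    // Hypothetical mirror of the ContextCleaner's periodic GC service:
    // a single daemon thread that calls System.gc() every intervalSeconds.
    public static ScheduledExecutorService startPeriodicGc(long intervalSeconds) {
        ScheduledExecutorService service =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "context-cleaner-periodic-gc");
                t.setDaemon(true); // daemon thread, like Spark's cleaner threads
                return t;
            });
        // First run fires only after one full interval has elapsed.
        service.scheduleAtFixedRate(System::gc,
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
        return service;
    }

    public static void main(String[] args) {
        // With Spark's default of 30 minutes, no GC is requested until
        // 30 minutes after startup; here we just start and stop the service.
        ScheduledExecutorService gc = startPeriodicGc(30 * 60);
        gc.shutdownNow();
    }
}
```

Because the first invocation is delayed by one full interval, enlarging the interval pushes the explicit GC arbitrarily far into the future, but the thread itself still exists as long as the cleaner is enabled.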
I think the role of the cleaning thread is quite important, especially when we run a long-lived, resident SparkContext, since it releases disk resources in time.
On the other hand, GC triggered inside AMS is not a big problem: the parameter spark.cleaner.periodicGC.interval can be adjusted to a very large value to avoid actively triggering GC.
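The workaround suggested here could look like the following configuration fragment. This is an assumed example: the interval value is arbitrary, and Spark does not document an "off" value for this setting, so a very large interval is the closest approximation to disabling it.

```properties
# spark-defaults.conf (illustrative values)
# Keep the cleaner so shuffle/RDD state is still released eagerly...
spark.cleaner.referenceTracking      true
# ...but push the explicit System.gc() so far out it never fires in practice.
spark.cleaner.periodicGC.interval    100000min
```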
@baiyangtx I also considered the spark.cleaner.periodicGC.interval config key before, but as you said we can only set it to an extremely large number rather than disabling it. In our production case, the full GC really matters: on JDK 8, a G1 full GC falls back to a single-threaded collection, so it takes more than 30 seconds, triggers a ZooKeeper session timeout, and causes an AMS failover, which is unacceptable. Also, I believe the local terminal is designed for lightweight, infrequent SQL tasks and would not produce much RDD/shuffle garbage; heavy tasks should go to Kyuubi or Spark. Thus, I chose overall stability over enlarging the GC interval.
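For completeness, the other knob mentioned earlier in the thread can switch the cleaner off entirely. Whether this PR takes that approach is not stated here; this is only a sketch of the alternative, with its trade-off spelled out.

```properties
# spark-defaults.conf (illustrative)
# Disabling reference tracking prevents the ContextCleaner from being
# created at all, which also removes its periodic System.gc() thread.
# Trade-off: shuffle files and RDD/broadcast state are no longer
# cleaned up eagerly, so a resident context may leak disk space.
spark.cleaner.referenceTracking    false
```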
@baiyangtx What do you think about the PR now?
This PR is related to issue #2969 and has been manually tested locally.
Why are the changes needed?
Close #2969.
Brief change log
-
How was this patch tested?
[ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
[ ] Add screenshots for manual tests if appropriate
[x] Run test locally before making a pull request
Documentation