Open s0nskar opened 2 weeks ago
@s0nskar, how have you used the unreleased shuffle count in production?
@SteNicholas We're not in production yet, but this will help us tune the forceExitTimeout config and see whether the default value works for us. Since we probably won't enable replication for a lot of jobs, we want to make sure shuffle data is not lost when a worker exits.
@s0nskar, please add the UnreleasedShuffleCount metric to the celeborn-dashboard.json file.
What changes were proposed in this pull request?
Add a worker metric that publishes the unreleased shuffle count when a worker is decommissioned.
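For readers unfamiliar with gauge-style metrics, here is a rough, self-contained sketch of the idea. It is not the code in this PR and does not use Celeborn's metrics classes: the class `UnreleasedShuffleGaugeSketch`, its methods, and the shuffle key strings are all hypothetical names chosen for illustration. It only shows a count that is re-evaluated each time it is read, so a value sampled at exit time reflects the shuffle keys that were never released.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical illustration only; not Celeborn's worker or metrics API.
final class UnreleasedShuffleGaugeSketch {
    // Stand-in for the worker's set of shuffle keys that have not been released yet.
    private final Set<String> unreleasedShuffleKeys = ConcurrentHashMap.newKeySet();

    // Gauge-style metric: the supplier is evaluated on every read,
    // so it always reports the current size of the set.
    private final Supplier<Integer> unreleasedShuffleCount = unreleasedShuffleKeys::size;

    // Called when the worker starts holding data for a shuffle.
    public void registerShuffle(String shuffleKey) {
        unreleasedShuffleKeys.add(shuffleKey);
    }

    // Called when the shuffle's data has been released.
    public void releaseShuffle(String shuffleKey) {
        unreleasedShuffleKeys.remove(shuffleKey);
    }

    // What a metrics sink would poll, for example just before a decommissioned worker exits.
    public int readGauge() {
        return unreleasedShuffleCount.get();
    }
}
```

Under these assumptions, a non-zero reading at exit time would suggest that forceExitTimeout was too short for the remaining shuffles to be released.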
Why are the changes needed?
Currently, Celeborn doesn't publish the count of unreleased shuffle keys that are lost when a worker is decommissioned. This count can be useful for monitoring and for configuring forceExitTimeout.
Does this PR introduce any user-facing change?
NO
How was this patch tested?
NA