apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned #2711

Open s0nskar opened 2 weeks ago

s0nskar commented 2 weeks ago

What changes were proposed in this pull request?

Add a worker metric that publishes the unreleased shuffle count when a worker is decommissioned.

Why are the changes needed?

Currently, Celeborn doesn't publish the count of unreleased shuffle keys that are lost when a worker is decommissioned. This count is useful for monitoring and for tuning forceExitTimeout.
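
To illustrate the idea, here is a minimal, self-contained sketch, not the code in this patch. `GaugeRegistry` and `DecommissionSketch` are hypothetical stand-ins rather than Celeborn classes, and the polling loop is only an assumed model of how a worker might wait for shuffles to be released before forceExitTimeout expires.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for a metrics source; not Celeborn's actual API.
class GaugeRegistry {
    private final Map<String, Supplier<Integer>> gauges = new ConcurrentHashMap<>();

    void addGauge(String name, Supplier<Integer> supplier) {
        gauges.put(name, supplier);
    }

    int read(String name) {
        return gauges.get(name).get();
    }
}

// Hypothetical worker model that tracks registered shuffle keys.
class DecommissionSketch {
    static final String METRIC = "UnreleasedShuffleCount";

    private final Set<String> shuffleKeys = ConcurrentHashMap.newKeySet();
    private volatile int unreleasedAtExit = 0;

    DecommissionSketch(GaugeRegistry registry) {
        // The gauge reports the count captured at decommission time.
        registry.addGauge(METRIC, () -> unreleasedAtExit);
    }

    void registerShuffle(String key) { shuffleKeys.add(key); }

    void releaseShuffle(String key) { shuffleKeys.remove(key); }

    // Wait up to forceExitTimeoutMs for shuffles to be released,
    // then record how many keys are still unreleased.
    void decommission(long forceExitTimeoutMs, long checkIntervalMs) {
        long deadline = System.currentTimeMillis() + forceExitTimeoutMs;
        while (!shuffleKeys.isEmpty() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(checkIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
        }
        unreleasedAtExit = shuffleKeys.size();
    }
}
```

A non-zero value after decommission would suggest that the timeout was too short for the shuffles still held by the worker, which is the signal this metric is meant to expose.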

Does this PR introduce any user-facing change?

NO

How was this patch tested?

NA

SteNicholas commented 2 weeks ago

@s0nskar, how do you use the unreleased shuffle count in production?

s0nskar commented 2 weeks ago

@SteNicholas We're not in production yet, but this will help us tune the forceExitTimeout config and see whether the default value works for us. Since we probably won't enable replication for a lot of jobs, we want to avoid losing shuffle data when a worker exits.

SteNicholas commented 2 weeks ago

@s0nskar, please add the UnreleasedShuffleCount metric to the celeborn-dashboard.json file.