Closed akpatnam25 closed 3 weeks ago
cc @FMX @SteNicholas @mridulm @zhouyejoe
I think this change can make retrospective debugging hard as we will have no insight into which workers the shuffle data was written to and what was the behaviour of those workers during that time.
Instead of making it completely a debug log IMO we should at least output some basic info about the worker host selected for shuffle this will reduce the noise but also leave some space for debugging issues.
Ex –
Host: 10.152.63.41
WorkerStatus: WorkerStatus{state=Normal, stateStartTime=1724244049248}
NumOfSlotsAllocated: 100
I think this change can make retrospective debugging hard as we will have no insight into which workers the shuffle data was written to and what was the behaviour of those workers during that time.
Instead of making it completely a debug log IMO we should at least output some basic info about the worker host selected for shuffle this will reduce the noise but also leave some space for debugging issues.
Ex –
Host: 10.152.63.41 WorkerStatus: WorkerStatus{state=Normal, stateStartTime=1724244049248} NumOfSlotsAllocated: 100
I think this will still be noisy in a large fleet, especially in the driver logs where users might be looking for other logs/errors. I feel that if truly needed for debugging, the user can then enable debug logging for it.
I think this change can make retrospective debugging hard as we will have no insight into which workers the shuffle data was written to and what was the behaviour of those workers during that time. Instead of making it completely a debug log IMO we should at least output some basic info about the worker host selected for shuffle this will reduce the noise but also leave some space for debugging issues. Ex –
Host: 10.152.63.41 WorkerStatus: WorkerStatus{state=Normal, stateStartTime=1724244049248} NumOfSlotsAllocated: 100
I think this will still be noisy in a large fleet, especially in the driver logs where users might be looking for other logs/errors. I feel that if truly needed for debugging, the user can then enable debug logging for it.
Maybe print this log to one line can be helpful.
I think this change can make retrospective debugging hard as we will have no insight into which workers the shuffle data was written to and what was the behaviour of those workers during that time. Instead of making it completely a debug log IMO we should at least output some basic info about the worker host selected for shuffle this will reduce the noise but also leave some space for debugging issues. Ex –
Host: 10.152.63.41 WorkerStatus: WorkerStatus{state=Normal, stateStartTime=1724244049248} NumOfSlotsAllocated: 100
I think this will still be noisy in a large fleet, especially in the driver logs where users might be looking for other logs/errors. I feel that if truly needed for debugging, the user can then enable debug logging for it.
Maybe print this log to one line can be helpful.
Setting the logger levels for components could be the ultimate solution.
Maybe print this log to one line can be helpful.
@FMX i created a PR sometime back to support single line logging – https://github.com/apache/celeborn/pull/2672. Let me wdyt about this, i'll extend it to other places.
What changes were proposed in this pull request?
Change client side reserve slots log to debug.
Why are the changes needed?
in a large fleet of workers such as ours, it is noisy for the user in the driver logs.
Does this PR introduce any user-facing change?
No
How was this patch tested?
N/A