apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0
862 stars 351 forks source link

[CELEBORN-1573] Change to debug logging on client side for reserve slots #2700

Closed akpatnam25 closed 3 weeks ago

akpatnam25 commented 3 weeks ago

What changes were proposed in this pull request?

Change client side reserve slots log to debug.

Why are the changes needed?

in a large fleet of workers such as ours, it is noisy for the user in the driver logs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

N/A

akpatnam25 commented 3 weeks ago

cc @FMX @SteNicholas @mridulm @zhouyejoe

s0nskar commented 3 weeks ago

I think this change can make retrospective debugging hard as we will have no insight into which workers the shuffle data was written to and what was the behaviour of those workers during that time.

Instead of making it completely a debug log IMO we should at least output some basic info about the worker host selected for shuffle this will reduce the noise but also leave some space for debugging issues.

Ex –

Host: 10.152.63.41
WorkerStatus: WorkerStatus{state=Normal, stateStartTime=1724244049248}
NumOfSlotsAllocated: 100
akpatnam25 commented 3 weeks ago

I think this change can make retrospective debugging hard as we will have no insight into which workers the shuffle data was written to and what was the behaviour of those workers during that time.

Instead of making it completely a debug log IMO we should at least output some basic info about the worker host selected for shuffle this will reduce the noise but also leave some space for debugging issues.

Ex –

Host: 10.152.63.41
WorkerStatus: WorkerStatus{state=Normal, stateStartTime=1724244049248}
NumOfSlotsAllocated: 100

I think this will still be noisy in a large fleet, especially in the driver logs where users might be looking for other logs/errors. I feel that if truly needed for debugging, the user can then enable debug logging for it.

FMX commented 3 weeks ago

I think this change can make retrospective debugging hard as we will have no insight into which workers the shuffle data was written to and what was the behaviour of those workers during that time. Instead of making it completely a debug log IMO we should at least output some basic info about the worker host selected for shuffle this will reduce the noise but also leave some space for debugging issues. Ex –

Host: 10.152.63.41
WorkerStatus: WorkerStatus{state=Normal, stateStartTime=1724244049248}
NumOfSlotsAllocated: 100

I think this will still be noisy in a large fleet, especially in the driver logs where users might be looking for other logs/errors. I feel that if truly needed for debugging, the user can then enable debug logging for it.

Maybe print this log to one line can be helpful.

FMX commented 3 weeks ago

I think this change can make retrospective debugging hard as we will have no insight into which workers the shuffle data was written to and what was the behaviour of those workers during that time. Instead of making it completely a debug log IMO we should at least output some basic info about the worker host selected for shuffle this will reduce the noise but also leave some space for debugging issues. Ex –

Host: 10.152.63.41
WorkerStatus: WorkerStatus{state=Normal, stateStartTime=1724244049248}
NumOfSlotsAllocated: 100

I think this will still be noisy in a large fleet, especially in the driver logs where users might be looking for other logs/errors. I feel that if truly needed for debugging, the user can then enable debug logging for it.

Maybe print this log to one line can be helpful.

Setting the logger levels for components could be the ultimate solution.

s0nskar commented 3 weeks ago

Maybe print this log to one line can be helpful.

@FMX i created a PR sometime back to support single line logging – https://github.com/apache/celeborn/pull/2672. Let me wdyt about this, i'll extend it to other places.