apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[CELEBORN-1705] Fix disk buffer size is negative issue #2916

Closed turboFei closed 2 weeks ago

turboFei commented 2 weeks ago

What changes were proposed in this pull request?

Fix the issue where the reported disk buffer size becomes negative.

Previously, when PartitionDataWriter wrote to a file backed by memory file storage, the steps were:

  1. If isMemoryShuffleFile is true, increment the memory file storage counter.
  2. Check whether eviction is needed; if so, flush the buffer and then set isMemoryShuffleFile to false.
  3. Add the data to flushBuffer.
  4. If the memory file storage was evicted, the data buffer is eventually released as disk buffer.

As a result, the data that triggered the eviction is counted as memory file storage but released as disk buffer, so the disk buffer size ends up negative and the memory file storage size stays positive.

In this PR, we update the counter after the eviction has finished.
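The effect of the ordering can be sketched with a toy model. This is only an illustration under simplifying assumptions, not the actual Celeborn code: the class `BufferAccountingSketch`, its `Counters` and `Writer` types, the `countAfterEvict` switch, and the threshold of 100 bytes are all hypothetical names and values chosen for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of the accounting order. All names are hypothetical, not Celeborn's classes. */
final class BufferAccountingSketch {

    /** Stand-ins for the two sizes reported in the worker log. */
    static final class Counters {
        final AtomicLong memoryFileStorage = new AtomicLong();
        final AtomicLong diskBuffer = new AtomicLong();
    }

    static final class Writer {
        private final Counters counters;
        private final long evictThreshold;
        private final boolean countAfterEvict; // false = old order, true = order after this PR
        private boolean isMemoryShuffleFile = true;
        private long bufferedBytes = 0;

        Writer(Counters counters, long evictThreshold, boolean countAfterEvict) {
            this.counters = counters;
            this.evictThreshold = evictThreshold;
            this.countAfterEvict = countAfterEvict;
        }

        void write(long numBytes) {
            if (!countAfterEvict) {
                count(numBytes); // old order: counted while the file is still memory-backed
            }
            if (isMemoryShuffleFile && bufferedBytes + numBytes > evictThreshold) {
                flush();                     // already-buffered bytes are released as memory
                isMemoryShuffleFile = false; // every later release hits the disk counter
            }
            if (countAfterEvict) {
                count(numBytes); // new order: counted with the storage type after eviction
            }
            bufferedBytes += numBytes;
        }

        /** Releases the buffered bytes against whichever counter matches the current flag. */
        void flush() {
            counter().addAndGet(-bufferedBytes);
            bufferedBytes = 0;
        }

        private void count(long numBytes) {
            counter().addAndGet(numBytes);
        }

        private AtomicLong counter() {
            return isMemoryShuffleFile ? counters.memoryFileStorage : counters.diskBuffer;
        }
    }

    /** Runs the same three writes and returns {memoryFileStorage, diskBuffer}. */
    static long[] simulate(boolean countAfterEvict) {
        Counters counters = new Counters();
        Writer writer = new Writer(counters, 100, countAfterEvict);
        writer.write(60); // fits in memory
        writer.write(60); // 120 > 100, triggers eviction
        writer.write(30); // written after the file became disk-backed
        writer.flush();   // final release
        return new long[] {counters.memoryFileStorage.get(), counters.diskBuffer.get()};
    }

    public static void main(String[] args) {
        long[] oldOrder = simulate(false);
        long[] newOrder = simulate(true);
        System.out.println("old order: memory=" + oldOrder[0] + " disk=" + oldOrder[1]);
        System.out.println("new order: memory=" + newOrder[0] + " disk=" + newOrder[1]);
    }
}
```

In this model the old order leaves the two counters at +60 and -60, equal in magnitude and opposite in sign, while the new order returns both to 0. The worker log quoted in this description has the same shape: 748726 B is about 731.2 KiB.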

Why are the changes needed?

With no active applications running in the Celeborn cluster, the Celeborn worker log still showed abnormal values:

24/11/09 23:30:50,474 INFO [worker-memory-manager-reporter] MemoryManager: Direct memory usage: 276.0 MiB/40.0 GiB, disk buffer size: -748726.0 B, sort memory size: 0.0 B, read buffer size: 0.0 B, memory file storage size : 731.2 KiB
disk buffer size: -748726.0 B
memory file storage size : 731.2 KiB

Both of them are expected to be 0.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and integration testing.

turboFei commented 2 weeks ago

cc @FMX @RexXiong