apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[CELEBORN-1571] Fix flaky test - pushdata timeout will add to pushExcludedWorker #2697

Open cxzl25 opened 3 weeks ago

cxzl25 commented 3 weeks ago

What changes were proposed in this pull request?

Why are the changes needed?

Because the worker port is in use, the driver's status for that worker may change from shutdown to unknown, which causes the test assertion to fail.

https://github.com/apache/celeborn/actions/runs/10465286274/job/28980278764

- celeborn spark integration test - pushdata timeout will add to pushExcludedWorkers *** FAILED ***
  WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_PRIMARY, and WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_REPLICA (PushDataTimeoutTest.scala:150)

unit-tests.log

24/08/20 05:28:30,400 INFO [celeborn-dispatcher-7] Master: Receive ReportNodeFailure [
Host: localhost
RpcPort: 41487
PushPort: 34259
FetchPort: 45713
ReplicatePort: 35107
InternalPort: 41487

24/08/20 05:29:29,414 WARN [celeborn-client-lifecycle-manager-change-partition-executor-3] WorkerStatusTracker: 
Reporting failed workers:
Host:localhost:RpcPort:42267:PushPort:43741:FetchPort:46483:ReplicatePort:43587   PUSH_DATA_TIMEOUT_PRIMARY   2024-08-19T22:29:29.414-0700
Current unknown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487   2024-08-19T22:29:29.108-0700
Current shutdown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487
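
In the excerpt above, the worker with RpcPort 41487 is listed under both "Current unknown workers" and "Current shutdown workers". A minimal sketch for spotting that overlap in a `unit-tests.log` excerpt follows. It assumes the section headers and `RpcPort:<n>` layout shown above; the helper names `rpc_ports_by_section` and `overlapping_ports` are hypothetical and are not part of Celeborn.

```python
import re

# Excerpt of the WorkerStatusTracker output quoted in this PR description.
LOG = """\
Reporting failed workers:
Host:localhost:RpcPort:42267:PushPort:43741:FetchPort:46483:ReplicatePort:43587   PUSH_DATA_TIMEOUT_PRIMARY   2024-08-19T22:29:29.414-0700
Current unknown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487   2024-08-19T22:29:29.108-0700
Current shutdown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487
"""

# Assumed header shape: "Reporting failed workers:" / "Current <status> workers:".
SECTION = re.compile(r"^(?:Reporting|Current) (\w+) workers:$")
RPC_PORT = re.compile(r"RpcPort:(\d+)")


def rpc_ports_by_section(log_text):
    """Group worker RPC ports by the section header they appear under."""
    sections = {}
    current = None
    for line in log_text.splitlines():
        header = SECTION.match(line.strip())
        if header:
            current = header.group(1)
            sections[current] = set()
            continue
        port = RPC_PORT.search(line)
        if current is not None and port:
            sections[current].add(int(port.group(1)))
    return sections


def overlapping_ports(log_text):
    """Return RPC ports listed as both unknown and shutdown."""
    sections = rpc_ports_by_section(log_text)
    return sections.get("unknown", set()) & sections.get("shutdown", set())
```

On this excerpt, `overlapping_ports(LOG)` returns `{41487}`, the worker whose status the test observed as WORKER_UNKNOWN.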

Does this PR introduce any user-facing change?

No

How was this patch tested?

GitHub Actions (GA)