Open cxzl25 opened 3 weeks ago
What changes were proposed in this pull request?
Why are the changes needed?
When the worker port is already in use, the worker status tracked by the driver can change from shutdown to unknown, which causes the test to fail.
https://github.com/apache/celeborn/actions/runs/10465286274/job/28980278764
- celeborn spark integration test - pushdata timeout will add to pushExcludedWorkers *** FAILED ***
  WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_PRIMARY, and WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_REPLICA (PushDataTimeoutTest.scala:150)
unit-tests.log
24/08/20 05:28:30,400 INFO [celeborn-dispatcher-7] Master: Receive ReportNodeFailure [
Host: localhost RpcPort: 41487 PushPort: 34259 FetchPort: 45713 ReplicatePort: 35107 InternalPort: 41487
24/08/20 05:29:29,414 WARN [celeborn-client-lifecycle-manager-change-partition-executor-3] WorkerStatusTracker: Reporting failed workers:
Host:localhost:RpcPort:42267:PushPort:43741:FetchPort:46483:ReplicatePort:43587 PUSH_DATA_TIMEOUT_PRIMARY 2024-08-19T22:29:29.414-0700
Current unknown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487 2024-08-19T22:29:29.108-0700
Current shutdown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487
Does this PR introduce any user-facing change?
No
How was this patch tested?
GA