apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0
862 stars 351 forks source link

[CELEBORN-1570] Fix flaky test - monitor non-critical error metrics in DeviceMonitorSuite #2694

Closed cxzl25 closed 3 weeks ago

cxzl25 commented 3 weeks ago

What changes were proposed in this pull request?

Why are the changes needed?

https://github.com/apache/celeborn/actions/runs/10441850633/job/28913478820?pr=2692

- monitor non-critical error metrics *** FAILED ***
  java.lang.AssertionError: expected:<2> but was:<1>
  at org.junit.Assert.fail(Assert.java:89)
  at org.junit.Assert.failNotEquals(Assert.java:835)
  at org.junit.Assert.assertEquals(Assert.java:120)
  at org.junit.Assert.assertEquals(Assert.java:146)
  at org.apache.celeborn.service.deploy.worker.storage.DeviceMonitorSuite.$anonfun$new$22(DeviceMonitorSuite.scala:405)

Because calling System.currentTimeMillis() twice may get the same value. Gauge uses the size of the set, and the same value will be deduplicated.

https://github.com/apache/celeborn/blob/0ee3c3a4bdef3b97fca6be04f03ac36c2e12e1a6/worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/ObservedDevice.scala#L81-L91

Does this PR introduce any user-facing change?

No

How was this patch tested?

GA

FMX commented 3 weeks ago

The method System.currentTimeMillis might return the same value for two invokes. I think changing this UT is not the correct approach to fix this problem. Changing the definition of nonCriticalErrors to List will be better.