apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

[Flaky Test] PinotTaskManagerStatelessTest.testPinotTaskManagerSchedulerWithUpdate() #8776

Closed: Jackie-Jiang closed this issue 2 years ago

Jackie-Jiang commented 2 years ago

Failures:

PinotTaskManagerStatelessTest.testPinotTaskManagerSchedulerWithUpdate:106->validateJob:245 expected [0 */20 * ? * * *] but found [0 */10 * ? * * *]

Example run: https://github.com/apache/pinot/runs/6597477355?check_suite_focus=true

gortiz commented 2 years ago

After some time thinking about this test and understanding how it works, I think the test was designed with this scenario in mind:


sequenceDiagram
    Test->>+Controller: update table config (0 */20 * ? * * *)
    Controller->>+Helix: update table config (0 */20 * ? * * *)
    Helix->>Helix: Change ideal state to (0 */20 * ? * * *)
    Helix->>-Controller: Ok
    Controller->>-Test: Ok

    Test->>+Controller: get job info
    Controller->>+Helix: get job info
    Helix->>-Controller: updated job info (0 */20 * ? * * *)
    Controller->>-Test: updated job info (0 */20 * ? * * *)

But I think Helix does not guarantee that sequence, and sometimes, due to the lack of resources in GHA, we may see this scenario instead:


sequenceDiagram
    Test->>+Controller: update table config (0 */20 * ? * * *)
    Controller->>+Helix: update table config (0 */20 * ? * * *)
    Helix->>Controller: Ok
    Controller->>-Test: Ok

    Test->>+Controller: get job info
    Controller->>+Helix: get job info
    Helix->>-Controller: updated job info (0 */10 * ? * * *)

    Helix->>-Helix: Change ideal state (0 */20 * ? * * *)

    Controller->>-Test: updated job info (0 */10 * ? * * *)

If that is the case, a palliative solution is to retry the validation until it passes or a timeout expires.
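
A minimal sketch of that retry idea, not the actual fix applied to the test. It assumes the job's cron expression can be re-read through a supplier. The `RetryValidation` class, the `awaitValue` helper, and the simulated delayed update are all hypothetical names made up for illustration; nothing here uses Pinot or Helix APIs.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical helper, not part of Pinot: polls a value until it matches or a timeout expires.
final class RetryValidation {

  // Re-reads `actual` every `intervalMs` until it equals `expected` or `timeoutMs` has elapsed.
  // Returns the last observed value so the caller can run its usual assertion on it.
  static <T> T awaitValue(Supplier<T> actual, T expected, long timeoutMs, long intervalMs)
      throws InterruptedException {
    long start = System.nanoTime();
    long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    T last = actual.get();
    while (!expected.equals(last) && System.nanoTime() - start < timeoutNanos) {
      Thread.sleep(intervalMs);
      last = actual.get();
    }
    return last;
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for the job info: it still holds the old cron expression at first.
    AtomicReference<String> cron = new AtomicReference<>("0 */10 * ? * * *");

    // Simulates the update becoming visible only after the "Ok" was already returned.
    Thread updater = new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      cron.set("0 */20 * ? * * *");
    });
    updater.start();

    String observed = awaitValue(cron::get, "0 */20 * ? * * *", 5_000, 50);
    updater.join();
    if (!"0 */20 * ? * * *".equals(observed)) {
      throw new AssertionError("expected [0 */20 * ? * * *] but found [" + observed + "]");
    }
  }
}
```

Returning the last observed value instead of a boolean keeps the failure message informative: if the timeout expires, the assertion still reports which cron expression was actually seen.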