Fix race condition in parallel tests.

kl52752 commented 4 months ago

Parallel executor tests checks that when timeout occurs parallel executor will exit even if long lasting operation scheduled in separate go routine is running. This go routine might finish after the whole test is finished and this is expected behaviour. In this go routine we should not use t.Log which will be invoked at the end of the action because the test might already exit and this can cause panic. https://screenshot.googleplex.com/3tRWo69XkSfvacf

kl52752 commented 4 months ago

/cc @AwesomePatrol /assign @bowei

google-oss-prow[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AwesomePatrol, kl52752 Once this PR has been reviewed and has the lgtm label, please ask for approval from bowei. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/GoogleCloudPlatform/k8s-cloud-provider/blob/master/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

bowei commented 4 months ago

Did you upload the whole PR? There is only one change (delete Log line)

kl52752 commented 4 months ago

Did you upload the whole PR? There is only one change (delete Log line)

yes, the problem is that this one log line in some test is called after test return and we believe that this is causing panic. https://screenshot.googleplex.com/3tRWo69XkSfvacf. Calling t.Log before sleep is safe because we know that this line will always be called during test execution. But sleep is not interrupted by context cancel and when in test we expect that executor will exit after timeout which is less than 5sec we abandon the go routine and it can finish after whole test exits. This race happens once in prow we didn't reproduced it locally.

bowei commented 4 months ago

The construction of the test is a bit concerning as it depends very much on timing. Is there a way to redo as much of it as possible in terms of happens before relationships?

something like this:

package exec

import (
    "fmt"
    "sync"
)

func NewSync() *Sync { return &Sync{states: map[string]*state{}} }

type Sync struct {
    lock   sync.Mutex
    states map[string]*state
}

type state struct {
    done    bool
    waiters []*waiter
}

type waiter struct {
    ch chan struct{}
}

// Waiters

func (s *Sync) Wait(key string) {
    s.lock.Lock()

    st, ok := s.states[key]
    if ok && st.done {
        s.lock.Unlock()
        return
    }

    if !ok {
        st := &state{}
        s.states[key] = st
    }

    // Add to the waiters and block.
    w := &waiter{make(chan struct{})}
    st.waiters = append(st.waiters, w)

    s.lock.Unlock()
    <-w.ch
}

func (s *Sync) Signal(key string) {
    s.lock.Lock()

    st, ok := s.states[key]
    if !ok {
        s.states[key] = &state{done: true}
        s.lock.Unlock()
        return
    }

    st.done = true
    for _, w := range st.waiters {
        close(w.ch)
    }
    s.lock.Unlock()
}

func example() {
    s := NewSync()

    go func() {
        s.Wait("EVENT")
        fmt.Println("Happens after EVENT")
    }()
    go func() {
        fmt.Println("EVENT signaled")
        s.Signal("EVENT")
    }()
}

kl52752 commented 4 months ago

@bowei this test checks how executor behaves when timeout occurs so by definition this test depends on timing. I think that your proposal is over complicated for this simple test (we have only 1 action that needs to be time consuming) and it does not remove time dependency completely because still we need to control when task will be finished. I don't think that spinning another go routines with channel sin test is better solutions than using sleep here.

I added struct that checks that this (one) go routine has finished before test exits so we should not hit the issue when we trie to log after test exits.

google-oss-prow[bot] commented 4 months ago

New changes are detected. LGTM label has been removed.

bowei commented 4 months ago

I'm recommending the above code as a utility that can be used in more than one test.

The example() is just an example of how it would be used.

I recommend any of these test related sync utilities be pulled out into a utilities package rather than rewritten for every use case.

kl52752 commented 4 months ago

/retest

kl52752 commented 4 months ago

/retest

kl52752 commented 4 months ago

/retest

kl52752 commented 4 months ago

/hold sth is not working 🤔

kl52752 commented 4 months ago

/unhold

kl52752 commented 4 months ago

@bowei I fix the test with reccomended approach, PTAL

bowei commented 4 months ago

Please take a look at the comments...

kl52752 commented 4 months ago

deprecated with https://github.com/GoogleCloudPlatform/k8s-cloud-provider/pull/209

GoogleCloudPlatform / k8s-cloud-provider

Fix race condition in parallel tests. #203