Hey @hongkuancn, thanks for opening this issue. It looks like this send inside `maybeStopIdleWorker()` should be skipped if the context has been canceled already. E.g.:
```go
// maybeStopIdleWorker attempts to stop an idle worker by sending it a nil task
func (p *WorkerPool) maybeStopIdleWorker() bool {
	if decremented := p.decrementWorkerCount(); !decremented {
		return false
	}

	// If the pool context has been canceled the tasks channel could be closed, so sending a nil to it will panic
	select {
	case <-p.context.Done():
		return false
	default:
	}

	// Send a nil task to stop an idle worker
	p.tasks <- nil

	return true
}
```
What do you think? Feel free to open a pull request with these changes and I'll be glad to merge it.
Hey @alitto! Thanks for the suggestion. I noticed there might still be a synchronization issue after your change:
```
   stop                          purge
    │                              │
    │                              │ new select (context not canceled yet)
    │                              │
cancel context ───────────────────►│
    │                              │
close channel ────────────────────►│ task <- nil
    │                              │
    ▼                              ▼
```
I just looked back through the history and came up with an idea.
The issue was probably introduced by Pull Request #62. It aims to stop a pool with long-running tasks when the context is canceled. However, draining the tasks depends on the tasks channel being closed, and worker goroutines stay blocked on that channel until it is. So the PR moves `close(p.tasks)` before `p.workersWaitGroup.Wait()` in the `stop()` method. But this move can't guarantee that the purge goroutine has already stopped, which leads to the data race.
So I suggest moving `close(p.tasks)` back, and draining the tasks in a non-blocking way instead of relying on a closed tasks channel:
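```go
// drainTasks discards any tasks left in the channel without blocking
func drainTasks(tasks <-chan func(), tasksWaitGroup *sync.WaitGroup) {
	for {
		select {
		case task, ok := <-tasks:
			if !ok {
				return // channel closed: a receive would always be ready, so stop here
			}
			if task != nil {
				tasksWaitGroup.Done()
			}
		default:
			return // nothing left to drain
		}
	}
}
```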
So after the change, the `stop()` method has the following steps:
What do you think? Please let me know if I missed anything.
That makes a lot of sense, yes. I overlooked that change in https://github.com/alitto/pond/pull/62. As a general rule of thumb, a writable channel shared with N goroutines can only be closed after all of them have returned. I have merged both of your PRs and released them as part of v1.9.2 :rocket: Thank you for your contributions!
Hi!
I came across a data race with the test
It seems the issue is that `purge()` sends a nil task to the tasks channel to recycle idle workers, while `stop()` closes the tasks channel. The race detector reports this as described in the "Unsynchronized send and close operations" section of the Go Data Race Detector documentation.
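The recycling mechanism described above, where a nil task tells an idle worker to exit, can be illustrated with a small sketch. `retireIdleWorkers` and its parameters are hypothetical, not pond's API; it assumes an unbuffered channel and a single sending goroutine, which closes the channel only after its last send, so the send/close race from this report cannot occur.

```go
package main

import (
	"fmt"
	"sync"
)

// retireIdleWorkers (hypothetical) starts n workers that run tasks from an
// unbuffered channel, sends k nil tasks (k must be <= n) and returns how many
// workers exited because of a nil task.
func retireIdleWorkers(n, k int) int {
	tasks := make(chan func())
	var wg sync.WaitGroup
	var mu sync.Mutex
	retired := 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range tasks { // also ends once tasks is closed
				if task == nil {
					mu.Lock()
					retired++
					mu.Unlock()
					return // nil task: this worker is recycled
				}
				task()
			}
		}()
	}

	for i := 0; i < k; i++ {
		tasks <- nil // blocks until some still-running worker receives it
	}
	// The only sender closes the channel after its last send, so the send
	// and the close can never race with each other.
	close(tasks)
	wg.Wait()
	return retired
}

func main() {
	fmt.Println(retireIdleWorkers(3, 2)) // prints 2
}
```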