alitto / pond

🔘 Minimalistic and High-performance goroutine worker pool written in Go
MIT License
1.43k stars 60 forks source link

Deadlock on zero minimum workers #33

Closed sermojohn closed 1 year ago

sermojohn commented 1 year ago

I experienced a deadlock when using pond in the following way (https://go.dev/play/p/eJLX1vc3C81).

workerpool := pond.New(10, 10, pond.Strategy(pond.Eager()))

batchWg := &sync.WaitGroup{}
batch := []string{"message1", "message2"}

for _, _ = range batch {
    batchWg.Add(1)
    workerpool.Submit(func() {
        batchWg.Done()
    })
}

batchWg.Wait()

Minimum workers defaults to 0 and purge can stop idle workers IdleWorkers() > 0. At the same time, workerpool.Submit can add a task but not start a worker because IdleWorkers() > 0. If the purger managed to signal the worker to stop before the newly submitted job is consumed by the worker, the workerpool ends up with a non-empty task channel and no workers to process the tasks.

Possible solutions:

  1. Increase minimum workers >= 1. This way there will always be an available worker to process pending tasks.
  2. Improve synchronisation between purge and maybeStartWorker to avoid such cases.
alitto commented 1 year ago

Hey @sermojohn! I think you are onto something here, this purge function presented the issue you describe in earlier versions of the lib, but after running your snippet I wasn't able to reproduce the deadlock. I'll take a closer look over the weekend, but in the meantime, if you could help me come up with a sample that enters a deadlock in every run that would greatly help me debug this issue. I made some modifications to your snippet to log how many tasks are pending after the program completes, how many idle/running workers are there before submitting another task, and also reduced the IdleTimeout to 1 microsecond to ensure the purge function runs more often (to force the deadlock). Here's the modified version: https://go.dev/play/p/hw1iGibUVDb

Thanks for opening this issue! 🙂

alitto commented 1 year ago

Hey @sermojohn,

I was able to reproduce this issue within a unit test and found a solution for this scenario by ensuring maybeStartWorker is synchronised with purge, as you pointed out.
This is the PR with the fix: https://github.com/alitto/pond/pull/34 and I just released version v1.8.1, which includes it. Please check it out if you have some time and let me know how that goes :slightly_smiling_face:

Thanks!

sermojohn commented 1 year ago

@alitto, thank you for your prompt fix! This library is proved to be the right tool for the job. Feel free to close this issue, it looks fixed from my perspective.

I hope to manage to contribute back to this library in the future!