cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0
5.48k stars 801 forks source link

Unit tests leak goroutines #3962

Open stevesg opened 3 years ago

stevesg commented 3 years ago

Spent the morning debugging what I thought was a goroutine leak, but turned out to be just a huge number of goroutines being created in a short period of time. However, it seems a lot of the unit tests do leak goroutines. Most of these might just be issues with tests not cleaning up properly, but it makes testing/debugging real leaks quite difficult. Or in my case, the leaks were a red herring - the issue was something else.

Given we already have goleak vendored in presumably from Prometheus (they use it), it would be nice to start using it.

We leak ~2000 goroutines per full run of the unit tests. Summary below, full data here: unit-test-leaks.txt

top 5 leakers by package

928 pkg/distributor
224 pkg/ingester
147 pkg/chunk/storage
122 pkg/compactor
81  pkg/querier

top 5 leakers by func

1153    github.com/cortexproject/cortex/pkg/ring/kv/consul.(*mockKV).loop
109 internal/poll.runtime_pollWait
68  runtime.goparkunlock
65  github.com/prometheus/prometheus/tsdb/wal.(*WAL).run
62  google.golang.org/grpc.(*addrConn).resetTransport

(Changes used are here - https://github.com/stevesg/cortex/commit/b0a95a22cf06f441aa4e4f830df63d14defe83db )

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

pracucci commented 3 years ago

Still valid

aksharma53 commented 3 years ago

@stevesg I wants to work on this issue could you please tell how I can reproduce this issue and which file are associated to fix this issue ?

bboreham commented 3 years ago

@Alphasaurs thanks for volunteering!
If you take the latest code and cherry-pick Steve's commit here: https://github.com/stevesg/cortex/commit/b0a95a22cf06f441aa4e4f830df63d14defe83db then go test pkg/distributor should show part of the problem.

Let us know how you get on.

aksharma53 commented 3 years ago

@bboreham I am new to goroutine leak and I found the leaks by the instruction provided by you but still I don't know how can I solve them I am sharing docs file could you please comment on it for the few codes so what has to be done to solve this leaks and how are they are leaking it will be helpful https://docs.google.com/document/d/17nID6r_71yyleHSVk6tw3pP55aqIYDDUQKrEOnj0A1M/edit?usp=sharing

bboreham commented 3 years ago

Broadly the pattern is that you have to find what is starting the goroutines, find out how they are supposed to be stopped, and add or fix the code in the tests that will stop them.

You should really focus on the cases like Steve noted where hundreds of goroutines are left running; your Google Doc seems to list single ones which we can get to later.

However I can use one example from your doc:

for {
       select {
       case <-ticker.C:
           err := om.loadConfig()
           if err != nil {
               // Log but don't stop on error - we don't want to halt all ingesters because of a typo
               level.Error(util_log.Logger).Log("msg", "failed to load config", "err", err)
           }
       case <-ctx.Done():
           return nil
       }
   }

The for will loop forever, which is expected to be stopped by reading from ctx.Done(). This means some other code should call cancel on that context at the end of the test.

Often we do this in "cleanup" routines added to a test, like this: https://github.com/cortexproject/cortex/blob/34bb142b129bd781539001fb1629cf2f3ed54b82/pkg/storage/tsdb/bucketindex/loader_test.go#L42-L44

I looked into the biggest number, in distributor_tests.go, and proposed #4506.

ogm256 commented 2 years ago

Hi all,

based on @stevesg 's suggestion I've created a pull request (#4577) to integrate goleak.VerifyTestMain() into every test. Bu I made this optional. This let us test for unit tests only if desired and it won't break any other "normal" unit tests.

First I tried to make the optional run of goleak.VerifyTestMain() selectable by a command line switch or a flag. This was all a bit confusing and may break some future extensions. So I decided to make is selectable by an environment variable called CORTEX_TEST_GOLEAK. If that variable has been set to 1 goleak.VerifyTestMain() will run while testing. If it hasn't been set it won't run and all will work as before.

Regards, Oliver