etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Close of already closed channel causes a panic #18853

Open szaydel opened 2 weeks ago

szaydel commented 2 weeks ago

Bug report criteria

What happened?

The organization I work for uses etcd as a building block of a service, which complicates this matter a bit. We observed a panic which, from the looks of it, is not caused by our service but rather originates in etcd. Unfortunately, while I have no problem sharing the stack trace, our code is not publicly available. We do not modify any of the etcd code, so I am confident the problem is not in modified code; it appears to exist in this repository. I did not observe this directly, but my understanding is that it happened when the service was asked to shut down, which aligns well with the code.

We build an operating system on the illumos kernel, which is not well known. It is not Linux, and I understand there should be no expectation of support. However, I think this issue can affect anyone, which is why I am raising it.

From the looks of the stack, the problem is really in how the channel closing is implemented. I am going to reference a more recent version of the file, because this code has remained the same for quite some time. Here: https://github.com/etcd-io/etcd/blob/995027f5c1363404e86f7a858ea2833df01f0954/server/embed/etcd.go#L458 we check that the channel is not nil, and if so we issue close(...). However, the channel is never set to nil, so once it has been closed the if e.errc != nil {...} check still evaluates to true and allows close(...) to happen again. It seems to me that, minimally, e.errc = nil should be added inside the block so that the check evaluates to true only once.
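For illustration, here is a minimal stand-alone sketch of the pattern described above and of the smallest fix; the Etcd type below is a stand-in, not the real embed.Etcd, which has many more fields and does much more work in Close.

```go
package main

// Etcd is a stand-in type used only to illustrate the error-channel handling
// discussed above; the real embed.Etcd is far larger.
type Etcd struct {
	errc chan error
}

// Close mirrors the shape of the error-channel handling in embed.(*Etcd).Close.
func (e *Etcd) Close() {
	// ... other shutdown work elided ...
	if e.errc != nil {
		close(e.errc)
		e.errc = nil // minimal fix: make the guard one-shot so a second
		// call to Close cannot reach close() again and panic
	}
}

func main() {
	e := &Etcd{errc: make(chan error, 1)}
	e.Close()
	e.Close() // without the e.errc = nil line this second call would panic
}
```

Note that a plain nil assignment only protects sequential double calls; two goroutines calling Close concurrently would still race on e.errc, which is one argument for the sync.Once approach discussed next.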

Perhaps e.errc is not set to nil intentionally. It could be a defensive measure to make sure that some consumer does not end up stuck receiving from a nil (and therefore permanently blocking) channel. I can understand this, though arguably a select with a default case should be used instead. That said, interestingly, a little further up there is this piece of code: https://github.com/etcd-io/etcd/blob/995027f5c1363404e86f7a858ea2833df01f0954/server/embed/etcd.go#L409, which is only ever called once, thanks to the sync.Once variable. I am wondering whether the function that is clearly meant to support cleanup and to be executed once is not a better place for this close. Since it is guaranteed to be called only once, there would be no need to reset the channel to nil.
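A hedged sketch of that alternative, again with a reduced stand-in type rather than the real embed.Etcd: the sync.Once-guarded cleanup owns the close, so it runs exactly once by construction and no nil reset or extra guard is needed.

```go
package main

import "sync"

// Etcd is a stand-in for embed.Etcd, reduced to the fields relevant here.
type Etcd struct {
	errc      chan error
	closeOnce sync.Once
}

// Close sketches moving the close into the once-only cleanup function, so a
// repeated Close (even a concurrent one) cannot close the channel twice.
func (e *Etcd) Close() {
	e.closeOnce.Do(func() {
		// ... existing once-only cleanup elided ...
		close(e.errc)
	})
	// ... remaining shutdown work elided ...
}

func main() {
	e := &Etcd{errc: make(chan error, 1)}
	e.Close()
	e.Close() // second call is a no-op for the close
}
```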

This feels like a good improvement opportunity that anyone, not just us, would benefit from. I am happy to put up a PR, but would first like to hear whether there is interest in making this change and, if so, whether we should just set the channel to nil or move the close into the function at line 409.

Here is the stack trace. My apologies, but I cannot share anything to do with go.racktop.io.

```
panic: close of closed channel

goroutine 1 [running]:
go.etcd.io/etcd/server/v3/embed.(*Etcd).Close(0xc000799800)
    go.etcd.io/etcd/server/v3@v3.5.9/embed/etcd.go:437 +0x7ea
go.racktop.io/bsr/confd/rtetcd.(*EtcdServer).Stop(0xc000598500)
    go.racktop.io/bsr/confd/rtetcd/server.go:508 +0xa6
go.racktop.io/bsr/racktop/service.(*Service).stop(0xc000683860, 0x16f8015?)
    go.racktop.io/bsr/racktop/service/Service.go:833 +0x26d
go.racktop.io/bsr/racktop/service.(*Service).run(0xc000683860, 0x0)
    go.racktop.io/bsr/racktop/service/Service.go:395 +0x4fc
go.racktop.io/bsr/racktop/service.(*Service).bootstrap(...)
    go.racktop.io/bsr/racktop/service/Service_unix.go:21
go.racktop.io/bsr/racktop/service.(*Service).Run(0xc000683860)
    go.racktop.io/bsr/racktop/service/Service.go:299 +0xe5
main.main()
    go.racktop.io/bsr/confd/svc/confd/AppMain.go:107 +0x29
```

What did you expect to happen?

Did not expect to see this panic. Expected orderly shutdown.

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately, this does not seem to be reliably reproducible; I could not find a way to trigger it on demand.

Anything else we need to know?

No response

Etcd version (please run commands below)

```console
$ etcd --version
# paste output here
$ etcdctl version
# paste output here
```

Etcd configuration (command line flags or environment variables)

# paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints= endpoint status -w table
# paste output here
```

Relevant log output

No response

ahrtr commented 2 weeks ago

Do you know how exactly the panic was triggered? It would be great if you could create an e2e or integration test to reproduce this. It looks like the Close method was executed twice:

https://github.com/etcd-io/etcd/blob/b12540977a9cceb6ad37b6ffafa262fb667bd805/server/embed/etcd.go#L390

> which will only be called once, thanks to the sync.Once variable.

I agree that using sync.Once is a simple fix to ensure it is only executed once.
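Not an e2e or integration test from the repository, but a minimal stand-alone sketch of how the double-Close path could be exercised through the public embed API; the temporary data directory and the default listeners are assumptions.

```go
package main

import (
	"log"
	"os"

	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	cfg := embed.NewConfig()

	// Scratch data directory (an assumption; any writable path works).
	dir, err := os.MkdirTemp("", "embed-double-close")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)
	cfg.Dir = dir

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	<-e.Server.ReadyNotify()

	e.Close() // orderly shutdown
	e.Close() // with the current code this second call reaches close(e.errc)
	// again and panics with "close of closed channel"
}
```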

szaydel commented 2 weeks ago

@ahrtr, thank you for responding quickly. I agree, it seems the only way this could happen is if the Close() method was called more than once. I do not know how that happened and have been trying to figure it out myself. Let me take a closer look at the code lower in the stack and see if I can come up with anything. What makes this all the more difficult is that the issue occurred on a system outside of my purview; I only have access to its logs. But I am going to see what, if anything, I can piece together.

szaydel commented 1 week ago

@ahrtr, I had a chat with my colleagues and they believe there is a chance that our internal service framework could have executed its own stop() method more than once, which would end up calling Close() more than once and could put us into this situation. That issue has recently been resolved, and there is now a guarantee that stop cannot happen more than once, which I believe means that Close() cannot be called more than once. I do not have a good way of reproducing this issue, and I am not sure it makes sense to keep it open unless we want to make any changes to the code.
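For context, a generic sketch of the kind of caller-side guard described above; the type and method names mirror the stack trace, but the package name, fields, and implementation are hypothetical.

```go
package rtetcd // hypothetical package; names below only mirror the stack trace

import (
	"sync"

	"go.etcd.io/etcd/server/v3/embed"
)

// EtcdServer wraps an embedded etcd instance. This is a sketch of a
// caller-side guard, not the actual service code discussed in the thread.
type EtcdServer struct {
	e        *embed.Etcd
	stopOnce sync.Once
}

// Stop shuts the embedded etcd down at most once, so embed.(*Etcd).Close can
// no longer be reached twice through this path, even if Stop is called again.
func (s *EtcdServer) Stop() {
	s.stopOnce.Do(func() {
		s.e.Close()
	})
}
```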

It is possible that this will happen again, but if the issue was indeed in our service code and it has been fixed, we may never see it again once that code is in production. Waiting for something that may never happen again is probably not a good strategy.

I would still recommend moving the channel closure into the function invoked by e.closeOnce.Do(...). I am happy to do that work and put up a PR.

ahrtr commented 5 days ago

> I would still recommend moving the channel closure into the function invoked by e.closeOnce.Do(...). I am happy to do that work and put up a PR.

Please feel free to deliver a PR and we will evaluate and review the change/impact. Thanks.

szaydel commented 5 days ago

That sounds good to me, @ahrtr. Will do.