Open szaydel opened 2 weeks ago
Do you know how exactly the panicking was triggered? It would be great if you could create an e2e or integration test to reproduce this. It looks like the Close
method was executed twice
which will only be called once, thanks to the
sync.Once
variable.
I agree using sync.Once
is a simple fix to ensure it is only executed once.
@ahrtr, thank you for responding quickly. I agree, it seems the only way this could happen is if the Close()
method was called more than once. I do not know how this happened and have been trying to figure this out myself. Let me have a closer look at the code that is lower in the stack and see if I can come up with anything. What makes this all the more difficult is that this issue occurred on a system that is outside of my purview. I only have access to its logs. But I am going to see what if anything I can piece together.
@ahrtr, I had a chat with my colleagues and they believe there is a chance that our internal service framework could have executed its own stop()
method more than once, which would end up calling Close()
more than once, and could put us into this situation. This issue has been resolved recently and there is now a guarantee that stop cannot happen more than once. Which I believe means that Close()
cannot be called more than once. I do not have a good way of reproducing this issue. I am not sure it makes sense to keep this open unless we want to make any changes to the code.
It is possible that this will happen again, but if the issue was indeed in our service code and it has been fixed, once that code is in production we may never see this again. And waiting for something that may never happen again is probably not a good strategy.
I would still recommend moving the channel closure into the function invoked by e.closeOnce.Do(...)
. I am happy to do that work and put up a PR.
I would still recommend moving the channel closure into the function invoked by
e.closeOnce.Do(...)
. I am happy to do that work and put up a PR.
Please feel free to deliver a PR and we will evaluate & review the change/impact. Thanks
That sounds good to me, @ahrtr. Will do.
Bug report criteria
What happened?
Organization I work for leverages etcd as a building block of a service, which complicates this matter a bit. There was a panic observed which from the looks of it does not appear to be caused by our service, rather it seems to be in etcd. Unfortunately, while I have no problem sharing the stack trace, the code is not publicly available. We do not modify any of the etcd code, thus I am confident that the problem is not with modified code, and from the looks of it appears to exist in this repository. I did not observe this directly, but my understanding is that this was observed when the service was asked to shutdown. This description seems to align well with the code.
We build an operating system on the illumos kernel, which is not well known. It is not Linux, and I understand that there should be no expectation of support. However, I think this issue can affect anyone, thus I am raising it.
From the looks of the stack the problem is really in how the channel closing is implemented. I am going to reference more recent version of the file, because this code remained the same for quite some time. We can see here: https://github.com/etcd-io/etcd/blob/995027f5c1363404e86f7a858ea2833df01f0954/server/embed/etcd.go#L458 that we check if the channel is not
nil
, in which case we issueclose(...)
. However, the channel is never set tonil
, which means it is closed, but theif e.errc != nil {...}
check will still evaluate totrue
and would then allow for theclose(...)
to happen again. It seems to me that minimallye.errc = nil
should be added inside of the block so that the check evaluated totrue
once only.Perhaps
e.errc
is not set tonil
intentionally. It could be that this is done as a defensive measure to make sure that some consumer does not end-up stuck on a now blocked channel. I can understand this, though debatablyselect
with a default case should be used. That being said, interestingly, a little up above there is this piece of code: https://github.com/etcd-io/etcd/blob/995027f5c1363404e86f7a858ea2833df01f0954/server/embed/etcd.go#L409, which will only be called once, thanks to thesync.Once
variable. I am wondering if the function which is clearly meant to support cleanup and be executed once isn't a better place for this code. Since this is guaranteed to be called once only, there will be no need to reset the channel to anil
value.It feels to me like this is a good improvement opportunity that anyone, not just us, would benefit from. I am happy to put up a PR, but would like to first hear whether there is interest in making this change and if so, whether we should just set the channel to
nil
, or consider moving into the function at line 409.Here's the stack. My apologies as I cannot share anything to do with
go.racktop.io
.What did you expect to happen?
Did not expect to see this panic. Expected orderly shutdown.
How can we reproduce it (as minimally and precisely as possible)?
Unfortunately, this does not seem to be reliably reproducible. I could not find a way of triggering this in a reproducible way.
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response