sedooe closed this issue 5 years ago
I was unable to reproduce it at all. The line that panicked is here: https://github.com/cilium/cilium/blob/v1.3.2/pkg/endpoint/metrics.go#L123
policyStatus[epPolicyStatus]++
I suspect that GC occurred between lines 122 and 123 and caused the byte slice underlying epPolicyStatus to be freed (since sync.Map stores the data in unsafe.Pointer). Strings created from unsafe pointers use the same underlying byte slice (proof here: https://play.golang.org/p/glZDZ9srpSM).
The proposed fix would be to change endpoint.endpointPolicyStatusMap to be a normal map with locks instead of sync.Map.
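A minimal sketch of what a "normal map with locks" could look like, for illustration only. PolicyEnabled, policyStatusMap, and the methods below are hypothetical stand-ins, not the actual Cilium types or the fix that was eventually merged; the uint16 endpoint ID key is also an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// PolicyEnabled is a hypothetical stand-in for models.EndpointPolicyEnabled.
type PolicyEnabled string

const (
	PolicyNone    PolicyEnabled = "none"
	PolicyIngress PolicyEnabled = "ingress"
	PolicyEgress  PolicyEnabled = "egress"
	PolicyBoth    PolicyEnabled = "both"
)

// policyStatusMap guards a plain map with a RWMutex instead of using sync.Map.
// Values are stored with their concrete type, so no interface{} or
// unsafe.Pointer indirection is involved when reading them back.
type policyStatusMap struct {
	mu sync.RWMutex
	m  map[uint16]PolicyEnabled // endpoint ID -> policy status (assumed key type)
}

func newPolicyStatusMap() *policyStatusMap {
	return &policyStatusMap{m: make(map[uint16]PolicyEnabled)}
}

// Update records the policy status of one endpoint.
func (p *policyStatusMap) Update(id uint16, s PolicyEnabled) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.m[id] = s
}

// Remove forgets an endpoint.
func (p *policyStatusMap) Remove(id uint16) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.m, id)
}

// Counts returns how many endpoints are currently in each status.
func (p *policyStatusMap) Counts() map[PolicyEnabled]float64 {
	p.mu.RLock()
	defer p.mu.RUnlock()
	counts := make(map[PolicyEnabled]float64)
	for _, s := range p.m {
		counts[s]++
	}
	return counts
}

func main() {
	m := newPolicyStatusMap()

	// Concurrent writers are safe because every access takes the lock.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(id uint16) {
			defer wg.Done()
			m.Update(id, PolicyBoth)
		}(uint16(i))
	}
	wg.Wait()

	m.Remove(0)
	fmt.Println(m.Counts()[PolicyBoth]) // prints 99
}
```

The trade-off is that readers and writers now contend on one lock, which sync.Map tries to avoid for read-heavy workloads; for a map this small that is unlikely to matter.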
@nebril that makes total sense.
A Google search returns this result :D https://github.com/cilium/cilium/pull/5621#issuecomment-423976222
Also, it seems the map doesn't need to be initialized:
 func (epPolicyMaps *endpointPolicyStatusMap) UpdateMetrics() {
-	policyStatus := map[models.EndpointPolicyEnabled]float64{
-		models.EndpointPolicyEnabledNone:    0,
-		models.EndpointPolicyEnabledEgress:  0,
-		models.EndpointPolicyEnabledIngress: 0,
-		models.EndpointPolicyEnabledBoth:    0,
-	}
+	policyStatus := map[models.EndpointPolicyEnabled]float64{}
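A small sketch of why the explicit zero entries are not needed for the increment itself: indexing a missing key in a non-nil Go map yields the zero value, so `counts[key]++` works on an empty map. PolicyEnabled and countStatuses are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// PolicyEnabled is a hypothetical stand-in for models.EndpointPolicyEnabled.
type PolicyEnabled string

// countStatuses tallies statuses into a map that starts out empty.
func countStatuses(statuses []PolicyEnabled) map[PolicyEnabled]float64 {
	counts := map[PolicyEnabled]float64{}
	for _, s := range statuses {
		counts[s]++ // a missing key reads as 0, so this stores 1
	}
	return counts
}

func main() {
	counts := countStatuses([]PolicyEnabled{"both", "none", "both"})
	// "ingress" was never seen: it reads as 0 but is not stored in the map.
	fmt.Println(counts["both"], counts["none"], counts["ingress"], len(counts))
	// prints: 2 1 0 2
}
```

One difference to keep in mind: keys that are never incremented are absent from the map, so a later loop over it would skip them. If the code that follows ranges over policyStatus to set gauges, statuses with a count of zero would no longer be reset unless they are handled separately.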
@eloycoto it seems that on update we call endpointPolicyStatus.UpdateMetrics() but not on delete. Is this a bug?
@sedooe The stack trace corresponds to the v1.3.0 code base. Be aware that you might be running Cilium v1.3.0 in your cluster, not v1.3.2, without realizing it.
@aanm You're right, it was my bad to write v1.3.2. We downgraded to v1.3.0 a couple of days ago. Will edit it to not confuse others.
Bug report
General Information
Cilium 1.3.0
Linux gentle-mole 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes 1.12.2 with containerd 1.2.0
Here are the related logs:
After this, the Cilium pod restarted and network timeouts started on this node. Our guess is that the network disruption happened because we didn't mount the BPF FS, but we can't verify that.
We were not able to reproduce the issue.