cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

runtime error: invalid memory address or nil pointer dereference #7080

Closed sedooe closed 5 years ago

sedooe commented 5 years ago

Bug report

General Information

Here are the related logs:

2019-02-14T03:08:09.545736912Z level=info msg="Completed endpoint regeneration" bpfCompilation=0s buildDuration=0s containerID=dc68f922ce datapathPolicyRevision=947 desiredPolicyRevision=947 endpointID=32701 ipv4= ipv6= k8sPodName=evr/evr-database-2 mapSync=0s policyCalculation=0s prepareBuild=0s proxyConfiguration=0s proxyPolicyCalculation=0s proxyWaitForAck=0s reason="one or more identities created or deleted" subsys=endpoint waitingForCTClean=0s waitingForLock=476ns
2019-02-14T03:08:09.545768119Z level=info msg="Completed endpoint regeneration" bpfCompilation=0s buildDuration="92.254µs" containerID=cca1cba769 datapathPolicyRevision=947 desiredPolicyRevision=947 endpointID=52446 ipv4= ipv6= k8sPodName=analytics/zk-0 mapSync=0s policyCalculation=0s prepareBuild=0s proxyConfiguration=0s proxyPolicyCalculation=0s proxyWaitForAck=0s reason="one or more identities created or deleted" subsys=endpoint waitingForCTClean=0s waitingForLock=476ns
2019-02-14T03:08:09.545828527Z level=info msg="Completed endpoint regeneration" bpfCompilation=0s buildDuration="92.254µs" containerID=0baa548560 datapathPolicyRevision=947 desiredPolicyRevision=947 endpointID=49279 ipv4= ipv6= k8sPodName=gis-backend/gis-database-1 mapSync=0s policyCalculation=0s prepareBuild=0s proxyConfiguration=0s proxyPolicyCalculation=0s proxyWaitForAck=0s reason="one or more identities created or deleted" subsys=endpoint waitingForCTClean=0s waitingForLock=476ns
2019-02-14T03:08:09.549022051Z panic: runtime error: invalid memory address or nil pointer dereference
2019-02-14T03:08:09.549038414Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xfcda97]
2019-02-14T03:08:09.549043272Z
2019-02-14T03:08:09.549047696Z goroutine 11280099 [running]:
2019-02-14T03:08:09.54905928Z github.com/cilium/cilium/pkg/endpoint.(*endpointPolicyStatusMap).UpdateMetrics.func1(0x2612540, 0xc427d80750, 0x26e05a0, 0xc42446b1f0, 0x1)
2019-02-14T03:08:09.549064568Z     /go/src/github.com/cilium/cilium/pkg/endpoint/metrics.go:123 +0x74
2019-02-14T03:08:09.549068116Z sync.(*Map).Range(0xc420386510, 0xc429241640)
2019-02-14T03:08:09.54907276Z     /usr/local/go/src/sync/map.go:337 +0x136
2019-02-14T03:08:09.549080988Z github.com/cilium/cilium/pkg/endpoint.(*endpointPolicyStatusMap).UpdateMetrics(0xc420386510)
2019-02-14T03:08:09.549096356Z     /go/src/github.com/cilium/cilium/pkg/endpoint/metrics.go:121 +0x19b
2019-02-14T03:08:09.549103776Z github.com/cilium/cilium/pkg/endpoint.(*endpointPolicyStatusMap).Update(0xc420386510, 0x8920, 0x0, 0x4)
2019-02-14T03:08:09.549107887Z     /go/src/github.com/cilium/cilium/pkg/endpoint/metrics.go:103 +0xd4
2019-02-14T03:08:09.549112069Z github.com/cilium/cilium/pkg/endpoint.(*regenerationStatistics).SendMetrics(0xc421b161d8)
2019-02-14T03:08:09.549139495Z     /go/src/github.com/cilium/cilium/pkg/endpoint/metrics.go:50 +0x65
2019-02-14T03:08:09.549157732Z github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate.func1(0xc429241f28, 0xc421b161d8, 0xc4289e8900, 0xc421b161c0)
2019-02-14T03:08:09.549176358Z     /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:665 +0x105
2019-02-14T03:08:09.549181426Z github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0xc4289e8900, 0x2bf9020, 0xc42210a000, 0xc421b161c0, 0x0, 0x0)
2019-02-14T03:08:09.549186172Z     /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:806 +0xb53
2019-02-14T03:08:09.549191568Z github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Regenerate.func1(0xc421b161c0, 0x2bf9020, 0xc42210a000, 0xc42406bf00, 0xc4289e8900)
2019-02-14T03:08:09.549210804Z     /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:842 +0x1fc
2019-02-14T03:08:09.549216132Z created by github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Regenerate
2019-02-14T03:08:09.549221144Z     /go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:820 +0x123
2019-02-14T03:08:09.575528727Z level=fatal msg="Agent pipe unexpectedly closed, shutting down" subsys=cilium-node-monitor

After this, the Cilium pod restarted and some network timeouts started on this node. Our guess is that the network disruption happened because we didn't mount the BPF filesystem, but we can't verify that.

We were not able to reproduce the issue.

nebril commented 5 years ago

I was unable to reproduce it at all. The line that panicked is here: https://github.com/cilium/cilium/blob/v1.3.2/pkg/endpoint/metrics.go#L123

policyStatus[epPolicyStatus]++

I suspect that between lines 122 and 123 a GC cycle occurred and caused the byte slice underlying epPolicyStatus to be freed (since sync.Map stores its data behind an unsafe.Pointer). Strings created from unsafe pointers share the same underlying byte slice (proof here: https://play.golang.org/p/glZDZ9srpSM ).
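
For a concrete illustration, here is a minimal, self-contained variant of that playground snippet (buf and s are our names, not anything from the Cilium code). The unsafe cast makes the string alias the slice's backing array instead of copying it, so a write to the slice shows through the "immutable" string:

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	buf := []byte("ingress")

	// A string header and a slice header share their first two words
	// (data pointer and length), so this cast makes s alias buf's
	// backing array rather than copying the bytes.
	s := *(*string)(unsafe.Pointer(&buf))
	fmt.Println(s) // ingress

	// Mutating the slice is visible through the string, proving the
	// two values share the same underlying memory.
	buf[0] = 'X'
	fmt.Println(s) // Xngress
}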

The proposed fix would be to change endpoint.endpointPolicyStatusMap to a plain map guarded by a lock instead of a sync.Map.
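
A minimal sketch of that shape, assuming a mutex-guarded map keyed by endpoint ID (all names and types here are illustrative, not the eventual patch):

package endpoint

import "sync"

// EndpointPolicyEnabled stands in for models.EndpointPolicyEnabled
// in this sketch.
type EndpointPolicyEnabled string

// endpointPolicyStatusMap as a plain map behind a mutex instead of a
// sync.Map, so values never pass through an unsafe.Pointer.
type endpointPolicyStatusMap struct {
	mu sync.Mutex
	m  map[uint16]EndpointPolicyEnabled // keyed by endpoint ID
}

func newEndpointPolicyStatusMap() *endpointPolicyStatusMap {
	return &endpointPolicyStatusMap{m: make(map[uint16]EndpointPolicyEnabled)}
}

// Update records the policy status of one endpoint.
func (p *endpointPolicyStatusMap) Update(id uint16, status EndpointPolicyEnabled) {
	p.mu.Lock()
	p.m[id] = status
	p.mu.Unlock()
}

// Remove drops an endpoint from the map.
func (p *endpointPolicyStatusMap) Remove(id uint16) {
	p.mu.Lock()
	delete(p.m, id)
	p.mu.Unlock()
}

// countByStatus aggregates statuses under the lock, which is the part
// UpdateMetrics would run before exporting the gauges.
func (p *endpointPolicyStatusMap) countByStatus() map[EndpointPolicyEnabled]float64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	counts := map[EndpointPolicyEnabled]float64{}
	for _, s := range p.m {
		counts[s]++
	}
	return counts
}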

aanm commented 5 years ago

@nebril that makes total sense.

A Google search returns this result :D https://github.com/cilium/cilium/pull/5621#issuecomment-423976222

Also, it seems the map doesn't need to be pre-initialized:

 func (epPolicyMaps *endpointPolicyStatusMap) UpdateMetrics() {
-       policyStatus := map[models.EndpointPolicyEnabled]float64{
-               models.EndpointPolicyEnabledNone:    0,
-               models.EndpointPolicyEnabledEgress:  0,
-               models.EndpointPolicyEnabledIngress: 0,
-               models.EndpointPolicyEnabledBoth:    0,
-       }
+       policyStatus := map[models.EndpointPolicyEnabled]float64{}
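
That works because indexing a Go map with a missing key yields the value type's zero value, so the increments in UpdateMetrics still start from 0. A quick demonstration:

package main

import "fmt"

func main() {
	// Reading a missing key returns the zero value, so counters can be
	// incremented without pre-seeding every status.
	policyStatus := map[string]float64{}
	policyStatus["ingress"]++
	fmt.Println(policyStatus["ingress"], policyStatus["egress"]) // 1 0
}

One trade-off to weigh: without pre-seeding, a status with zero endpoints is simply absent from the map, so any metric derived from it would never be explicitly set to 0.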

@eloycoto it seems that on update we call endpointPolicyStatus.UpdateMetrics() but not on delete. Is this a bug?

https://github.com/cilium/cilium/blob/fc1bac1f79bdabe77452f2c8e0793a0e2255e2ea/pkg/endpoint/metrics.go#L93-L104

@sedooe The stack trace corresponds to the v1.3.0 code base. Be aware that you might be running cilium v1.3.0 in your cluster without realizing it is not v1.3.2.

sedooe commented 5 years ago

@aanm You're right, it was my mistake to write v1.3.2; we downgraded to v1.3.0 a couple of days ago. I'll edit the report so it doesn't confuse others.