coroot / coroot-node-agent

A Prometheus exporter based on eBPF that gathers comprehensive container metrics
https://coroot.com/docs/metrics/node-agent
Apache License 2.0

Segfault on MongoDB eBPF tracing #16

Closed bmike78 closed 1 year ago

bmike78 commented 1 year ago

Hello,

We're running node-agent 1.8.1 with the new eBPF tracing and ran into this problem on one of our worker nodes:

I0518 13:43:59.861092   13864 registry.go:194] TCP connection from unknown container {connection-open none 3730 10.192.6.192:53932 10.192.6.192:9100 40 592331228739525 <nil>}
panic: runtime error: slice bounds out of range [4:0]

goroutine 78 [running]:
go.mongodb.org/mongo-driver/x/bsonx/bsoncore.newBufferFromReader({0x1ef17c0, 0xc000b12600})
    /go/pkg/mod/go.mongodb.org/mongo-driver@v1.11.4/x/bsonx/bsoncore/document.go:125 +0x1f4
go.mongodb.org/mongo-driver/x/bsonx/bsoncore.NewDocumentFromReader(...)
    /go/pkg/mod/go.mongodb.org/mongo-driver@v1.11.4/x/bsonx/bsoncore/document.go:101
go.mongodb.org/mongo-driver/bson.NewFromIOReader({0x1ef17c0?, 0xc000b12600?})
    /go/pkg/mod/go.mongodb.org/mongo-driver@v1.11.4/bson/raw.go:27 +0x25
github.com/coroot/coroot-node-agent/tracing.bsonToString({0x1ef17c0?, 0xc000b12600?})
    /tmp/src/tracing/mongo.go:60 +0x25
github.com/coroot/coroot-node-agent/tracing.parseMongo({0xc0009ea260, 0x200, 0x200})
    /tmp/src/tracing/mongo.go:56 +0x225
github.com/coroot/coroot-node-agent/tracing.handleMongoQuery({0xfffffffffff43bd5?, 0x203000?, 0x29be0a0?}, {0x38?, 0x8?, 0x29be0a0?}, 0xc0009ea240, {0xc000513440, 0x3, 0x3})
    /tmp/src/tracing/mongo.go:22 +0x7c
github.com/coroot/coroot-node-agent/tracing.HandleL7Request({0xc0004bda90, 0x41}, {{{0x0?, 0x0?}, 0xc000012078?}, 0x0?}, 0xc0009ea240, 0x0?)
    /tmp/src/tracing/tracing.go:76 +0x69d
github.com/coroot/coroot-node-agent/containers.(*Container).onL7Request(0xc000241560, 0x4c6d, 0x13, 0x0, 0xc0009ea240)
    /tmp/src/containers/container.go:523 +0x312
github.com/coroot/coroot-node-agent/containers.(*Registry).handleEvents(0xc000228280, 0xc000172120)
    /tmp/src/containers/registry.go:221 +0xa2d
created by github.com/coroot/coroot-node-agent/containers.NewRegistry
    /tmp/src/containers/registry.go:89 +0x4db

I'll try to track down some more details if I can about the MongoDB pod that it's trying to trace.

def commented 1 year ago

Thanks for the report! The problem is that the eBPF code sends only the first 512 bytes of the request to userspace. It turns out that the bson library does not always handle incomplete documents correctly. Will find a way to fix this.
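
For illustration, here is a minimal sketch of the kind of guard that avoids this class of panic (this is not the agent's actual fix; the function name is hypothetical): every BSON document begins with a little-endian int32 declaring its total size, so a capture cut off at 512 bytes can be detected by comparing that prefix against the bytes actually received.

package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// declaredBSONLen validates a captured payload against the 4-byte little-endian
// length prefix that every BSON document starts with, rejecting truncated or
// inconsistent captures instead of handing them to the bson parser.
func declaredBSONLen(payload []byte) (int, error) {
	if len(payload) < 4 {
		return 0, errors.New("payload shorter than the BSON length prefix")
	}
	declared := int(int32(binary.LittleEndian.Uint32(payload[:4])))
	if declared < 5 || declared > len(payload) {
		return 0, fmt.Errorf("incomplete BSON document: declared %d bytes, captured %d", declared, len(payload))
	}
	return declared, nil
}

func main() {
	// A capture cut off by the 512-byte limit: the prefix declares 600 bytes,
	// but only 8 bytes of the document actually reached userspace.
	truncated := []byte{0x58, 0x02, 0x00, 0x00, 0x10, 0x61, 0x00, 0x01}
	if _, err := declaredBSONLen(truncated); err != nil {
		fmt.Println("skipping trace:", err)
	}
}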

bmike78 commented 1 year ago

The only "mongo" related pod I could find on this worker node is:

https://hub.docker.com/_/mongo-express

This is a web frontend for Mongo, so perhaps the MongoDB eBPF tracer is looking for something in this pod that doesn't exist since it's not running MongoDB.

def commented 1 year ago

The agent captures client requests to MongoDB servers, so everything appears to be correct in this case: when a mongo-express instance calls mongo, the agent captures its queries.

def commented 1 year ago

@bmike78, this has been fixed in node-agent v1.8.2. Please update the agent.

bmike78 commented 1 year ago

@def, I don't see the error in the logs anymore, but I'm still experiencing some issues: the node-agent is getting OOM-killed (exit code 137).

(sum by (namespace,pod,container) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) / sum by (namespace,pod,container) (kube_pod_container_resource_limits{resource="cpu"})) > 0.8


* In the time since I opened this issue, the original worker node was recycled, but the OOM issue was still present.
* I deleted the pod for mongo-express and the OOM errors went away on that worker node.
* The mongo-express pod respawned on a different worker node and that caused the issue to move to that worker node.

The mongo-express pod is not hosting MongoDB (it's just a web UI for Mongo), so is there a way to exclude it from eBPF tracing so the agent doesn't attempt to collect metrics from it?

def commented 1 year ago

@bmike78, could you please collect and share the CPU and memory profiles of the affected agent?

curl -o cpu_profile.tgz 'http://<coroot-node-agent-ip>:80/debug/pprof/profile?seconds=60'
curl -o mem_profile.tgz 'http://<coroot-node-agent-ip>:80/debug/pprof/heap'
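
The downloaded profiles can then be inspected locally with go tool pprof (the listen port below is just an example):

go tool pprof -http=:8082 mem_profile.tgz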

bmike78 commented 1 year ago

@def good news, potentially: it seems that, over the weekend or due to some other changes, we are no longer seeing the error. I'm good to close out this issue. Thanks!

bmike78 commented 1 year ago

@def It looks like the issue came back, but in a different way...

We have just one node-agent, across many clusters, that is showing the OOM issue.

pprof

I gathered a memory profile, loaded it into go tool pprof, and found that the mongo-driver was causing the high memory usage:

http://localhost:8082/ui/source?si=alloc_space
http://localhost:8082/ui/peek?si=alloc_space

go.mongodb.org/mongo-driver/x/bsonx/bsoncore.NewDocumentFromReader
/go/pkg/mod/go.mongodb.org/mongo-driver@v1.11.4/x/bsonx/bsoncore/document.go

  Total:           0     1.03GB (flat, cum) 86.25%
    101            .     1.03GB           ??? 
go.mongodb.org/mongo-driver/x/bsonx/bsoncore.newBufferFromReader
/go/pkg/mod/go.mongodb.org/mongo-driver@v1.11.4/x/bsonx/bsoncore/document.go

  Total:      1.03GB     1.03GB (flat, cum) 86.25%
    121       1.03GB     1.03GB           ??? 
Showing nodes accounting for 1224.32MB, 100% of 1224.32MB total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context          
----------------------------------------------------------+-------------
                                            1056MB   100% |   go.mongodb.org/mongo-driver/x/bsonx/bsoncore.NewDocumentFromReader /go/pkg/mod/go.mongodb.org/mongo-driver@v1.11.4/x/bsonx/bsoncore/document.go:101
    1056MB 86.25% 86.25%     1056MB 86.25%                | go.mongodb.org/mongo-driver/x/bsonx/bsoncore.newBufferFromReader /go/pkg/mod/go.mongodb.org/mongo-driver@v1.11.4/x/bsonx/bsoncore/document.go:121
----------------------------------------------------------+-------------

Pod Name:

mongo-express-b8d6669fc-876pf

node-agent versions:

The segfaults did go away in v1.8.2. I'm running v1.8.9, which is the latest version.

bmike78 commented 1 year ago

mem_profile file (sorry for the format, but GitHub or my proxy is not accepting .tgz for some reason):

heap.txt

I was not able to get a cpu_profile, as the pod would not stay up long enough.

bmike78 commented 1 year ago

cpu_profile files: I was able to get the default "debug" profile and one of about 10 seconds:

profile-debug.txt profile-10s.txt

http://localhost:8082/ui/peek

Showing nodes accounting for 420ms, 100% of 420ms total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context          
----------------------------------------------------------+-------------
                                              60ms 75.00% |   runtime.memclrNoHeapPointersChunked /usr/local/go/src/runtime/malloc.go:1194
                                              20ms 25.00% |   runtime.heapBits.initSpan /usr/local/go/src/runtime/mbitmap.go:794
      80ms 19.05% 19.05%       80ms 19.05%                | runtime.memclrNoHeapPointers /usr/local/go/src/runtime/memclr_amd64.s:93

http://localhost:8082/ui/?si=cpu

The CPU profile shows parseMongo and bsonToString consuming the most CPU, which bottoms out in memclrNoHeapPointersChunked (zeroing newly allocated memory).
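
That lines up with the heap profile above: the driver reads the 4-byte BSON length prefix and then allocates a buffer of the declared size before reading the rest of the document (consistent with the allocation attributed to document.go:121), so a truncated or misaligned capture with a garbage prefix forces a huge allocation, and zeroing it is where the memclr time goes. A small sketch of that behaviour, assuming the go.mongodb.org/mongo-driver module is available (the 256 MiB prefix is deliberately inflated for illustration):

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"runtime"

	"go.mongodb.org/mongo-driver/bson"
)

func main() {
	// A "capture" whose BSON length prefix claims 256 MiB even though only
	// 8 bytes are present, e.g. because the stream was cut off or the parser
	// started mid-message.
	payload := make([]byte, 8)
	binary.LittleEndian.PutUint32(payload, 256<<20)

	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	_, err := bson.NewFromIOReader(bytes.NewReader(payload))
	runtime.ReadMemStats(&after)

	// The read fails with an unexpected EOF, but only after the driver has
	// allocated (and the runtime has zeroed) the full declared size.
	fmt.Println("err:", err)
	fmt.Printf("allocated ~%d MiB for an 8-byte capture\n", (after.TotalAlloc-before.TotalAlloc)>>20)
}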

Hope this helps. If you need more memory and CPU profiles, let me know. Thanks!

def commented 1 year ago

Thanks for the details, @bmike78. I'll try to figure it out!

def commented 1 year ago

I've implemented the fix. Please update the agent to v1.8.10.

bmike78 commented 1 year ago

@def, it's been 20 minutes without any OOM on v1.8.10.

Reviewing CPU/memory to see how that looks...

bmike78 commented 1 year ago

@def memory usage is down to about 30MB, when it was peaking at 700MB at times on v1.8.9.

I consider this resolved. Thank you!