DataDog / datadog-process-agent

Datadog Process Agent
https://datadoghq.com
20 stars 9 forks source link

network-tracer: Fix leaky stats objects that resulted in underflows #311

Closed sfluor closed 5 years ago

sfluor commented 5 years ago

I took a dump of the state / added debug logs to the tracer and got:

--------- network-tracer logs --------------------

network-tracer.log:11:2019-05-16 14:36:57 UTC | NETWORK | ERROR | (ebpf/state.go:305 in updateConnWithStats) | underflow occured ! p:6687|src:172.21.58.160:9000|dst:172.21.3.190:22662|f:0|t:0

---------- ebpf logs -----------------------------

ebpf.logs:64486:         foo-6740  [003] d... 674427.956290: : closing conn sport:9000, dport:22662, daddr_l:-1107094100
ebpf.logs:97361:         foo-6696  [001] d... 674572.249423: : closing conn sport:9000, dport:22662, daddr_l:-1107094100

---------- state events dump ---------------------

CLOSE: {"pid":6687,"type":0,"family":0,"net_ns":4026531957,"source":"172.21.58.160","dest":"172.21.3.190","sport":9000,"dport":22662,"direction":1,"monotonic_sent_bytes":134,"last_sent_bytes":0,"monotonic_recv_bytes":4254,"last_recv_bytes":0,"monotonic_retransmits":0,"last_retransmits":0,"last_update_epoch":674411329457181}

CONNECTION: {"pid":6687,"type":0,"family":0,"net_ns":4026531957,"source":"172.21.58.160","dest":"172.21.3.190","sport":9000,"dport":22662,"direction":1,"monotonic_sent_bytes":0,"last_sent_bytes":0,"monotonic_recv_bytes":4254,"last_recv_bytes":0,"monotonic_retransmits":0,"last_retransmits":0,"last_update_epoch":674411316811701}

CLOSE: {"pid":6687,"type":0,"family":0,"net_ns":4026531957,"source":"172.21.58.160","dest":"172.21.3.190","sport":9000,"dport":22662,"direction":1,"monotonic_sent_bytes":177,"last_sent_bytes":0,"monotonic_recv_bytes":1283,"last_recv_bytes":0,"monotonic_retransmits":0,"last_retransmits":0,"last_update_epoch":674555618757552}

In this case the time between the CONNECTION event and the second CLOSE event is ~140s which is less than 2mn + 30s (our expiry limit for TCP connections + our polling interval), that's why the connection was not considered expired by the userspace code.

The main issue is that the second close event has less stats than the connection event (without a connection close between the two), and there is only 2 close events for this daddr, dport couple in the eBPF logs (that relates to the 2 we have in the state dump).

(I also had a script running that took a dump of the eBPF map every 15 seconds and it didn't have any connection for this (daddr, dport) couple)

There is likely something wrong with how we deal with close events but I don't know what's the issue.

I thought it was coming from close events we would drop here https://github.com/DataDog/datadog-process-agent/blob/master/ebpf/c/tracer-ebpf.c#L557 but I added a debug log there and it was not showing up so it does not seem to be the issue.

The goal of this PR is to at least fix the issue by clearing the old stats objects when we see a connection close event (except if the connection became active again). The issue could still appear if the interval between the CONNECTION and CLOSE event is smaller than our polling interval (30s).

It also fixes an other case where we could have underflows on closed connections becoming active again.