DataDog / datadog-process-agent

Datadog Process Agent
https://datadoghq.com
20 stars 9 forks source link

Avoid copying the connection tuple #294

Closed kevinconaway closed 5 years ago

kevinconaway commented 5 years ago

Instead temporarily reset the pid to 0 for the map lookup.

From looking at a busy host, this was the top source of allocations in the network tracer

(pprof) top
Showing nodes accounting for 26751.13MB, 90.99% of 29401.61MB total
Dropped 66 nodes (cum <= 147.01MB)
Showing top 10 nodes out of 44
      flat  flat%   sum%        cum   cum%
 8036.64MB 27.33% 27.33% 13299.84MB 45.24%  github.com/DataDog/datadog-process-agent/ebpf.(*Tracer).getConnections
 6389.93MB 21.73% 49.07%  6389.93MB 21.73%  github.com/DataDog/datadog-process-agent/ebpf.(*networkState).mergeConnections
 3369.93MB 11.46% 60.53%  3371.43MB 11.47%  github.com/DataDog/datadog-process-agent/vendor/github.com/mailru/easyjson/buffer.getBuf
 3330.19MB 11.33% 71.86%  3340.71MB 11.36%  github.com/DataDog/datadog-process-agent/vendor/github.com/mailru/easyjson/buffer.(*Buffer).BuildBytes
 1538.06MB  5.23% 77.09%  3018.12MB 10.27%  github.com/DataDog/datadog-process-agent/vendor/github.com/iovisor/gobpf/elf.(*Module).LookupElement
 1399.04MB  4.76% 81.85%  1691.55MB  5.75%  github.com/DataDog/datadog-process-agent/ebpf.getConnsByKey
  860.55MB  2.93% 84.77%   861.05MB  2.93%  fmt.Sprintf
  656.03MB  2.23% 87.00%  1744.57MB  5.93%  github.com/DataDog/datadog-process-agent/vendor/github.com/iovisor/gobpf/elf.(*Module).LookupNextElement
  650.53MB  2.21% 89.22%   650.53MB  2.21%  github.com/DataDog/datadog-process-agent/ebpf.(*ConnTuple).copy (inline)
  520.22MB  1.77% 90.99%   552.22MB  1.88%  github.com/DataDog/datadog-process-agent/ebpf.(*networkState).StoreClosedConnection
(pprof) list getConnections
Total: 28.71GB
ROUTINE ======================== github.com/DataDog/datadog-process-agent/ebpf.(*Tracer).getConnections in ebpf/tracer.go
    7.85GB    12.99GB (flat, cum) 45.24% of Total
         .          .    220:
         .          .    221:   for _, key := range closedPortBindings {
         .          .    222:       t.portMapping.RemoveMapping(key)
         .          .    223:       _ = t.m.DeleteElement(portMp, unsafe.Pointer(&key))
         .          .    224:   }
         .   512.09kB    225:
         .          .    226:   return active, latestTime, nil
         .          .    227:}
         .          .    228:
         .          .    229:func (t *Tracer) removeEntries(mp, tcpMp *bpflib.Map, entries []*ConnTuple) {
         .          .    230:   for i := range entries {
         .          .    231:       err := t.m.DeleteElement(mp, unsafe.Pointer(entries[i]))
         .          .    232:       if err != nil {
         .          .    233:           // It's possible some other process deleted this entry already (e.g. tcp_close)
         .          .    234:           log.Warnf("failed to remove entry from connections map: %s", err)
         .     1.70GB    235:       }
         .          .    236:
         .          .    237:       // We have to remove the PID to remove the element from the TCP Map since we don't use the pid there
         .          .    238:       entries[i].pid = 0
   27.11MB    57.12MB    239:       // We can ignore the error for this map since it will not always contain the entry
         .          .    240:       _ = t.m.DeleteElement(tcpMp, unsafe.Pointer(entries[i]))
         .     3.28GB    241:   }
         .          .    242:}
         .          .    243:
         .          .    244:// getTCPStats reads tcp related stats for the given ConnTuple
         .          .    245:func (t *Tracer) getTCPStats(mp *bpflib.Map, tuple *ConnTuple) *TCPStats {
         .          .    246:   // Remove the PID since we don't use it in the TCP Stats map
    7.82GB     7.82GB    247:   tup := tuple.copy()
         .          .    248:   tup.pid = 0
         .          .    249:
         .          .    250:   stats := &TCPStats{retransmits: 0}
         .          .    251:   if err := t.m.LookupElement(mp, unsafe.Pointer(tup), unsafe.Pointer(stats)); err != nil {
         .          .    252:       return stats
         .          .    253:   }
         .   132.01MB    254:
         .          .    255:   return stats
         .          .    256:}
         .          .    257:
         .          .    258:// getLatestTimestamp reads the most recent timestamp captured by the eBPF
         .          .    259:// module.  if the eBFP module has not yet captured a timestamp (as will be the

@DataDog/burrito

sunhay commented 5 years ago

Are there any valid cases for .copy()? Perhaps we can just remove the function completely?

kevinconaway commented 5 years ago

Are there any valid cases for .copy()? Perhaps we can just remove it?

It appears unused. @sfluor can we get rid of this ?

sfluor commented 5 years ago

I think it's used to build the list of connections to expire (because we use a pointer to iterate through the eBPF map) here: https://github.com/DataDog/datadog-process-agent/blob/master/ebpf/tracer.go#L204

If we change this behavior it may be safe to remove 👍

kevinconaway commented 5 years ago

It seems like we need to keep it then. We can't modify the map while we're iterating it, right ?