DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] Datadog Agent Causes Docker Buildkit to Fail Unmounting OverlayFS Layers #18228

Closed adawalli closed 10 months ago

adawalli commented 1 year ago

Agent Environment: Agent 7.46.0 - Commit: b2f5e36 - Serialization version: v5.0.85 - Go version: go1.19.10

Describe what happened: Datadog agent has been causing buildkit failures in kubernetes as discussed in https://github.com/moby/buildkit/issues/3812

Something in the Datadog agent is monitoring the overlayfs layers during a build, which causes buildkit to fail when it cannot unmount those layers. With the Datadog Helm chart installed in an almost completely vanilla configuration, we can reproduce the failure nearly 100% of the time.

With Datadog completely disabled, builds pass 100% of the time. So far, builds have also passed 100% of the time with Universal Service Monitoring turned off, but more testing is required to confirm this.

Sample Error message

ERROR: failed to solve: failed to compute cache key: failed to unmount /tmp/containerd-mount1594417594: failed to unmount target /tmp/containerd-mount1594417594: device or resource busy

Describe what you expected: Datadog should not be causing DIND buildkit builds to fail.

Additional environment details (Operating System, Cloud provider, etc):

guyarb commented 1 year ago

Hey @adawalli, it was great talking with you. Let me know if the solution worked and if we can close the ticket.

nuzayets commented 1 year ago

There are many of us experiencing this issue.

guyarb commented 1 year ago

Hey @nuzayets, please open a Zendesk ticket so we can look into it and help out.

rafaelgaspar commented 1 year ago

Any chance this workaround will be made public? Like @nuzayets said, there are multiple customers experiencing this.

adawalli commented 1 year ago

@guyarb - we have not seen the issue again after implementing your fix. Also would like to see this publicly documented (and fixed if possible)

nuzayets commented 1 year ago

@adawalli What was the fix?

adawalli commented 1 year ago

Was on vacation, sorry for late response.

Disabling service monitoring did the trick for us. This was acceptable in this one cluster, where we are running GitLab jobs. However, I am hoping Datadog will ship a proper fix.

    datadog:
      serviceMonitoring:
        enabled: false

nuzayets commented 1 year ago

Thank you!

sigwinch28 commented 11 months ago

Any progress on a fix for this without disabling the monitoring?

guyarb commented 11 months ago

Thanks for bumping it @sigwinch28

Actually, there is no need to disable USM entirely; disabling HTTPS monitoring alone is enough.

    agents:
      containers:
        systemProbe:
          env:
            - name: DD_SYSTEM_PROBE_NETWORK_ENABLE_HTTPS_MONITORING
              value: "false"

Regarding a fix, we're still working on it.

guyarb commented 10 months ago

Hey folks, earlier today we released a new agent version, 7.50.0 (and 6.50.0), which contains a fix for the problem above: we're now ignoring the buildkit process in our HTTPS hooking mechanism. I encourage you to upgrade and re-enable the feature.
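
For anyone who applied one of the workarounds earlier in this thread, reverting it after the upgrade might look roughly like the sketch below. This is not an official snippet. The `agents.image.tag` key is an assumption about the Datadog Helm chart's values layout and should be checked against your chart version. The other keys simply mirror the workarounds posted above with their values flipped.

```yaml
# Sketch only: verify key names against the values.yaml of your chart version.
agents:
  image:
    tag: 7.50.0          # assumed key; pins the agent to the release containing the fix
  containers:
    systemProbe:
      env:
        # Only relevant if you previously set this override to "false".
        - name: DD_SYSTEM_PROBE_NETWORK_ENABLE_HTTPS_MONITORING
          value: "true"
datadog:
  serviceMonitoring:
    enabled: true        # only relevant if you previously disabled USM entirely
```

After rolling this out, it is probably worth re-running a few buildkit builds on the affected nodes to confirm the unmount error no longer appears before removing any other mitigations.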