TraceMachina / nativelink

NativeLink is an open-source, high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and support for specialized hardware.
https://nativelink.com
Apache License 2.0
1.19k stars 117 forks

Implement native OpenTelemetry infrastructure #1461

Open aaronmondal opened 2 weeks ago

aaronmondal commented 2 weeks ago

This commit adds the OTLP exports to Nativelink and extends the nativelink deployments in the operator with OpenTelemetryCollector sidecars. The exposed traces, metrics and logs are published through Kafka to NATS Jetstream.


aaronmondal commented 2 weeks ago

cc @allada @SchahinRohani You might want to play around with this while it's in preview.

@allada One thing that we'll need to figure out is where to put the OtlpServer in the nativelink-config. At the moment it requires a dummy listener which might not play well with workers.

@SchahinRohani You might want to look into OTLP, Kafka topics and NATS Jetstream. This initial implementation doesn't add structure, but at least it provides a central point to aggregate the logs, traces and metrics of nativelink deployments.

I'll polish this a bit, but for now we have the following (assuming the nativelink-cas pod is named e.g. nativelink-cas-ff6544bb8-v4w86):

# Telemetry (CPU usage etc)
kubectl port-forward nativelink-cas-ff6544bb8-v4w86 8888
curl localhost:8888/metrics

# The info that was previously in the experimental_prometheus endpoint
kubectl port-forward nativelink-cas-ff6544bb8-v4w86 8889
curl localhost:8889/metrics

# The aggregated stream across cas, scheduler and worker deployments
kubectl exec -n nats-system nats-box-5d4d987f5b-thbfs -- nats stream info TELEMETRY
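
The /metrics output can be long, so a small filter helps when poking at it. This is a minimal sketch, assuming the endpoints return the usual Prometheus text format (comment lines starting with `#`, one sample per line). `filter_metrics` and the `my_prefix_` prefix are hypothetical names for illustration, not part of NativeLink.

```shell
# Hypothetical helper: drop "# HELP" / "# TYPE" comment lines and blank
# lines, and optionally keep only lines that start with a given prefix.
filter_metrics() {
  prefix="${1:-}"
  # Reads the metrics text from stdin. An empty prefix keeps every sample.
  awk -v p="$prefix" '!/^#/ && NF > 0 && (p == "" || index($0, p) == 1)'
}

# Usage (assumes the port-forwards above are running in another terminal;
# "my_prefix_" is a placeholder, not a real metric name):
#   curl -s localhost:8888/metrics | filter_metrics
#   curl -s localhost:8889/metrics | filter_metrics my_prefix_
```

Filtering on the start of the line keeps the helper independent of any particular metric naming scheme.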

Also, I still have a small bug in the kustomization, so you'll need to apply it twice. The one I use is this variant of the deploy/dev operator:

native up

# Then modify kubernetes/deploy/dev to this and run `kubectl apply -k deploy/dev`:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

components:
- ../../kubernetes/components/operator

# Change this value to deploy custom overlays.
patches:
- patch: |-
    - op: replace
      path: /spec/path
      value: ./kubernetes/overlays/lre
  target:
    kind: Kustomization
    name: nativelink

# Modify this value to change the URL of the repository with deployment files.
#
# This is usually only necessary if you change deployment YAML files or
# NativeLink config files. If you only intend to change the Rust sources you can
# leave this as is, but you need to ensure that the Alerts below are patched to
# build your local sources.
- patch: |-
    - op: replace
      path: /spec/url
      value: https://github.com/aaronmondal/nativelink
    - op: replace
      path: /spec/ref/branch
      value: otel
  target:
    kind: GitRepository
    name: nativelink

# Setting the flake outputs to `./src_root#xxx` causes the Tekton pipelines to
# build nativelink from your local sources.
#
# During development, the following formats might be useful as well:
#
# `github:user/repo#outname` to build an image from an arbitrary flake output.
#
# `github:TraceMachina/nativelink?ref=pull/<PR_NUMBER>/head#<OUT>` to deploy
# outputs from a pull request.
- patch: |-
    - op: replace
      path: /spec/eventMetadata/flakeOutput
      value: ./src_root#image
  target:
    kind: Alert
    name: nativelink-image-alert
- patch: |-
    - op: replace
      path: /spec/eventMetadata/flakeOutput
      value: ./src_root#nativelink-worker-init
  target:
    kind: Alert
    name: nativelink-worker-init-alert
- patch: |-
    - op: replace
      path: /spec/eventMetadata/flakeOutput
      value: ./src_root#nativelink-worker-lre-cc
  target:
    kind: Alert
    name: nativelink-worker-alert
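
Until the kustomization bug is fixed, the "apply it twice" step can be wrapped in a small helper. This is only a sketch: `run_twice` is a hypothetical name, and it assumes the first pass may fail or apply only partially while the second pass decides the result. The cause of the bug is not covered here.

```shell
# Hypothetical helper: run the given command twice.
run_twice() {
  # First pass: the exit status is ignored on purpose.
  "$@" || true
  # Second pass: its exit status becomes the result of run_twice.
  "$@"
}

# Usage:
#   run_twice kubectl apply -k deploy/dev
```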