Open aaronmondal opened 2 weeks ago
cc @allada @SchahinRohani You might want to play around with this while it's in preview.
@allada One thing that we'll need to figure out is where to put the OtlpServer in the nativelink-config. At the moment it requires a dummy listener, which might not play well with workers.
@SchahinRohani You might want to look into OTLP, Kafka topics and NATS Jetstream. This initial implementation doesn't add structure, but at least it provides a central point to aggregate the logs, traces and metrics of nativelink deployments.
I'll polish this a bit, but for now we have the following (assuming the nativelink-cas pod is e.g. nativelink-cas-ff6544bb8-v4w86):
# Telemetry (CPU usage etc)
kubectl port-forward nativelink-cas-ff6544bb8-v4w86 8888
curl localhost:8888/metrics
# The info that was previously in the experimental_prometheus endpoint
kubectl port-forward nativelink-cas-ff6544bb8-v4w86 8889
curl localhost:8889/metrics
# The aggregated stream across cas, scheduler and worker deployments
kubectl exec -n nats-system nats-box-5d4d987f5b-thbfs -- nats stream info TELEMETRY
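To get a quick overview of what the two `/metrics` endpoints expose, the output can be reduced to its distinct metric names. This is only a rough sketch: it assumes the endpoints return the Prometheus text exposition format, and `metric_names` is a hypothetical helper, not something shipped with nativelink.

```shell
# Hypothetical helper: list the distinct metric names found in Prometheus
# text exposition output read from stdin.
metric_names() {
  # Drop "# HELP" / "# TYPE" comment lines and blank lines, then cut each
  # sample line at the first "{" (start of labels) or space (start of value).
  grep -v -e '^#' -e '^[[:space:]]*$' | sed -e 's/[{ ].*$//' | sort -u
}

# Usage, assuming the matching port-forward is already running:
#   curl -s localhost:8888/metrics | metric_names
```

Comparing the name lists from port 8888 and port 8889 is a cheap way to see which metrics come from which endpoint.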
Also, I still have a small bug in the kustomization: you'll need to apply it twice. The one I use is this variant of the deploy/dev operator:
native up
# Then modify kubernetes/deploy/dev to this and run `kubectl apply -k deploy/dev`:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

components:
  - ../../kubernetes/components/operator

# Change this value to deploy custom overlays.
patches:
  - patch: |-
      - op: replace
        path: /spec/path
        value: ./kubernetes/overlays/lre
    target:
      kind: Kustomization
      name: nativelink

  # Modify this value to change the URL of the repository with deployment files.
  #
  # This is usually only necessary if you change deployment YAML files or
  # NativeLink config files. If you only intend to change the Rust sources you can
  # leave this as is and need to ensure that the Alerts below are patched to build
  # your local sources.
  - patch: |-
      - op: replace
        path: /spec/url
        value: https://github.com/aaronmondal/nativelink
      - op: replace
        path: /spec/ref/branch
        value: otel
    target:
      kind: GitRepository
      name: nativelink

  # Setting the flake outputs to `./src_root#xxx` causes the Tekton pipelines to
  # build nativelink from your local sources.
  #
  # During development, the following formats might be useful as well:
  #
  # `github:user/repo#outname` to build an image from an arbitrary flake output.
  #
  # `github:TraceMachina/nativelink?ref=pull/<PR_NUMBER>/head#<OUT>` to deploy
  # outputs from a pull request.
  - patch: |-
      - op: replace
        path: /spec/eventMetadata/flakeOutput
        value: ./src_root#image
    target:
      kind: Alert
      name: nativelink-image-alert

  - patch: |-
      - op: replace
        path: /spec/eventMetadata/flakeOutput
        value: ./src_root#nativelink-worker-init
    target:
      kind: Alert
      name: nativelink-worker-init-alert

  - patch: |-
      - op: replace
        path: /spec/eventMetadata/flakeOutput
        value: ./src_root#nativelink-worker-lre-cc
    target:
      kind: Alert
      name: nativelink-worker-alert
This commit adds the OTLP exports to Nativelink and extends the nativelink deployments in the operator with OpenTelemetryCollector sidecars. The exposed traces, metrics and logs are published through Kafka to NATS Jetstream.