facebookincubator / dynolog

Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
MIT License

[prometheus] Add more metrics and smooth out prometheus implementation #181

Closed. briancoutinho closed this pull request 11 months ago.

briancoutinho commented 11 months ago

Details

  1. Add hostname and refactor everything to use a common function.
  2. Make prometheus-cpp a git submodule and update build scripts.
  3. Add more metrics, excluding per network and per GPU metrics.
  4. Bump version.

Test Plan

Ran on my laptop using a docker container.

How to Run

Please install Docker Desktop.

  1. Download this Dockerfile into a directory: https://gist.github.com/briancoutinho/c5faaa60e49a5ad796b972e6b3ef175d
  2. Build the image: docker build . -t prometheus:v2
  3. Run the Docker container, forwarding port 9090 and mounting the dynolog open source repo: docker run -p 9090:9090 -it -v ~/Work/dynolog_oss/dynolog:/workspace/dynolog prometheus:v2 /bin/bash
  4. Build dynolog: ./scripts/build.sh
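
Steps 2 and 3 above can be sketched as a small host-side script. This is only an illustration: the DRY_RUN and DYNOLOG_SRC variables and the run helper are hypothetical names that are not part of this pull request, and the script assumes it is started from the directory holding the downloaded Dockerfile. By default it only prints the commands; set DRY_RUN=0 to execute them. Step 4 is left out because it runs inside the container.

```shell
#!/bin/sh
# Sketch of steps 2 and 3. All variable names here are illustrative assumptions.
# DYNOLOG_SRC: host path of the dynolog checkout (defaults to the path used above).
DYNOLOG_SRC="${DYNOLOG_SRC:-$HOME/Work/dynolog_oss/dynolog}"
IMAGE="prometheus:v2"
# DRY_RUN=1 (default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it unchanged.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 2: build the image from the Dockerfile in the current directory.
run docker build . -t "$IMAGE"

# Step 3: start the container, forwarding port 9090 and mounting the repo.
run docker run -p 9090:9090 -it \
  -v "$DYNOLOG_SRC:/workspace/dynolog" "$IMAGE" /bin/bash
```

Once the container shell is open, step 4 (./scripts/build.sh) is run from inside it.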

To set up logging to Prometheus, add the following under scrape_configs in prometheus.yml:

  - job_name: "dynolog"
    static_configs:
      - targets: ["localhost:8080"]
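
For a fresh setup with no existing config, a complete minimal file containing this job could be generated as below. This is a sketch under assumptions: the 10s scrape interval is chosen here only for illustration and is not specified by this pull request, and CONFIG is a hypothetical variable name. Note that it overwrites the target file if it already exists.

```shell
#!/bin/sh
# Sketch: write a minimal Prometheus config that scrapes dynolog.
# Assumption: scrape_interval of 10s is illustrative, not taken from the PR.
# Warning: this overwrites $CONFIG if it already exists.
CONFIG="${CONFIG:-prometheus.yml}"

# The quoted 'EOF' keeps the YAML literal (no shell expansion inside).
cat > "$CONFIG" <<'EOF'
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: "dynolog"
    static_configs:
      - targets: ["localhost:8080"]
EOF

echo "wrote $CONFIG"
```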

Then run Prometheus and dynolog:

./prometheus --config.file prometheus.yml &
cd -
./build/bin/dynolog -kernel_monitor_reporting_interval_s 10 -use_JSON -use_prometheus &

Open http://localhost:9090/

[Four screenshots, taken 2023-10-16]

facebook-github-bot commented 11 months ago

@briancoutinho has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot commented 11 months ago

This pull request was exported from Phabricator. Differential Revision: D50393435

facebook-github-bot commented 11 months ago

@briancoutinho merged this pull request in facebookincubator/dynolog@5a138ae36dbe7633ccebc1b82a2ddd0c3fd963db.