facebookincubator / dynolog

Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components of the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
MIT License

Cannot capture CPU metrics inside docker container. #183

Closed stricklandye closed 7 months ago

stricklandye commented 8 months ago

Hi there. Why can't dynolog capture CPU metrics? There is also no log at /var/log/dynolog.log. Is there something wrong with the way I'm using it?

How to Reproduce

  1. Build the image from dynolog_hta.dockerfile.
  2. Run a container with the following command:
    docker run -ti -v /usr/src:/usr/src:ro \
        -v /lib/modules/:/lib/modules:ro \
        -v /sys/kernel/debug/:/sys/kernel/debug:rw \
        --net=host --pid=host --privileged \
        dynolog:v0.1 \
        bash  
  3. Inside the container, start dynolog. No CPU metrics were recorded, and the log file didn't exist either :(.
    /workspace# dynolog  --enable-perf-monitor
    I20231107 12:06:08.049125 968085 Main.cpp:151] Starting dynolog, version = 0.2.2, build git-hash = d70fccd
    I20231107 12:06:08.049226 968085 SimpleJsonServer.cpp:82] Listening to connections on port 1778
    I20231107 12:06:08.049235 968085 SimpleJsonServer.cpp:229] Launching RPC thread
    I20231107 12:06:08.049328 968088 SimpleJsonServer.cpp:207] Waiting for connection.
    I20231107 12:06:08.049425 968086 Main.cpp:82] Running kernel monitor loop : interval = 60 s.
    I20231107 12:06:08.061357 968087 Main.cpp:106] Running perf monitor loop : interval = 60 s.
    /workspace# ls /var/log
    alternatives.log  apt  bootstrap.log  btmp  dpkg.log  faillog  lastlog  wtmp

    By the way, I also tried running dynolog on the host, but it is not compatible with the OpenSSL already installed there (OpenSSL 3.0.2). It seems dynolog requires OpenSSL 1.1.1, which will soon no longer be maintained (according to the OpenSSL docs), so it would be better to support the newer OpenSSL.

briancoutinho commented 7 months ago

Hi @stricklandye, for the docker/HTA demo dynolog is not configured to emit the metrics. Try adding these flags; you can also configure the logging/measurement intervals:

dynolog  --enable-perf-monitor --use_JSON   --kernel_monitor_reporting_interval_s 10 --perf_monitor_reporting_interval_s 10

Let us know if this works. Flag reference = https://github.com/facebookincubator/dynolog/blob/main/dynolog/src/Main.cpp#L44
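
As a concept sketch of consuming that output, assuming --use_JSON makes dynolog append one JSON record per reporting interval to /var/log/dynolog.log (the metric key names below are hypothetical placeholders, not confirmed dynolog field names):

```python
import json

# Hypothetical example of one line that dynolog --use_JSON might append to
# /var/log/dynolog.log; the actual key names depend on the dynolog version,
# so treat these purely as placeholders.
sample_line = '{"cpu_u": 1.5, "cpu_s": 0.7, "rx_bytes": 10240}'

def parse_metric_line(line: str) -> dict:
    """Parse one JSON-formatted metrics record into a plain dict."""
    return json.loads(line)

metrics = parse_metric_line(sample_line)
print(metrics["cpu_u"])  # 1.5
```

In practice you would read /var/log/dynolog.log line by line and feed each record to whatever collector you use.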

PS: There is early-stage support for Prometheus too; see the test plan in the PR below for instructions. You may need to merge the Dockerfile commands from that test plan with yours above. https://github.com/facebookincubator/dynolog/pull/181

PS: OpenSSL is probably being pulled in by a dependency we use (cpr); will get back on it.

stricklandye commented 7 months ago

Sorry for not replying sooner; adding --use_JSON works well. I will close this issue.

stricklandye commented 7 months ago

@briancoutinho Hi, I also have some questions:

  1. The docs say KINETO_USE_DAEMON=1 and dynolog --enable_ipc_monitor are required to trace GPUs. What is the right way to trace GPUs inside a Kubernetes cluster? If dynolog runs as a daemon service on the host, it seems it cannot trace an AI program running inside a container, and it isn't feasible to bundle dynolog into every container image either.
  2. Is the performance overhead of dynolog's GPU tracing high?
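
For context, the two-part setup the first question refers to can be sketched like this (a minimal sketch following the dynolog README; the trace path is just an example):

```python
import os

# Client side (per the dynolog docs): kineto reads KINETO_USE_DAEMON when the
# library initializes, so it must be set before torch is imported. The torch
# import is commented out so this snippet stands on its own.
os.environ["KINETO_USE_DAEMON"] = "1"
# import torch  # the training job proceeds as usual from here

# Daemon side, on the same host (shell commands shown as comments):
#   dynolog --enable_ipc_monitor
#   dyno gputrace --log-file /tmp/libkineto_trace.json
print(os.environ["KINETO_USE_DAEMON"])  # prints "1"
```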
briancoutinho commented 7 months ago

Hi,

  1. Yes, that's a great question. The PyTorch program and dynolog communicate over Linux named sockets (you can read about the design in the ipcfabric/ directory in dynolog).

We haven't actually tried this with Docker containers. It does work when dynolog runs as root and we use containers based on Linux cgroups; inside Meta we have a Docker-like system (Twine). Docker uses cgroups too but may need some special permission settings. How about filing an issue for "support dynolog tracing on docker"? We will do some research on it.
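
To illustrate the mechanism only (this is not dynolog's actual wire protocol; the socket name and payload below are made up), Linux abstract-namespace sockets live in a kernel registry scoped to the network namespace rather than on the filesystem, which is one reason namespace sharing (e.g. --net=host) matters for container setups:

```python
import socket

# Concept sketch of Linux abstract-namespace Unix sockets, the kind of
# endpoint an IPC fabric like dynolog's can use. The name "\0demo_dynolog"
# and the payload are invented for this demo. Abstract names start with a
# NUL byte and never appear on the filesystem.
server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind("\0demo_dynolog")  # abstract name: leading NUL, no filesystem path

client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
client.sendto(b"trace-request", "\0demo_dynolog")

data, _ = server.recvfrom(1024)
print(data.decode())  # prints "trace-request"
server.close()
client.close()
```

A container in a separate network namespace cannot see the host's abstract sockets, which is consistent with needing --net=host (or a shared namespace) in the docker command above.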

  2. The overhead is around 3-4% AFAIK. By the way, the tracing happens inside the application via the PyTorch/kineto library, so dynolog just passes along the trace-start message.

Actually, I am out on vacation for a few weeks; someone from my team at Meta can help out :)

Best, Brian


stricklandye commented 7 months ago

I see. Have a great time :D