facebookincubator / dynolog

Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
MIT License

run dynolog Segmentation fault #168

Open zhuzhenxxx opened 12 months ago

zhuzhenxxx commented 12 months ago

The host machine runs CentOS. In the container built from dynolog's Dockerfile, running ./build/dynolog/src/dynolog -enable_gpu_monitor -use_JSON causes a segmentation fault. DCGM fails to start via systemctl, so after installing it I started the service manually from the command line with /usr/bin/nv-hostengine -n --service-account nvidia-dcgm.

root@j66f07370 dynolog]# Started host engine version 3.1.8 using port number: 5555
/usr/bin/nv-hostengine -n
./build/dynolog/src/dynolog -enable_gpu_monitor -use_JSON
I20230726 09:32:50.613127 4285 Main.cpp:163] Starting dynolog, version = 0.3.0, build git-hash = bb3c3a0
I20230726 09:32:50.613193 4285 DcgmGroupInfo.cpp:125] Creating DCGM instance with fields: 100 155 204 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012
I20230726 09:32:50.613641 4285 DcgmApiStub.cpp:144] Parse "libdcgm.so.3", dcgm version = 3
I20230726 09:32:50.613673 4285 DcgmApiStub.cpp:175] Loaded dcgm dynamic library
I20230726 09:32:50.634850 4285 DcgmGroupInfo.cpp:172] Added group id 2
I20230726 09:32:50.634871 4285 DcgmGroupInfo.cpp:182] Found 2 supported devices, with id:
I20230726 09:32:50.634882 4285 DcgmGroupInfo.cpp:187] Successfully add device: 0
I20230726 09:32:50.634891 4285 DcgmGroupInfo.cpp:187] Successfully add device: 1
I20230726 09:32:50.634907 4285 DcgmGroupInfo.cpp:218] Added field group 4 to group 2
I20230726 09:32:50.634915 4285 DcgmGroupInfo.cpp:228] Watching DCGM fields at interval (ms) = 10000
E20230726 09:32:50.715700 4285 DcgmGroupInfo.cpp:239] Failed dcgmWatchFields() return: -33 with group 2, field group 4
I20230726 09:32:50.715747 4285 DcgmGroupInfo.cpp:414] Unwatched profiling fields for group id 2
E20230726 09:32:50.715778 4285 DcgmGroupInfo.cpp:420] Failed dcgmUnwatchFields() for field group 4, return: -33
I20230726 09:32:50.715791 4285 DcgmGroupInfo.cpp:431] Destroyed field group 4
I20230726 09:32:50.715837 4285 DcgmGroupInfo.cpp:439] Destroyed group 2
I20230726 09:32:51.638577 4285 DcgmGroupInfo.cpp:445] Stopped embedded mode
I20230726 09:32:51.638674 4285 DcgmGroupInfo.cpp:451] Shutdown DCGM
I20230726 09:32:51.638762 4291 Main.cpp:143] Running DCGM loop : interval = 10 s.
I20230726 09:32:51.638803 4285 SimpleJsonServer.cpp:82] Listening to connections on port 1778
I20230726 09:32:51.638808 4291 Main.cpp:145] DCGM fields: 100,155,204,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012
I20230726 09:32:51.638819 4285 SimpleJsonServer.cpp:229] Launching RPC thread
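For anyone triaging a similar log, the failing DCGM calls and their return codes can be pulled out mechanically. The following is a minimal Python sketch, not part of dynolog: the function name failed_dcgm_calls and the regular expression are illustrative assumptions, written only against the two error lines shown in the log above.

```python
import re

# Matches glog error lines such as:
#   E... DcgmGroupInfo.cpp:239] Failed dcgmWatchFields() return: -33 with group 2, ...
#   E... DcgmGroupInfo.cpp:420] Failed dcgmUnwatchFields() for field group 4, return: -33
# The pattern is an assumption based on those two lines only.
FAILED_CALL = re.compile(r"Failed (?P<call>dcgm\w+)\(\).*?return: (?P<code>-?\d+)")


def failed_dcgm_calls(log_text):
    """Return a list of (function_name, return_code) for each failed DCGM call."""
    return [
        (m.group("call"), int(m.group("code")))
        for m in FAILED_CALL.finditer(log_text)
    ]


sample = (
    "E20230726 09:32:50.715700 4285 DcgmGroupInfo.cpp:239] "
    "Failed dcgmWatchFields() return: -33 with group 2, field group 4\n"
    "I20230726 09:32:50.715747 4285 DcgmGroupInfo.cpp:414] "
    "Unwatched profiling fields for group id 2\n"
    "E20230726 09:32:50.715778 4285 DcgmGroupInfo.cpp:420] "
    "Failed dcgmUnwatchFields() for field group 4, return: -33\n"
)

for call, code in failed_dcgm_calls(sample):
    print(f"{call} returned {code}")
```

In this log both failures carry the same code, -33, which suggests a single underlying cause rather than two separate problems.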

zhuzhenxxx commented 12 months ago

CPU profiling was successful.

briancoutinho commented 12 months ago

@zhuzhenxxx I looked at the error you were seeing, -33. You can find it in dcgm_structs.h:

DCGM_ST_MODULE_NOT_LOADED = -33, //!< This request is serviced by a module of DCGM that is not currently loaded

This means that a DCGM module has not been loaded, or that certain field groups are not supported. This might be due to the container environment.

How about this: add just one field, "--dcgm_fields 100" (SM Clock), to the dynolog command line. See https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/dcgm_fields.h#L435 for the field definition. The command line flags are documented here: https://github.com/facebookincubator/dynolog#gpu-monitoring
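To narrow down which fields trigger the failure, one option is to separate the basic fields from the profiling ones and try each set on its own. The Python sketch below only builds the command lines and does not run dynolog. The helper names are hypothetical, the field list is copied from the "Creating DCGM instance with fields" log line, and two things are assumptions: that IDs from 1001 upward are the DCGM profiling fields, and that --dcgm_fields accepts a comma-separated list (the format dynolog prints in its "DCGM fields:" log line).

```python
# Field IDs copied from the dynolog startup log in this issue.
DEFAULT_FIELDS = [100, 155, 204, 1001, 1002, 1003, 1004, 1005,
                  1006, 1007, 1008, 1009, 1010, 1011, 1012]


def split_fields(fields, profiling_start=1001):
    """Split field IDs into (basic, profiling).

    Assumption: IDs >= profiling_start are DCGM profiling fields, which
    depend on the DCGM profiling module being loaded.
    """
    basic = [f for f in fields if f < profiling_start]
    profiling = [f for f in fields if f >= profiling_start]
    return basic, profiling


def dynolog_command(fields, binary="./build/dynolog/src/dynolog"):
    """Build an argv list for dynolog restricted to the given field IDs.

    Assumption: --dcgm_fields takes a comma-separated list of IDs.
    """
    return [binary, "-enable_gpu_monitor", "-use_JSON",
            "--dcgm_fields", ",".join(str(f) for f in fields)]


basic, profiling = split_fields(DEFAULT_FIELDS)
print(" ".join(dynolog_command([100])))   # the single-field suggestion
print(" ".join(dynolog_command(basic)))   # all non-profiling fields
```

If the single-field and basic-field runs work but adding any profiling field brings back -33, that points at the profiling module rather than at dynolog itself.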

zhuzhenxxx commented 11 months ago

Thank you very much. I tried the method you suggested, and with --dcgm_fields 100 there is no segfault. But there is a new problem: I now receive a return value of -6 from DCGM, which it reports as "Feature not supported". What does that mean?

stricklandye commented 8 months ago

I am also getting -6 :(.

stricklandye commented 8 months ago

I searched the source code for an answer. According to the header file dynolog/src/gpumon/dcgm_structs.h, -6 indicates that some feature is not supported, yet I can run dcgm-exporter directly on the same machine. I don't know why.
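For quick reference, the two return codes discussed in this thread can be kept in a small lookup. This is a Python sketch covering only those two codes. The -33 entry is quoted from dcgm_structs.h earlier in the thread. For -6, the message is the one reported above, and the enum name DCGM_ST_NOT_SUPPORTED is an assumption from memory of that header and should be checked against your copy.

```python
# Only the two codes seen in this issue; not a complete list of DCGM statuses.
DCGM_STATUS = {
    -33: ("DCGM_ST_MODULE_NOT_LOADED",
          "This request is serviced by a module of DCGM that is not currently loaded"),
    # Enum name is an assumption; the message is the one reported in this thread.
    -6: ("DCGM_ST_NOT_SUPPORTED", "Feature not supported"),
}


def describe_dcgm_status(code):
    """Return a readable description for a DCGM return code seen in this issue."""
    name, message = DCGM_STATUS.get(code, ("UNKNOWN", "not covered by this sketch"))
    return f"{code}: {name} ({message})"


print(describe_dcgm_status(-33))
print(describe_dcgm_status(-6))
```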