aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0
4.94k stars, 299 forks

Not able to see GPU memory consumption as part of system metrics in aim stack. #3020

Closed dushyantbehl closed 7 months ago

dushyantbehl commented 9 months ago

❓Question

I have been using aimstack version 3.17.5 and am unable to see any GPU memory consumption during aimstack runs.

The dashboard shows GPU % and GPU temperature, but not the memory used. Is there a way to track what is going on?

I am happy to share any information about the environment you may need. Thanks in advance.

dushyantbehl commented 9 months ago

Hi @alberttorosyan, thanks for marking this as a bug. Could I be of any help in fixing this or digging deeper? Please let me know; I'll be happy to help.

alberttorosyan commented 9 months ago

@dushyantbehl, here's the code snippet which extracts the GPU information before passing it to Aim tracking methods:

                gpu_info = dict()
                handle = nvml.nvmlDeviceGetHandleByIndex(i)
                try:
                    util = nvml.nvmlDeviceGetUtilizationRates(handle)
                    # GPU utilization percent
                    gpu_info["gpu"] = round10e5(util.gpu)
                except nvml.NVMLError_NotSupported:
                    pass
                try:
                    # Get device memory
                    memory = nvml.nvmlDeviceGetMemoryInfo(handle)
                    # Device memory usage
                    # 'memory_used': round10e5(memory.used / 1024 / 1024),
                    gpu_info["gpu_memory_percent"] = round10e5(memory.used * 100 / memory.total)
                except nvml.NVMLError_NotSupported:
                    pass
                try:
                    # Get device temperature
                    nvml_tmp = nvml.NVML_TEMPERATURE_GPU
                    temp = nvml.nvmlDeviceGetTemperature(handle, nvml_tmp)
                    # Device temperature
                    gpu_info["gpu_temp"] = round10e5(temp)
                except nvml.NVMLError_NotSupported:
                    pass
                try:
                    # Compute power usage in watts and percent
                    power_watts = nvml.nvmlDeviceGetPowerUsage(handle) / 1000
                    power_cap = nvml.nvmlDeviceGetEnforcedPowerLimit(handle)
                    power_cap_watts = power_cap / 1000
                    power_usage = power_watts / power_cap_watts * 100
                    # Power usage in watts and percent
                    gpu_info["gpu_power_watts"] = round10e5(power_watts)
                    # gpu_info["power_percent"] = round10e5(power_usage)
                except nvml.NVMLError_NotSupported:
                    pass

Each call to the NVML API is wrapped in a try/except block. If you see power consumption, temperature, etc. but not memory, it means that specific call failed because the device does not support it.
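To make a silently skipped metric visible, the same per-call try/except pattern can also record which calls raised. This is a minimal sketch, not Aim's implementation: `probe_gpu_metrics` and `_memory_percent` are hypothetical names, and the `nvml` argument is assumed to be a pynvml-style module (or any object exposing the same calls), passed in so the function can be exercised without a GPU.

```python
def _memory_percent(nvml, handle):
    # Percentage of device memory currently in use.
    memory = nvml.nvmlDeviceGetMemoryInfo(handle)
    return memory.used * 100 / memory.total


def probe_gpu_metrics(nvml, handle):
    """Return (metrics, unsupported) for one device handle.

    `nvml` is assumed to expose the pynvml-style calls used below.
    """
    probes = {
        "gpu": lambda: nvml.nvmlDeviceGetUtilizationRates(handle).gpu,
        "gpu_memory_percent": lambda: _memory_percent(nvml, handle),
        "gpu_temp": lambda: nvml.nvmlDeviceGetTemperature(
            handle, nvml.NVML_TEMPERATURE_GPU
        ),
    }
    metrics, unsupported = {}, []
    for name, probe in probes.items():
        try:
            metrics[name] = probe()
        except nvml.NVMLError_NotSupported:
            # Record the skipped metric instead of silently passing.
            unsupported.append(name)
    return metrics, unsupported
```

If `gpu_memory_percent` appears in `metrics` here but not in the dashboard, the device-support explanation is unlikely and the problem lies further along the tracking path.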

ChanderG commented 8 months ago

@alberttorosyan It was not a device support problem, since using the NVML APIs directly worked. After some debugging, I found the cause and have opened a PR here: https://github.com/aimhubio/aim/pull/3044
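For anyone who wants to repeat the direct check, a query along these lines should be enough to rule out device support. This is a rough sketch under assumptions: `gpu_memory_mib` is a hypothetical helper, the `nvml` argument is assumed to be the `pynvml` module, and it does not reproduce the change made in the linked PR.

```python
def gpu_memory_mib(nvml, index=0):
    """Return (used_mib, total_mib) for device `index` by calling NVML directly.

    `nvml` is assumed to be the pynvml module or an object with the same calls.
    """
    nvml.nvmlInit()
    try:
        handle = nvml.nvmlDeviceGetHandleByIndex(index)
        memory = nvml.nvmlDeviceGetMemoryInfo(handle)
        # NVML reports bytes; convert to MiB.
        return memory.used / 1024 / 1024, memory.total / 1024 / 1024
    finally:
        # Always release NVML, even if a call above raised.
        nvml.nvmlShutdown()


# Usage on a machine with an NVIDIA driver and pynvml installed (assumed):
#   import pynvml
#   print(gpu_memory_mib(pynvml))
```

If this returns sensible values, the device supports memory queries.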

dushyantbehl commented 7 months ago

Fix merged here - https://github.com/aimhubio/aim/pull/3044