Closed: dushyantbehl closed this issue 7 months ago
Hi @alberttorosyan, thanks for marking this as a bug. Could I be of any help in fixing it or digging deeper? Please let me know; I'd be happy to help.
@dushyantbehl, here's the code snippet which extracts the GPU information before passing it to Aim tracking methods:
gpu_info = dict()
handle = nvml.nvmlDeviceGetHandleByIndex(i)

try:
    util = nvml.nvmlDeviceGetUtilizationRates(handle)
    # GPU utilization percent
    gpu_info["gpu"] = round10e5(util.gpu)
except nvml.NVMLError_NotSupported:
    pass

try:
    # Get device memory
    memory = nvml.nvmlDeviceGetMemoryInfo(handle)
    # Device memory usage, as a percentage of total device memory
    # 'memory_used': round10e5(memory.used / 1024 / 1024),
    gpu_info["gpu_memory_percent"] = round10e5(memory.used * 100 / memory.total)
except nvml.NVMLError_NotSupported:
    pass

try:
    # Get device temperature
    nvml_tmp = nvml.NVML_TEMPERATURE_GPU
    temp = nvml.nvmlDeviceGetTemperature(handle, nvml_tmp)
    # Device temperature
    gpu_info["gpu_temp"] = round10e5(temp)
except nvml.NVMLError_NotSupported:
    pass

try:
    # Compute power usage in watts and percent (NVML reports milliwatts)
    power_watts = nvml.nvmlDeviceGetPowerUsage(handle) / 1000
    power_cap = nvml.nvmlDeviceGetEnforcedPowerLimit(handle)
    power_cap_watts = power_cap / 1000
    power_usage = power_watts / power_cap_watts * 100
    # Power usage in watts and percent
    gpu_info["gpu_power_watts"] = round10e5(power_watts)
    # gpu_info["power_percent"] = round10e5(power_usage)
except nvml.NVMLError_NotSupported:
    pass
Each call to the NVML API is wrapped in a try/except block. If you don't see power consumption, temperature, etc., it means that the specific call failed because the device doesn't support it.
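To make the pattern described above concrete, here is a minimal, self-contained sketch of "one try/except per metric". It is not Aim's implementation: `collect_metrics`, the `NotSupported` exception (a stand-in for `nvml.NVMLError_NotSupported`), and the sample probes are all hypothetical names introduced for illustration. The point it shows is that a metric whose call raises the "not supported" error is simply absent from the result, while any other exception still propagates.

```python
import logging

logger = logging.getLogger(__name__)


class NotSupported(Exception):
    """Hypothetical stand-in for nvml.NVMLError_NotSupported."""


def collect_metrics(probes, unsupported=(NotSupported,)):
    """Run each probe independently and omit metrics whose probe is unsupported.

    `probes` maps a metric name to a zero-argument callable (an assumption of
    this sketch, not how Aim structures its code).
    """
    metrics = {}
    for name, probe in probes.items():
        try:
            metrics[name] = probe()
        except unsupported as exc:
            # Log the skipped metric so a missing chart can be traced back.
            logger.debug("skipping %s: %r", name, exc)
    return metrics


def _unsupported_probe():
    # Simulates a device that cannot report power usage.
    raise NotSupported("power readings unavailable on this device")


example = collect_metrics({
    "gpu": lambda: 37,
    "gpu_temp": lambda: 61,
    "gpu_power_watts": _unsupported_probe,
})
print(example)  # contains only the metrics whose probe succeeded
```

Logging the skipped metric at debug level is a design choice of this sketch: a bare `pass` keeps the tracker quiet, but it also makes a missing chart harder to diagnose.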
@alberttorosyan It was not a device support problem, since calling the NVML APIs directly worked. After some debugging, I found the cause and have opened a PR here: https://github.com/aimhubio/aim/pull/3044
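For anyone who wants to repeat the "call NVML directly" check, a rough sketch follows. It assumes an NVML binding with pynvml-style function names (`nvmlInit`, `nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetMemoryInfo`, `nvmlShutdown`); the helper `gpu_memory_percent` is a hypothetical name, and the binding is passed in as a parameter so the function does not depend on any particular package being installed.

```python
def gpu_memory_percent(nvml):
    """Return {gpu_index: percent of device memory in use}.

    `nvml` is any module or object exposing the pynvml-style functions used
    below (an assumption of this sketch).
    """
    nvml.nvmlInit()
    try:
        usage = {}
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            memory = nvml.nvmlDeviceGetMemoryInfo(handle)
            # Same formula as the snippet above: used bytes over total bytes.
            usage[index] = memory.used * 100 / memory.total
        return usage
    finally:
        # Always release NVML, even if one of the queries raised.
        nvml.nvmlShutdown()
```

On a machine with an NVIDIA driver and the `pynvml` package installed, `import pynvml` followed by `print(gpu_memory_percent(pynvml))` should print one entry per GPU. If this reports memory usage while the dashboard does not, the device itself supports the query.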
Fix merged here - https://github.com/aimhubio/aim/pull/3044
❓Question
I have been using aimstack version 3.17.5 and am unable to see any GPU memory consumption during aimstack runs. The dashboard shows GPU % and GPU temperature, but not the memory used. Is there a way to track down what is going on?
I am happy to share any information about my environment that you may need. Thanks in advance.