Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

MSM and Adreno support #204

Closed Sonicadvance1 closed 1 year ago

Sonicadvance1 commented 1 year ago

An image to show it working: [image: Image_2023-04-09_20-37-40]

Mostly based on the amdgpu implementation, which uses a combination of the drm and fdinfo interfaces for querying information.

Some things to note.

robclark commented 1 year ago

fwiw, there is a msm_gpu_freq_change tracepoint, which might be a reasonable way to monitor gpu freq. It works in the same way as the i915 tracepoint (not sure if nvtop uses that?)..

As far as temp, I think there are usually one or more tsens's linked to gpu.. and I think they should show up in sysfs. Not sure what the portable way to fish that out would be.. maybe find the cooling-device associated with the gpu and then see what tsens's it uses?
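As a rough illustration of the sysfs lookup suggested above, the sketch below walks the generic thermal zone directories and returns the temperature of the first zone whose `type` contains a given substring. The `/sys/class/thermal/thermal_zoneN/{type,temp}` layout is the standard kernel interface; the helper names and the idea of matching on the substring "gpu" are assumptions for illustration, since zone naming varies between SoCs.

```c
#include <stdio.h>
#include <string.h>

// Reads the first line of `path` into `buf` and strips the trailing newline.
// Returns 0 on success, -1 if the file cannot be opened or is empty.
static int read_sysfs_line(const char *path, char *buf, size_t len) {
  FILE *f = fopen(path, "r");
  if (!f)
    return -1;
  char *ok = fgets(buf, (int)len, f);
  fclose(f);
  if (!ok)
    return -1;
  buf[strcspn(buf, "\n")] = '\0';
  return 0;
}

// Looks through thermal_zone0..thermal_zone255 under `base` for the first
// zone whose type contains `needle` (hypothetical matching rule) and stores
// its temperature, in millidegrees Celsius, in `millideg`.
// Returns 1 if a matching zone was read, 0 otherwise.
static int find_zone_temp(const char *base, const char *needle, long *millideg) {
  for (int i = 0; i < 256; ++i) {
    char path[512], type[64], temp[32];
    snprintf(path, sizeof(path), "%s/thermal_zone%d/type", base, i);
    if (read_sysfs_line(path, type, sizeof(type)) != 0)
      continue; // zone numbers may have gaps
    if (!strstr(type, needle))
      continue;
    snprintf(path, sizeof(path), "%s/thermal_zone%d/temp", base, i);
    if (read_sysfs_line(path, temp, sizeof(temp)) != 0)
      continue;
    if (sscanf(temp, "%ld", millideg) == 1)
      return 1;
  }
  return 0;
}
```

On a real system this would be called as `find_zone_temp("/sys/class/thermal", "gpu", &t)`. Substring matching is only a heuristic; it does not establish which sensor is actually wired to the GPU, which is the open question raised here.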

Sonicadvance1 commented 1 year ago

> fwiw, there is a msm_gpu_freq_change tracepoint, which might be a reasonable way to monitor gpu freq. It works in the same way as the i915 tracepoint (not sure if nvtop uses that?)..
>
> As far as temp, I think there are usually one or more tsens's linked to gpu.. and I think they should show up in sysfs. Not sure what the portable way to fish that out would be.. maybe find the cooling-device associated with the gpu and then see what tsens's it uses?

It would be nice if there were some portable way to query this, but in most cases I'm hardcoding the /sys/devices/platform/soc@0/3d00000.gpu/ path in things I'm doing locally.

~For "current" frequency through fdinfo it might be reasonable to average all the drm-maxfreq-gpu frequencies? Not sure if actually better.~ Never mind, drm-maxfreq-gpu won't work since that is just the max frequency of the GPU?

Sonicadvance1 commented 1 year ago
diff --git a/src/extract_gpuinfo_msm.c b/src/extract_gpuinfo_msm.c
index 244240e..36c0fb3 100644
--- a/src/extract_gpuinfo_msm.c
+++ b/src/extract_gpuinfo_msm.c
@@ -69,6 +69,8 @@ struct gpu_info_msm {
   struct gpu_info base;
   int fd;

+  int freq_file; // For the current frequency tracking.
+
   struct msm_process_info_cache *last_update_process_cache, *current_update_process_cache; // Cached processes info
 };

@@ -206,6 +208,10 @@ void gpuinfo_msm_shutdown(void) {
   for (unsigned i = 0; i < msm_gpu_count; ++i) {
     struct gpu_info_msm *current = &gpu_infos[i];
     _drmFreeVersion(current->drmVersion);
+
+    if (current->freq_file != -1) {
+      close(current->freq_file);
+    }
   }

   free(gpu_infos);
@@ -421,6 +427,9 @@ static bool gpuinfo_msm_get_device_handles(struct list_head *devices, unsigned *
     gpu_infos[msm_gpu_count].fd = fd;
     gpu_infos[msm_gpu_count].base.vendor = &gpu_vendor_msm;

+    // This path is currently hardcoded for my device.
+    gpu_infos[msm_gpu_count].freq_file = open("/sys/devices/platform/soc@0/3d00000.gpu/devfreq/3d00000.gpu/cur_freq", O_RDONLY);
+
     list_add_tail(&gpu_infos[msm_gpu_count].base.list, devices);
     // Register a fdinfo callback for this GPU
     processinfo_register_fdinfo_callback(parse_drm_fdinfo_msm, &gpu_infos[msm_gpu_count].base);
@@ -433,6 +442,24 @@ static bool gpuinfo_msm_get_device_handles(struct list_head *devices, unsigned *
   return true;
 }

+static int read_pattern_from_fd(int fd, const char *format, ...) {
+  if (fd == -1) {
+    return 0;
+  }
+
+  char Tmp[16]; // pread below reads at most sizeof(Tmp) - 1 bytes, leaving room for the terminator.
+  ssize_t Read = pread(fd, Tmp, sizeof(Tmp) - 1, 0);
+  if (Read < 0) {
+    return 0;
+  }
+  Tmp[Read] = '\0';
+  va_list args;
+  va_start(args, format);
+  int matches = vsscanf(Tmp, format, args);
+  va_end(args);
+  return matches;
+}
+
 static int gpuinfo_msm_query_param(int gpu, uint32_t param, uint64_t *value) {
   struct drm_msm_param req = {
     .pipe = MSM_PIPE_3D0, // Only the 3D pipe.
@@ -479,10 +506,13 @@ void gpuinfo_msm_refresh_dynamic_info(struct gpu_info *_gpu_info) {
   dynamic_info->encode_decode_shared = true;

   // GPU clock
+  uint64_t current_gpu_freq;
+  if (read_pattern_from_fd(gpu_info->freq_file, "%lu", &current_gpu_freq) == 1) {
+    SET_GPUINFO_DYNAMIC(dynamic_info, gpu_clock_speed, current_gpu_freq / 1000000);
+  }
+
   uint64_t val;
   if (gpuinfo_msm_query_param(gpu_info->fd, MSM_PARAM_MAX_FREQ, &val) == 0) {
-    // TODO: No way to query current clock speed.
-    SET_GPUINFO_DYNAMIC(dynamic_info, gpu_clock_speed, val / 1000000);
     SET_GPUINFO_DYNAMIC(dynamic_info, gpu_clock_speed_max, val / 1000000);
   }

If there were some way to derive the hardcoded path correctly, then this would likely be precise enough for this application.
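One possible way to avoid the hardcoded path, sketched here under assumptions rather than taken from the patch: scan the generic `/sys/class/devfreq` directory for a device whose name ends in `.gpu` (as `3d00000.gpu` does above) and build the `cur_freq` path from it. The helper names and the `.gpu` suffix heuristic are hypothetical, and a system with several matching devfreq devices would need a stricter rule.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

// Returns 1 if `name` ends with `suffix`, 0 otherwise.
static int ends_with(const char *name, const char *suffix) {
  size_t n = strlen(name), s = strlen(suffix);
  return n >= s && strcmp(name + n - s, suffix) == 0;
}

// Scans `devfreq_dir` for the first entry ending in ".gpu" and writes the
// path of its cur_freq file to `out`.
// Returns 1 if a path was written, 0 if nothing matched or `out` is too small.
static int find_gpu_cur_freq(const char *devfreq_dir, char *out, size_t len) {
  DIR *dir = opendir(devfreq_dir);
  if (!dir)
    return 0;
  int found = 0;
  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    if (!ends_with(entry->d_name, ".gpu"))
      continue;
    int n = snprintf(out, len, "%s/%s/cur_freq", devfreq_dir, entry->d_name);
    found = n > 0 && (size_t)n < len;
    break;
  }
  closedir(dir);
  return found;
}
```

The resulting path could then be passed to `open()` in place of the literal string in `gpuinfo_msm_get_device_handles`, with `freq_file` staying at -1 when nothing is found.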

robclark commented 1 year ago

> > fwiw, there is a msm_gpu_freq_change tracepoint, which might be a reasonable way to monitor gpu freq. It works in the same way as the i915 tracepoint (not sure if nvtop uses that?).. As far as temp, I think there are usually one or more tsens's linked to gpu.. and I think they should show up in sysfs. Not sure what the portable way to fish that out would be.. maybe find the cooling-device associated with the gpu and then see what tsens's it uses?
>
> It would be nice if there were some portable way to query this, but in most cases I'm hardcoding the /sys/devices/platform/soc@0/3d00000.gpu/ path in things I'm doing locally.
>
> ~For "current" frequency through fdinfo it might be reasonable to average all the drm-maxfreq-gpu frequencies? Not sure if actually better.~ Never mind, drm-maxfreq-gpu won't work since that is just the max frequency of the GPU?

yeah, just the max-freq

But it's possible that we could add something in fdinfo..

(Longer term, ideally we could get everything from fdinfo so that tools like nvtop don't need much driver-specific support)
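To make the fdinfo idea concrete, here is a minimal sketch of pulling one numeric value out of a `key: value` line such as those found in `/proc/<pid>/fdinfo/<fd>`. The function name is hypothetical and the sample line layout used in the usage note (`drm-maxfreq-gpu:` followed by a number and a unit) is an assumption about the format; this is not nvtop's actual parser.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Parses a single fdinfo-style line of the form "<key>:<whitespace><number> ...".
// Returns 1 and stores the number in `value` if `line` starts with `key`
// followed by a colon and an unsigned decimal number, 0 otherwise.
static int parse_fdinfo_u64(const char *line, const char *key, uint64_t *value) {
  size_t klen = strlen(key);
  if (strncmp(line, key, klen) != 0 || line[klen] != ':')
    return 0;
  const char *p = line + klen + 1;
  while (*p == ' ' || *p == '\t')
    ++p;
  if (!isdigit((unsigned char)*p))
    return 0;
  char *end;
  errno = 0;
  unsigned long long v = strtoull(p, &end, 10);
  if (errno != 0 || end == p)
    return 0;
  *value = (uint64_t)v;
  return 1;
}
```

A caller would read the fdinfo file line by line and try `parse_fdinfo_u64(line, "drm-maxfreq-gpu", &hz)` on each one. If a current-frequency key were ever added, as floated above, the same helper would apply with a different key.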

robclark commented 1 year ago

> If there were some way to derive the hardcoded path correctly, then this would likely be precise enough for this application.

there is /proc/device-tree.. but I guess if we can get more from fdinfo or some other standardized way then that might be better for making things work the same way across drivers..

Syllo commented 1 year ago

Hello, Thanks for this patch, I'll take a look at it shortly.

I generally am not against using a dependency if it simplifies the code and is pretty much guaranteed to be present on the different distributions. Even more so if it can be used by multiple drivers. Although we'd need some wrapper around the lib since nvtop relies on the dynamic loader to support multiple GPUs without requiring all the GPU libs to be present on all the systems.
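The kind of wrapper described here could look roughly like the sketch below: the library is opened at runtime with `dlopen`, and every symbol is looked up through the handle, so a missing library just disables that backend instead of preventing startup. The `optional_lib` type and function names are hypothetical and are not nvtop's actual loader code.

```c
#include <dlfcn.h>
#include <stddef.h>

// Handle for a library that may or may not be installed on the system.
struct optional_lib {
  void *handle;
};

// Tries to load `soname`. Returns 1 on success, 0 if the library is absent.
static int optional_lib_open(struct optional_lib *lib, const char *soname) {
  lib->handle = dlopen(soname, RTLD_LAZY | RTLD_LOCAL);
  return lib->handle != NULL;
}

// Looks up `name` in the library, or returns NULL if it was never loaded.
static void *optional_lib_symbol(const struct optional_lib *lib, const char *name) {
  return lib->handle ? dlsym(lib->handle, name) : NULL;
}

// Releases the library if it was loaded; safe to call on a failed open.
static void optional_lib_close(struct optional_lib *lib) {
  if (lib->handle) {
    dlclose(lib->handle);
    lib->handle = NULL;
  }
}
```

Each GPU backend would call `optional_lib_open` once during initialization and skip itself when it returns 0. On older glibc versions the program may need to be linked with `-ldl`.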

@robclark you mentioned trace-points, but if I remember correctly they usually require some additional permission from user space and I want to avoid depending on anything that requires elevated privileges.

robclark commented 1 year ago

> I generally am not against using a dependency if it simplifies the code and is pretty much guaranteed to be present on the different distributions. Even more so if it can be used by multiple drivers. Although we'd need some wrapper around the lib since nvtop relies on the dynamic loader to support multiple GPUs without requiring all the GPU libs to be present on all the systems.

The dependency would be libGL, if you wanted to use GL_RENDERER to get the GPU name in a generic way. (Maybe you'd also need gbm to ensure you are getting the correct GPU in the case of multiple GPUs?)

> @robclark you mentioned trace-points, but if I remember correctly they usually require some additional permission from user space and I want to avoid depending on anything that requires elevated privileges.

Yeah, I think trace-points would need root.. although fdinfo also needs permissions to access information about other users' processes.

We could probably hold off on freq/temp for now.. there is currently some discussion on dri-devel about how we could expose some of this in a more portable way. (Fwiw, the fdinfo stuff for utilization and memory usage should also work on i915/amdgpu.. my hope is that we can get to the point where tools like nvtop don't need vendor specific stuff, at least for the upstream drm drivers.)

Syllo commented 1 year ago

@robclark thanks for the clarification. AMDGPU and Intel both have the fdinfo and hwmon support. AMD exposes temp, fan and power device info through hwmon. Intel only provides power for their dedicated graphics cards.

robclark commented 1 year ago

> Sounds good. I guess that there is little chance for the name to be changed?

There is still some debate, so I'd hold off merging at least the GPU memory info until that is resolved.

robclark commented 1 year ago

> @robclark thanks for the clarification. AMDGPU and Intel both have the fdinfo and hwmon support. AMD exposes temp, fan and power device info through hwmon. Intel only provides power for their dedicated graphics cards.

There are also hwmon's that expose temp for adreno.. the open question is how userspace can find them in a portable way, since they aren't directly linked (currently) to the drm device. I guess we need to invent some way to link them so that userspace can figure out which hwmon's to look at.. there are a lot, associated with various different hw blocks on the SoC.
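To show what userspace has to work with today, the following sketch enumerates the `name` attribute of every chip under a hwmon class directory. The `/sys/class/hwmon/hwmonN/name` layout is the standard kernel interface; the function name and the fixed name length are assumptions for illustration. Nothing in this listing ties an entry to the drm device, which is exactly the missing link described above.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define HWMON_NAME_LEN 64

// Collects the chip names of up to `max` hwmon devices found under `base`.
// Returns the number of names stored in `names`.
static int list_hwmon_names(const char *base, char names[][HWMON_NAME_LEN], int max) {
  DIR *dir = opendir(base);
  if (!dir)
    return 0;
  int count = 0;
  struct dirent *entry;
  while (count < max && (entry = readdir(dir)) != NULL) {
    if (strncmp(entry->d_name, "hwmon", 5) != 0)
      continue;
    char path[512];
    int n = snprintf(path, sizeof(path), "%s/%s/name", base, entry->d_name);
    if (n < 0 || (size_t)n >= sizeof(path))
      continue; // path did not fit, skip this entry
    FILE *f = fopen(path, "r");
    if (!f)
      continue;
    if (fgets(names[count], HWMON_NAME_LEN, f)) {
      names[count][strcspn(names[count], "\n")] = '\0';
      ++count;
    }
    fclose(f);
  }
  closedir(dir);
  return count;
}
```

Calling `list_hwmon_names("/sys/class/hwmon", names, 32)` on an SoC would typically return many entries for different hardware blocks, and a tool would still have to guess which of them belongs to the GPU.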