Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm
Other
8.1k stars 293 forks source link

AMDGPU support #135

Closed zhuyifei1999 closed 2 years ago

zhuyifei1999 commented 2 years ago

A picture is worth a thousand words: Screenshot_2022-03-21_10-53-19 c

This PR isn't fully ready for merge yet. I think I'd like some feedback and do some polishing first (hence the RFC). My AMDGPU is also an iGPU / APU which limits how much I can test myself.

Most the access is using libdrm and kernel APIs. <drm/amdgpu_drm.h> is a kernel UAPI header so I directly included it. For libdrm I'm using dlopen and re-declaring the constants to avoid a compile-time dependency, just like what the original code does with NVML.

I also did some code refactoring to make my life easier, hope that is okay. I really cannot deal with typedef structs

CC #106, i915 kernel changes doesn't seem to have merged into mainline yet.

Syllo commented 2 years ago

Amazing! I will try to take a look at it this weekend.

Syllo commented 2 years ago

I successfully tested on a system I have access to with an AMD Radeon RX 6800 XT. I also fixed some issues I encountered while browsing the code.

For the fan speed, I saw that you added a comment. I did not find a way to get it through AMDGPU. lm_sensors seem to get the correct info though, so I will investigate where to retrieve it.

Edit: Never mind I found that for the fan https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

zhuyifei1999 commented 2 years ago

Awesome!

Yeah I didn't bother with checking for close(-1) since it's practically a no-op, similar to free(NULL); though unlike free it has syscall overhead.

The reason I didn't do fan speed was that it seemed like it was slightly convoluted to retrieve. Rather than being a libdrm API call, I have to go through sysfs; doable but code can get messy and I can't test it.

Syllo commented 2 years ago

Other than that, does the patch work on your APU? I am wondering, because in that case the fan is shared with the CPU and might not be registered with HWMON as such.

zhuyifei1999 commented 2 years ago

The APU does not expose fan information on the GPU's hwmon, AFAICT.

zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ lspci | grep VGA
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne (rev d1)
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls
ari_enabled                devspec           hwmon          max_link_speed           msi_irqs                           pp_dpm_sclk            rescan         smartshift_apu_power
backlight                  dma_mask_bits     i2c-0          max_link_width           numa_node                          pp_dpm_socclk          reset          smartshift_dgpu_power
boot_vga                   driver            i2c-1          mem_info_gtt_total       pcie_replay_count                  pp_force_state         reset_method   subsystem
broken_parity_status       driver_override   i2c-12         mem_info_gtt_used        power                              pp_mclk_od             resource       subsystem_device
class                      drm               i2c-2          mem_info_preempt_used    power_dpm_force_performance_level  pp_num_states          resource0      subsystem_vendor
config                     enable            i2c-3          mem_info_vis_vram_total  power_dpm_state                    pp_od_clk_voltage      resource0_wc   thermal_throttling_logging
consistent_dma_mask_bits   firmware_node     iommu          mem_info_vis_vram_used   power_state                        pp_power_profile_mode  resource2      uevent
consumer:pci:0000:08:00.1  fw_version        iommu_group    mem_info_vram_total      pp_cur_state                       pp_sclk_od             resource2_wc   vbios_version
current_link_speed         gpu_busy_percent  irq            mem_info_vram_used       pp_dpm_dcefclk                     pp_table               resource4      vendor
current_link_width         gpu_metrics       link           mem_info_vram_vendor     pp_dpm_fclk                        product_name           resource5
d3cold_allowed             graphics          local_cpulist  modalias                 pp_dpm_mclk                        product_number         revision
device                     hdcp_srm          local_cpus     msi_bus                  pp_dpm_pcie                        remove                 serial_number
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls hwmon/
hwmon1
zhuyifei1999@zhuyifei1999-ThinkPad-P14s-Gen-2a /sys/bus/pci/devices/0000:08:00.0 $ ls hwmon/hwmon1/
device  freq1_input  freq1_label  in0_input  in0_label  in1_input  in1_label  name  power  power1_average  power1_label  subsystem  temp1_input  temp1_label  uevent
Syllo commented 2 years ago

I think that we are almost good. I will update the README and check for the amdgpu_drm.h to conditionally compile the support for AMDGPU.

zhuyifei1999 commented 2 years ago

Nice!

I'm including amdgpu_drm.h directly because it's a kernel UAPI header, and I think any sane compiler toolchain on Linux would include the kernel UAPI headers. Though, up to you, could copy the declarations too since kernel-userspace ABI has guaranteed stability.

Syllo commented 2 years ago

The problem with NVIDIA was that the nvml header was not packaged on all the distributions, and downloading a recent header could break the build. Also I was not certain I could redistribute a version of the nvidia header. That is why I included the definitions directly.

Things are different with libdrm. The header are consistently available on any distribution and updated with the kernel. Much cleaner than NVIDIA.

Hence, relying on libdrm for AMDGPU support is fine in my opinion!

If you are fine with the latest changes I will merge it into master.

zhuyifei1999 commented 2 years ago

The problem with NVIDIA was that the nvml header was not packaged on all the distributions, and downloading a recent header could break the build. Also I was not certain I could redistribute a version of the nvidia header. That is why I included the definitions directly.

I see.

Neat Videocard TOP

Nice naming lol

If you are fine with the latest changes I will merge it into master.

No objections. Go ahead. Thanks!