Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

Gpu Memory utilization #73

Closed tlgalenson closed 4 years ago

tlgalenson commented 4 years ago

Hi, I am running Einstein@Home on a Radeon VII under Ubuntu 18.04.x. I am running into an issue that looks like I am running out of memory on the GPU. If I exceed, say, 5 Gravity Wave GPU tasks, the GPU tasks stall.

Specifically, the "Memory Load %" goes to 0 while the GPU load % goes to 100%, and the tasks appear to stop calculating and just accumulate wall-clock time.

It would be very helpful if I could determine how much memory in the gpu is being used vs. available. Since I am also running two R5700's on the Gamma Ray Pulsar#1 search (different system) it would be helpful to see if I could boost from 3 gpu tasks to 4 without hitting the memory limit.

If I am missing something that is already present, I apologize and request instructions. Tom M

Ricks-Lab commented 4 years ago

There are additional memory sensors available for Radeon VII, but the amdgpu utilities currently only read and report the memory loading. You can examine the other memory parameters and determine whether any of them provide insight into your issue. First, use amdgpu-ls to get the card path of the GPU in question. If you examine the contents of that directory, you will see several memory-related driver files. You can cat each one to examine its contents. Let me know if you find that any of the additional memory information is useful, and perhaps I can add visibility to it in a future release.
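As an alternative to running cat on each file by hand, the memory files can be dumped in one pass. This is a minimal sketch, not part of gpu-utils: the function name `read_mem_info` is hypothetical, and it assumes each `mem_info_*` file in the card's device directory holds a single integer byte count.

```python
from pathlib import Path


def read_mem_info(device_dir):
    """Return {file name: integer value} for each mem_info_* file in device_dir.

    Assumption: every mem_info_* file contains one integer (a byte count).
    Files that cannot be read or parsed are skipped.
    """
    values = {}
    for path in sorted(Path(device_dir).glob("mem_info_*")):
        try:
            values[path.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Unreadable file or unexpected content: leave it out.
            continue
    return values
```

For example, `read_mem_info("/sys/class/drm/card0/device")` would return a dictionary of the values on a system where that is the card path; the path shown here is only a placeholder, so substitute the one reported by amdgpu-ls.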

csecht commented 4 years ago

Tom raises a good point. The RX 5600 XT also has these sensors, and from looking at its files in the device directory, I can find an explanation for some task performance issues I've seen for E@H gravitational wave crunching. For my two cents, I think it would be handy to have amdgpu-monitor list, right below Mem Load %, an entry for Mem Use %, which would be mem_info_gtt_used divided by mem_info_gtt_total.
It also would be handy if amdgpu-ls reported, in the last section, following Current Memory Loading:

   Current Memory Used (GB): <mem_info_gtt_used>
      Total Memory (GB): <mem_info_gtt_total>
   Current Memory VRAM Used (GB): <mem_info_vram_used>
      Total Memory VRAM (GB): <mem_info_vram_total>

Current memory values vary depending on the number of concurrent tasks, the data in the current task, and the app being used by the boinc-client. There has been a bit of discussion on the E@H forums about GPU memory and VRAM use for certain tasks. I don't see the need to plot any of the memory values because they don't seem to be that dynamic.
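The Mem Use % entry suggested above is a ratio of two byte counts. Here is a minimal sketch of that calculation, assuming the values have already been read as integers. The function name `usage_percent` and the sample numbers are made up for illustration and are not taken from amdgpu-monitor.

```python
def usage_percent(used_bytes, total_bytes):
    """Return used/total as a percentage, or None if total is not positive."""
    if total_bytes <= 0:
        return None
    return 100.0 * used_bytes / total_bytes


# Illustrative byte counts only; real values come from the driver files.
sample = {
    "mem_info_gtt_used": 134217728,
    "mem_info_gtt_total": 4294967296,
    "mem_info_vram_used": 3221225472,
    "mem_info_vram_total": 4294967296,
}

for kind in ("gtt", "vram"):
    pct = usage_percent(sample[f"mem_info_{kind}_used"],
                        sample[f"mem_info_{kind}_total"])
    print(f"{kind.upper()} usage: {pct:.3f}%")
```

Returning None for a zero total avoids a division error on cards that do not report a given memory type.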

Ricks-Lab commented 4 years ago

I started working on this. So far it is only visible in amdgpu-ls, and the formatting still needs improvement. It is available in the latest on master.

csecht commented 4 years ago

That's a nice feature addition for amdgpu-ls.

Ricks-Lab commented 4 years ago

I am now including usage and improved formatting in amdgpu-ls. Will work on monitor next.

Ricks-Lab commented 4 years ago

The latest on master includes memory in amdgpu-monitor.

Ricks-Lab commented 4 years ago

One concern is that GTT memory appears to be system memory on my Radeon VII system, but is the same as GPU VRAM for Fiji and Vega64 systems. Maybe there is a better description of GTT memory.

tlgalenson commented 4 years ago

Thank you Rick. I will download/update my utilities tomorrow and see if that helps.

I fired up the Radeon VII I have under Windows where it "cheerfully" ran 7 Gamma Ray (E@H) tasks without running out of gpu memory. But the memory controller couldn't quite manage 8. Under Linux the same tasks seem to be limited to 4-5 on a Radeon VII.

I hope all this will help me figure out how far I can push these gpus. I probably will end up only with the R5700's rather than keeping the Radeon VII. That will leave me with 6-9 RX 5XX gpus to stare at on one system.

Tom M

csecht commented 4 years ago

Downloaded the latest Master. From my RX 5600 XT, amdgpu-ls lists Total GTT Memory (GB): 5.984, but the content of mem_info_gtt_total is 6425673728 (6.425 GB). Similarly, for my RX 570, total GTT is listed as 4.000 GB, but mem_info_gtt_total has 4294967296 (4.294 GB). Why the difference?

The RX 5600 XT and RX 570 are the same as your Fiji and Vega64 cards; total VRAM is the same as total GTT.

When running the RX 5600 XT with E@H gravitational wave tasks, my VRAM Usage is ~83%, depending on the tasks being run, but GTT Usage is 0.763%, which I don't understand. That seems like mighty low GPU memory usage. On the RX 570s running E@H pulsar tasks, VRAM Usage is ~35% and GTT Usage is ~3%. Again, unexpectedly low.

Ricks-Lab commented 4 years ago

> Downloaded the latest Master. From my RX 5600 XT, amdgpu-ls lists Total GTT Memory (GB): 5.984, but the content of mem_info_gtt_total is 6425673728 (6.425 GB). Similarly, for my RX 570, total GTT is listed as 4.000 GB, but mem_info_gtt_total has 4294967296 (4.294 GB). Why the difference?

I’m using 1024 instead of 1000 for conversions, because it seems to match advertised capacity better. I researched this and found a standard published in 1998 which says that 1000 should be used for KB and 1024 for KiB, but it is not widely followed.
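The two readings of the same byte count can be checked directly. This is a small sketch using the mem_info_gtt_total values quoted above; the helper names `bytes_to_gb` and `bytes_to_gib` are hypothetical and are not the functions used in amdgpu-ls.

```python
def bytes_to_gb(n):
    """Decimal gigabytes: 1 GB = 1000**3 bytes."""
    return n / 1000**3


def bytes_to_gib(n):
    """Binary gibibytes: 1 GiB = 1024**3 bytes."""
    return n / 1024**3


# Byte counts reported in this thread for the RX 5600 XT and RX 570.
for n in (6425673728, 4294967296):
    print(f"{n} bytes = {bytes_to_gb(n):.3f} GB = {bytes_to_gib(n):.3f} GiB")
```

With a divisor of 1024**3, the two values come out as 5.984 and 4.000, matching what amdgpu-ls lists. With 1000**3 they are about 6.426 and 4.295.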