Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

v3.0 documentation #59

Closed csecht closed 4 years ago

csecht commented 4 years ago

Here are some ideas for edits to the User Guide. Let me know what you think and I can include them in a pull request.

In the "Getting Started" section, it says: After saving, update grub:

sudo update-grub

and then reboot.

But after updating the ppfeaturemask code in grub, I didn't have to reboot for new PAC features (e.g. overclocking) to work. However, amdgpu-ls still lists the feature mask as what it was before the grub update. Is the featuremask code reported by amdgpu-ls read from the last boot record and not the current grub file? Is a reboot only necessary following update-grub for the initial loading of amdgpu.ppfeaturemask? This is more for my clarification on how grub works than for any edits to the text.

In the "Using amdgpu-ls" section, I see in the ppm graphic that the timings table was removed. If users wonder what's going on when the table of timing values is reported in their terminal, however, it may be helpful to add an explanation, unless you just want to keep clutter out of the User Guide. From the ROCm-smi page, https://github.com/RadeonOpenCompute/ROC-smi/tree/roc-2.7.0, the column headers for the ppm timings table could be included along with brief definitions, like this:

Card Number: 1
   Card Model: Radeon RX 570
   Card: /sys/class/drm/card1/device
   Power Performance Mode: manual
                    SCLK_UP_HYST  SCLK_DOWN_HYST  SCLK_ACTIVE_LEVEL  MCLK_UP_HYST  MCLK_DOWN_HYST  MCLK_ACTIVE_LEVEL
 0:   BOOTUP_DEFAULT          -             -             -             -             -             -
 1:   3D_FULL_SCREEN          0           100            30             0           100            10
 2:     POWER_SAVING         10            0             30             -             -             -
 3:            VIDEO          -            -              -            10            16            31 
 4:               VR          0           11             50             0           100            10
 5:          COMPUTE          0            5             30             0           100            10
 6:           CUSTOM          -            -              -             -             -             -
-1:             AUTO          Auto

(Text extracted and paraphrased from the ROCm-smi readme, https://github.com/RadeonOpenCompute/ROC-smi/tree/roc-2.7.0)

SCLK_UP_HYST - Delay before sclk is increased (in milliseconds).
SCLK_DOWN_HYST - Delay before sclk is decreased (in milliseconds).
SCLK_ACTIVE_LEVEL - Workload required before sclk levels change (in %).
MCLK_UP_HYST - Delay before mclk is increased (in milliseconds).
MCLK_DOWN_HYST - Delay before mclk is decreased (in milliseconds).
MCLK_ACTIVE_LEVEL - Workload required before mclk levels change (in %).

Values displayed as '-' are hidden fields and are not enabled. When a compute queue is detected, the COMPUTE Power Profile values will be automatically applied to the system, provided that the Perf Level is set to "auto". The CUSTOM Power Profile is only applied when the Performance Level is set to "manual" and can be specified using ROCm-smi (??with rocm loaded??). It is not possible to modify non-CUSTOM Profiles because these are hard-coded by the kernel.

Maybe include this descriptive text in --ppm terminal output instead of adding it to the User Guide?
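One way the legend could be attached to the --ppm terminal output is sketched below. This is only an illustration of the idea, not code from amdgpu-utils: the names `PPM_COLUMN_HELP` and `format_ppm_legend` are hypothetical, and the descriptions are the ones paraphrased from the ROCm-smi readme above.

```python
# Hypothetical legend for the legacy ppm timing table columns.
# Descriptions are paraphrased from the ROCm-smi readme quoted in this thread.
PPM_COLUMN_HELP = {
    "SCLK_UP_HYST": "Delay before sclk is increased (ms)",
    "SCLK_DOWN_HYST": "Delay before sclk is decreased (ms)",
    "SCLK_ACTIVE_LEVEL": "Workload required before sclk levels change (%)",
    "MCLK_UP_HYST": "Delay before mclk is increased (ms)",
    "MCLK_DOWN_HYST": "Delay before mclk is decreased (ms)",
    "MCLK_ACTIVE_LEVEL": "Workload required before mclk levels change (%)",
}


def format_ppm_legend(columns=None):
    """Return a text legend, one aligned line per column name."""
    names = list(columns) if columns is not None else list(PPM_COLUMN_HELP)
    width = max((len(name) for name in names), default=0)
    lines = [
        f"{name:<{width}}  {PPM_COLUMN_HELP.get(name, '(no description available)')}"
        for name in names
    ]
    lines.append("Values shown as '-' are hidden fields and are not enabled.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_ppm_legend())
```

Keeping the text in one dictionary would let the same definitions feed both the terminal output and the User Guide, if that turns out to be useful.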

In the "Using amdgpu-monitor" section, the terminal output and GUI graphics need to be updated, and descriptive text should be added for Memory Load monitoring.

In the "Using amdgpu-pac" section, after "If you know how to obtain the current value, please let me know!", add: "When changing sclk P-state MHz or mV, the desired P-state mask, if different from default, will have to be re-entered for speed or voltage changes to be applied." At least this is how it has been working for me.

Need to get confirmation that ver.3.0 works with RX 5xxx-series (Navi) cards?

In the "Setting GPU Automatically at Startup" section, change the section header to "Running Startup amdgpu-pac Bash Files" (and change the ToC index entry). Add instructions for setting up $HWMON variables to handle a shifting hwmon# (thus increasing the chances of the bash files writing the desired GPU parameters)? The startup bash file probably doesn't need the --force_write option; it just needs the changes from default settings.
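As a rough illustration of the shifting hwmon# problem: the hwmon directory for a card can be looked up at run time instead of being hard-coded. The sketch below assumes the usual sysfs layout, in which the card's device directory contains `hwmon/hwmon<N>`; the function name `find_hwmon_dir` is hypothetical and is not part of amdgpu-utils.

```python
import glob
import os


def find_hwmon_dir(card_device_dir):
    """Return the hwmon directory under a card's device directory, or None.

    The hwmon index can differ between boots, so it is globbed for
    rather than written into the path.
    """
    pattern = os.path.join(card_device_dir, "hwmon", "hwmon*")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


if __name__ == "__main__":
    # Card path taken from the amdgpu-ls output shown earlier in this thread.
    print(find_hwmon_dir("/sys/class/drm/card1/device"))
```

A generated bash file could apply the same idea by assigning the globbed path to a variable once and reusing it for every write.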

Ricks-Lab commented 4 years ago

Updating the User Guide is definitely a good idea. Let me know of any issues making pull requests. I will answer each concern in a separate response.

Ricks-Lab commented 4 years ago

For grub updates, I found that the updates were not consistently effective without a reboot. amdgpu-utils reads the featuremask from the first line of the 'ppfeatures' file in the card's device directory. It is only read when a utility is first executed.
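To see why the reported value can lag behind an edited grub file, it may help to compare what the running kernel was booted with against what the driver reports. The sketch below is not how amdgpu-utils reads the mask; it assumes the boot parameters are visible in `/proc/cmdline` and that the amdgpu module exposes the value at `/sys/module/amdgpu/parameters/ppfeaturemask`. The helper names are hypothetical.

```python
import re

# Matches e.g. "amdgpu.ppfeaturemask=0xfffd7fff" on a kernel command line.
_MASK_RE = re.compile(r"(?:^|\s)amdgpu\.ppfeaturemask=(\S+)")


def mask_from_cmdline(cmdline):
    """Return the ppfeaturemask given on a kernel command line, or None."""
    match = _MASK_RE.search(cmdline)
    if match is None:
        return None
    try:
        # Base 0 accepts both hexadecimal (0x...) and decimal notation.
        return int(match.group(1), 0)
    except ValueError:
        return None


def read_text(path):
    """Return the stripped contents of a file, or None if it is unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    booted = read_text("/proc/cmdline")
    print("mask on booted command line:",
          None if booted is None else mask_from_cmdline(booted))
    print("mask reported by module:",
          read_text("/sys/module/amdgpu/parameters/ppfeaturemask"))
```

Both values describe the kernel that is currently running, so neither changes when update-grub rewrites the configuration for the next boot.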

Ricks-Lab commented 4 years ago

The ppm table changes between generations of GPUs. Here is what it looks like for Vega20:

PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS UseRlcBusy MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
 0 BOOTUP_DEFAULT :
                    0(       GFXCLK)       0       0       1       0       4     800 4587520  -65536       0
                    1(       SOCCLK)       0       0       1       0       4     800  327680   -6553       0
                    2(         UCLK)       0       0       1       0       4     800  327680  -65536       0
                    3(         FCLK)       0       0       0       0       4     800  327680   -6553       0
 1 3D_FULL_SCREEN :
                    0(       GFXCLK)       0       1       1       0       4     800 4587520  -65536       0
                    1(       SOCCLK)       0       1       4     850       4     800  327680  -65536       0
                    2(         UCLK)       0       1       4     850       4     800  327680  -65536       0
                    3(         FCLK)       0       1       4     850       4     800  327680  -65536       0
 2   POWER_SAVING :
                    0(       GFXCLK)       0       0       1       0       3       0 5898240  -65536       0
                    1(       SOCCLK)       0       0       1       0       3       0 1310720   -6553       0
                    2(         UCLK)       0       0       1       0       3       0 1966080  -65536       0
                    3(         FCLK)       0       0       0       0       3     800 1966080   -6553       0
 3          VIDEO :
                    0(       GFXCLK)       0       1       1       0       4     500 4587520   -6553       0
                    1(       SOCCLK)       0       0       1       0       4     500 1310720   -6553       0
                    2(         UCLK)       0       0       1       0       4     500 1966080  -65536       0
                    3(         FCLK)       0       0       3       0       4     500 1966080   -6553       0
 4             VR :
                    0(       GFXCLK)       0       1       0    1540       4     800 5898240   -6553   65536
                    1(       SOCCLK)       0       1       2       0       4     800  327680  -32768  -65536
                    2(         UCLK)       0       1       2       0       4     800  327680  -32768  -65536
                    3(         FCLK)       0       1       2       0       4     800  327680  -32768  -65536
 5        COMPUTE*:
                    0(       GFXCLK)       0       1       0    1600       3       0 3932160  -65536  -65536
                    1(       SOCCLK)       0       0       4     850       3       0  327680  -65536  -32768
                    2(         UCLK)       0       0       4     850       3       0  327680  -65536  -32768
                    3(         FCLK)       0       0       4     850       3       0  327680  -65536  -32768
 6         CUSTOM :
                    0(       GFXCLK)       0       0       1       0       4     800 4587520  -65536       0
                    1(       SOCCLK)       0       0       1       0       4     800  327680   -6553       0
                    2(         UCLK)       0       0       1       0       4     800  327680  -65536       0
                    3(         FCLK)       0       0       0       0       4     800  327680   -6553       0

Not sure of the best approach, but since amdgpu-utils does not include the ability to manage details of the table, maybe a simplified version is best. Perhaps, I could develop an option to display the entire contents of the pp_power_profile_mode device file.
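For anyone who wants to work with the Vega20 layout shown above, a minimal parsing sketch follows. It assumes the file looks exactly like the listing in this comment (a header line, then profile lines ending in ':' with '*' marking the active profile, then one row per clock type); `parse_vega20_ppm` is a hypothetical name, not an amdgpu-utils function.

```python
import re

# Profile line, e.g. " 0 BOOTUP_DEFAULT :" or " 5        COMPUTE*:"
_PROFILE_RE = re.compile(r"^\s*(\d+)\s+(\w+)\s*(\*?)\s*:\s*$")
# Clock row, e.g. "   0(       GFXCLK)       0       0       1 ..."
_CLOCK_RE = re.compile(r"^\s*(\d+)\(\s*(\w+)\)\s+(.*)$")


def parse_vega20_ppm(text):
    """Parse Vega20-style ppm text into ({profile: {clock: {column: value}}}, active)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {}, None
    # The first two header fields label the profile and clock; the rest are value columns.
    columns = lines[0].split()[2:]
    profiles = {}
    active = None
    current = None
    for line in lines[1:]:
        match = _PROFILE_RE.match(line)
        if match:
            current = match.group(2)
            profiles[current] = {}
            if match.group(3):
                active = current
            continue
        match = _CLOCK_RE.match(line)
        if match and current is not None:
            values = [int(value) for value in match.group(3).split()]
            profiles[current][match.group(2)] = dict(zip(columns, values))
    return profiles, active
```

Because the column names are taken from the header line rather than hard-coded, the same approach should tolerate added or renamed columns, though other GPU generations may use a different layout entirely.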

Ricks-Lab commented 4 years ago

For the p-state masks, I don't think it is possible to indicate what the current mask is set to. I can only show the default mask. The p-state definition section will display current values of Freq and Voltage, but the mask is a different issue.

Ricks-Lab commented 4 years ago

I don't think there is a reason to believe that Navi cards won't be supported. Perhaps there is a delay in full kernel or driver functionality, but when I tried a Vega20 soon after it was released, most functionality was available.

Ricks-Lab commented 4 years ago

> Not sure of the best approach, but since amdgpu-utils does not include the ability to manage details of the table, maybe a simplified version is best. Perhaps, I could develop an option to display the entire contents of the pp_power_profile_mode device file.

I have pushed a new version that displays the entire contents of ppm after a brief summary. Let me know what you think.

csecht commented 4 years ago

> I have pushed a new version that displays the entire contents of ppm after a brief summary. Let me know what you think.

The new --ppm output format is nice, but repeating the mode table seems redundant (see below). Or are the two tables the two possible output options? I like how the asterisk denotes the current mode. Did you mean to omit the AUTO line from the full table?

Linux2:~/Desktop/amdgpu-utils$ ./amdgpu-ls --ppm

Card Number: 1
   Card Model: Radeon RX 570
   Card: /sys/class/drm/card1/device
   Power Performance Mode: manual
    0:   BOOTUP_DEFAULT
    1:   3D_FULL_SCREEN
    2:     POWER_SAVING
    3:            VIDEO
    4:               VR
    5:          COMPUTE
    6:           CUSTOM
   -1:             AUTO

NUM        MODE_NAME     SCLK_UP_HYST   SCLK_DOWN_HYST SCLK_ACTIVE_LEVEL     MCLK_UP_HYST   MCLK_DOWN_HYST MCLK_ACTIVE_LEVEL
  0   BOOTUP_DEFAULT:        -                -                -                -                -                -
  1   3D_FULL_SCREEN:        0              100               30                0              100               10
  2     POWER_SAVING:       10                0               30                -                -                -
  3            VIDEO:        -                -                -               10               16               31
  4               VR:        0               11               50                0              100               10
  5        COMPUTE *:        0                5               30                0              100               10
  6           CUSTOM:        -                -                -                -                -                -
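The full table above can be turned into a structure that keeps the '-' (hidden) fields distinct from real zeros and records which mode carries the asterisk. The sketch below assumes the layout is exactly as printed in this comment; `parse_legacy_ppm` is a hypothetical helper, not part of amdgpu-utils.

```python
import re

# Table row, e.g. "  5        COMPUTE *:        0     5    30 ..."
_ROW_RE = re.compile(r"^\s*(-?\d+)\s+(\w+)\s*(\*?)\s*:\s*(.*)$")


def parse_legacy_ppm(text):
    """Parse the NUM/MODE_NAME table into ({mode: {column: value}}, active_mode).

    Fields printed as '-' are stored as None.
    """
    columns = []
    modes = {}
    active = None
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["NUM", "MODE_NAME"]:
            columns = fields[2:]
            continue
        match = _ROW_RE.match(line)
        if not match:
            # Summary lines such as "0:   BOOTUP_DEFAULT" are skipped here.
            continue
        name = match.group(2)
        values = [None if value == "-" else int(value)
                  for value in match.group(4).split()]
        modes[name] = dict(zip(columns, values))
        if match.group(3):
            active = name
    return modes, active
```

With the output shown above, the active mode would come back as COMPUTE, and the MCLK fields for POWER_SAVING would be None rather than 0.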
csecht commented 4 years ago

For amdgpu-monitor, because Mem Load % is now listed, perhaps change Load % to GPU Load % in the output? I don't see a need to change the Guide text of the "Using amdgpu-monitor" section, but should I update those Guide graphics with my RX 570 cards or do you want to update them with your Vega cards? (I think that the alternate outputs shown for your Vega cards would be more informative.)

Ricks-Lab commented 4 years ago

My Vega64 cards are down. I only have Quad-Fiji and single Vega20 running. Perhaps your 2 cards are a better example for the community.

I have pushed the format changes you suggested, plus a change for consistent labeling: fine tune print/monitor formats