RainbowMiner / RainbowMiner

GPU/CPU Mining script with intelligent profit-switching between miningpools, algorithms, miners, using all possible combinations of devices (NVIDIA, AMD, CPU). Features: actively maintained, uses the top actual miner programs (Bminer, Ccminer, Claymore, Dstm, EnemyZ, Sgminer, T-rex and more) easy setup wizard, webinterface, auto update.
GNU General Public License v3.0

OCProfiles Do Not Work: Incorrect GPU Device Values Used in startoc.sh #1934

Closed Sheeepthief closed 2 years ago

Sheeepthief commented 2 years ago

Hello,

I am using a RainbowMiner installation on Ubuntu 20.04 LTS with multiple GPUs. I have been unsuccessful in using ocprofiles.config.txt and miners.config.txt to overclock a GTX1060 while mining Trex-Ethash. I have triple verified that I have changed the "OCprofile": value under the Ethash section of "Trex-GTX10606GB": to match the name of my desired ocprofile.

I have narrowed down the issue to the startoc_gpu_ethash01.sh file in /RainbowMiner/Bin/NVIDIA-Trex/. This file contains nvidia-settings -a commands that match the core clock and memory clock values from my ocprofiles.config.txt, but specify the wrong GPU according to what nvidia-settings (or nvidia Xserver?) expects.

The command in the startoc_gpu01_ethash.sh: nvidia-settings -a '[gpu:1]/GPUPowerMizerMode=1' -a '[gpu:1]/GPUGraphicsClockOffset[3]=150' -a '[gpu:1]/GPUMemoryTransferRateOffset[3]=300'

The command that actually works to overclock the GTX1060: nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1' -a '[gpu:0]/GPUGraphicsClockOffset[3]=150' -a '[gpu:0]/GPUMemoryTransferRateOffset[3]=300'
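The two commands above differ only in the GPU index inside the `[gpu:N]` target. As a minimal sketch (not RainbowMiner's actual generator, which is not shown in this thread), the command could be assembled like this. The function name `build_oc_command` and its parameters are hypothetical; the attribute names and the performance level `3` are copied from the commands quoted above.

```python
import shlex


def build_oc_command(gpu_index, core_offset, mem_offset, perf_level=3):
    """Return the nvidia-settings argument list for one GPU.

    gpu_index must be the index that nvidia-settings itself uses,
    which (as this thread shows) can differ from the nvidia-smi index.
    """
    target = f"[gpu:{gpu_index}]"
    return [
        "nvidia-settings",
        "-a", f"{target}/GPUPowerMizerMode=1",
        "-a", f"{target}/GPUGraphicsClockOffset[{perf_level}]={core_offset}",
        "-a", f"{target}/GPUMemoryTransferRateOffset[{perf_level}]={mem_offset}",
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for GPU 0 with the offsets from the issue.
    print(shlex.join(build_oc_command(0, 150, 300)))
```

Keeping the index as an explicit parameter makes it easy to test the command by hand with a different index before changing any configuration.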

I believe RBM obtains the (incorrect) device value from nvidia-smi, which declares the GTX1060 as GPU:1, while nvidia-settings recognizes the GTX1060 as GPU:0.

How can I get RBM/smi to agree on device values/GPU numbers with nvidia-settings?

RainbowMiner commented 2 years ago

Oh, that could be a bit tricky. Could you please run ./gputest.sh and upload the resulting gputestresult.txt file? Also, a Debug file would be pretty neat. Just open http://localhost:4000 and click "Debug file" on the left hand side. It will create a zip file with all sensitive data x-ed out.

Sheeepthief commented 2 years ago

I had to delete 3 (irrelevant) Cpuminer_Jayddee logs which made the debug zip too large to upload in a comment.

debug_2022-01-12.zip gputestresult.txt

RainbowMiner commented 2 years ago

Thank you. It all looks fine: the OpenCL sorting is correctly aligned with the PCIe bus IDs, and the miners are started with the correct GPUs. So we have to look a bit closer at the nvidia-settings tool. Could you please run nvidia-settings --query all and upload the result here? Is it possible that nvidia-settings doesn't detect the P106?

Sheeepthief commented 2 years ago

I put the query output into a .txt for easier perusing. nvidia-settings -q all.txt

Querying the simpler nvidia-settings -q GPUS, and also being able to manually adjust its fan speed and OC at the Nvidia Xserver would indicate that the P106 is detected just fine, as far as I can tell.

Thank you!

RainbowMiner commented 2 years ago

Got it. Yes, both GPUs are detected fine. But the GPUs are not sorted by their PCIe bus ID on your system. It's possible that you added a second GPU to a slot with a lower PCIe bus ID. That would mess up the xorg.conf file.

You will have to edit the xorg.conf file (/etc/X11/xorg.conf) and re-sort the GPUs according to their PCIe bus IDs.

Either edit it directly and change the BusID values:

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:5:0:0"
EndSection
Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:6:0:0"
EndSection

Or try to recreate the xorg.conf file with nvidia-xconfig --enable-all-gpus --separate-x-screens. There is a good possibility that this command re-sorts the GPUs in the xorg.conf.
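When editing BusID values by hand, note that nvidia-smi and lspci print PCI bus IDs in hexadecimal, while xorg.conf expects decimal. The sketch below converts one form to the other and lists GPUs in nvidia-smi order. The helper names are hypothetical, and `SAMPLE_SMI_OUTPUT` is illustrative text reconstructed from the values mentioned in this thread, not captured output. It assumes the input comes from `nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv,noheader`.

```python
def smi_bus_id_to_xorg(bus_id):
    """Convert e.g. '00000000:0A:00.0' (hex) to 'PCI:10:0:0' (decimal)."""
    parts = bus_id.strip().split(":")
    bus_hex = parts[-2]
    device_hex, function_hex = parts[-1].split(".")
    return "PCI:{}:{}:{}".format(
        int(bus_hex, 16), int(device_hex, 16), int(function_hex, 16)
    )


def parse_smi_csv(text):
    """Parse 'index, pci.bus_id, name' CSV lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, bus_id, name = [field.strip() for field in line.split(",", 2)]
        gpus.append(
            {"index": int(index), "bus_id": smi_bus_id_to_xorg(bus_id), "name": name}
        )
    return gpus


# Illustrative sample only (assumption), matching the bus IDs named in the thread.
SAMPLE_SMI_OUTPUT = """\
0, 00000000:05:00.0, NVIDIA P106-090
1, 00000000:06:00.0, NVIDIA GeForce GTX 1060 6GB
"""

if __name__ == "__main__":
    for gpu in parse_smi_csv(SAMPLE_SMI_OUTPUT):
        print(gpu["index"], gpu["bus_id"], gpu["name"])
```

The printed BusID strings can be compared against the Device sections in /etc/X11/xorg.conf to see whether both tools number the GPUs the same way.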

Either way, could you please upload the /etc/X11/xorg.conf file here? To be able to upload it, you might need to add .txt as the extension. I'll then make the changes for you.

Sheeepthief commented 2 years ago

On the motherboard, the GTX1060 populates the first/topmost PCIe x16 slot and the P106 populates the second PCIe x16 slot. I am as confused as you as to why the BusIDs seem to count "bottom up", where the P106 is ID 5 and the 1060 is ID 6.

Here is the original xorg.conf xorg.txt

For fun, I ran nvidia-xconfig --enable-all-gpus --separate-x-screens and found that neither the Device# nor the BusIDs were re-sorted. xorg_resort_attempt.txt

Do you have any intuition as to why Device# wouldn't automatically increase with BusID? Like, how is Device# "chosen"? Is it simply motherboard PCI slot ID weirdness?

RainbowMiner commented 2 years ago

Ok, try this xorg.conf:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 470.86

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    Screen      1  "Screen1" RightOf "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "NVIDIA P106-090"
    BusID          "PCI:5:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "NVIDIA GeForce GTX 1060 6GB"
    BusID          "PCI:6:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "31"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "31"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Here is the file to download: xorg.txt

You might have to reboot your machine, after changing the xorg.conf.
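Before rebooting, it can help to confirm that the Device sections really appear in ascending BusID order. The following is a minimal sketch under stated assumptions: it only handles the plain `PCI:bus:device:function` form used above, treats everything after `#` as a comment, and the function names are hypothetical. `UNSORTED_EXAMPLE` is an assumed reconstruction of a swapped configuration; the original xorg.conf from this thread is not reproduced here.

```python
import re


def device_bus_ids(xorg_text):
    """Return [(identifier, (bus, device, function)), ...] in file order."""
    devices = []
    in_device = False
    ident = bus = None
    for raw in xorg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplification)
        if not line:
            continue
        lower = line.lower()
        if re.match(r'section\s+"device"', lower):
            in_device, ident, bus = True, None, None
        elif lower == "endsection":
            if in_device and ident is not None and bus is not None:
                devices.append((ident, bus))
            in_device = False
        elif in_device:
            m = re.match(r'identifier\s+"([^"]+)"', line, re.IGNORECASE)
            if m:
                ident = m.group(1)
            m = re.match(r'busid\s+"PCI:(\d+):(\d+):(\d+)"', line, re.IGNORECASE)
            if m:
                bus = tuple(int(x) for x in m.groups())
    return devices


def is_sorted_by_bus_id(devices):
    """True if the Device sections appear in ascending BusID order."""
    ids = [bus for _, bus in devices]
    return ids == sorted(ids)


SORTED_EXAMPLE = '''
Section "Device"
    Identifier     "Device0"
    BusID          "PCI:5:0:0"
EndSection
Section "Device"
    Identifier     "Device1"
    BusID          "PCI:6:0:0"
EndSection
'''

# Assumed "before" state: same devices, BusIDs swapped.
UNSORTED_EXAMPLE = SORTED_EXAMPLE.replace("PCI:5", "PCI:X").replace(
    "PCI:6", "PCI:5").replace("PCI:X", "PCI:6")

if __name__ == "__main__":
    print(is_sorted_by_bus_id(device_bus_ids(SORTED_EXAMPLE)))
    print(is_sorted_by_bus_id(device_bus_ids(UNSORTED_EXAMPLE)))
```

To check a real system, read /etc/X11/xorg.conf into a string and pass it to `device_bus_ids`.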

RainbowMiner commented 2 years ago

"Is it simply motherboard PCI slot ID weirdness?"

Yes, most probably.

Sheeepthief commented 2 years ago

Side note: at what point does RBM look at all the configs and create/update the startoc.sh files for each miner? Do I need to fully restart RBM for changes to ocprofiles.config.txt to register?

RainbowMiner commented 2 years ago

The startoc.sh file should be created/updated at each miner start. This means you only have to restart the currently running miner (just kill it).

Sheeepthief commented 2 years ago

So as far as I can tell, swapping the device numbers in xorg.conf fixed the problem entirely. Thank you! I hope anyone who has the same weird problem I did can find this and resolve the problem themselves.