fireice-uk / xmr-stak

Free Monero RandomX Miner and unified CryptoNight miner
GNU General Public License v3.0

XMR-STAK terminates when only one GPU has an error #2054

Closed: kio3i0j9024vkoenio closed this issue 5 years ago

kio3i0j9024vkoenio commented 5 years ago

Basic information

Issue with the execution

Stability issue

The problem: when one GPU gets an error, the XMR-Stak miner terminates, whereas it would be better if XMR-Stak just stopped mining on the problematic GPU. It would also be nice if XMR-Stak reported which index number maps to which PCIe bus number. That way the offending GPU could be identified by index number, and I could log in remotely and remove it from the Nvidia.txt config file.

More details on this problem follow:

After running for many hours, one of the eleven GTX 750 GPUs threw these two errors on my Ubuntu monitor:

[12146.826127] NVRM Xid (PCI:0000:0b:00) : 32, Channel ID 00000009 intr 00008000
[12146.827566] NVRM Xid (PCI:0000:0b:00) : 32, Channel ID 00000009 intr 00008000

After digging, I found that Xid error 32 is "Invalid or corrupted push buffer stream":

https://docs.nvidia.com/deploy/xid-errors/index.html

XID 32: PBDMA Error

This event is logged when a fault is reported by the DMA controller which manages the communication stream between the NVIDIA driver and the GPU over the PCI-E bus. These failures primarily involve quality issues on PCI, and are generally not caused by user application actions.
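For remote monitoring, the Xid lines quoted above can be picked out of the kernel log mechanically. The following is a minimal sketch, assuming the log lines look like the two shown in this report (the regular expression also tolerates a colon after `NVRM`); the names `XID_RE` and `parse_xid_events` are hypothetical and not part of XMR-Stak or the NVIDIA driver.

```python
import re

# Matches kernel log lines such as:
# [12146.826127] NVRM Xid (PCI:0000:0b:00) : 32, Channel ID 00000009 intr 00008000
# Group 1 is the PCI address, group 2 is the Xid code.
XID_RE = re.compile(r"NVRM\s*:?\s*Xid \(PCI:([0-9a-fA-F:.]+)\)\s*:\s*(\d+)")


def parse_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    return [(m.group(1).lower(), int(m.group(2)))
            for m in XID_RE.finditer(log_text)]


# The two lines reported in this issue.
sample = (
    "[12146.826127] NVRM Xid (PCI:0000:0b:00) : 32, "
    "Channel ID 00000009 intr 00008000\n"
    "[12146.827566] NVRM Xid (PCI:0000:0b:00) : 32, "
    "Channel ID 00000009 intr 00008000\n"
)
events = parse_xid_events(sample)
print(events)
```

Both events point at the same PCI address, `0000:0b:00`, which is the device to look for when deciding which GPU or riser to inspect.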

As I understand it, this is a communication error between the PCIe bus and the GPU, which means that the PCIe riser or the GPU is the problem. No biggie, this happens, but it would be nice if XMR-STAK didn't throw CUDA 417 errors on all the other 10 GPUs and terminate.

[CUDA] Error gpu 7: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 3: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 4: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 0: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 5: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 2: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 9: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 1: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 8: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 6: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 10: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
Aborted (core dumped)
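Until the miner reports the index-to-bus mapping itself, one possible workaround is to build it from `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`. The sketch below parses that CSV output and normalizes PCI addresses so they can be compared with the address in an Xid log line. The helper names and the sample rows are hypothetical. It also rests on an assumption: the `nvidia-smi` index only matches the index used by a CUDA application when CUDA enumerates devices in PCI bus order (for example with `CUDA_DEVICE_ORDER=PCI_BUS_ID` set), so the result should be verified on the actual rig.

```python
import subprocess


def normalize_pci(addr):
    """Reduce a PCI address to 'dddd:bb:dd' (4-digit domain, bus, device)."""
    domain, bus, dev = addr.strip().lower().split(":")
    dev = dev.split(".")[0]  # drop the function suffix such as '.0'
    return f"{domain[-4:]}:{bus}:{dev}"


def index_by_pci(csv_text):
    """Map normalized PCI address -> GPU index from 'index, bus_id' rows."""
    mapping = {}
    for line in csv_text.strip().splitlines():
        index, bus_id = [field.strip() for field in line.split(",")]
        mapping[normalize_pci(bus_id)] = int(index)
    return mapping


def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output (requires the NVIDIA driver)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
        text=True,
    )


# Hypothetical sample rows in the shape nvidia-smi prints; not taken from the rig.
sample_csv = "0, 00000000:01:00.0\n1, 00000000:0B:00.0\n"
mapping = index_by_pci(sample_csv)
faulty_index = mapping.get(normalize_pci("0000:0b:00"))
print(faulty_index)
```

On a real machine, `index_by_pci(query_nvidia_smi())` would replace the sample rows.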

Spudz76 commented 5 years ago

That's like making a car that can teleport past collisions. Would be super excellent solution, but can't work.

Once the PCIe bus is confused, almost anything can happen. Also, the driver doesn't support a "driver instance per GPU", so the PCIe error likely crashed the other GPUs' shared memory in the driver as well, and there is nothing a user app could do about it. Essentially, CUDA was crashed by the one bad apple, which spoiled everything. The driver can't "sever" a bad GPU from the rest, especially at runtime; it is just not designed like that (hotplug/failout).

kio3i0j9024vkoenio commented 5 years ago

Okay. Thanks for the explanation.

psychocrypt commented 5 years ago

We are working on a safe shutdown. If the miner closes or, as it currently does, crashes, it should be easy to write a bash script around it that removes the line from the config and restarts the miner automatically.
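The wrapper idea above could look roughly like the following. This is a sketch in Python rather than bash, and it assumes that each GPU entry in the config sits on its own line containing `"index" : N`; that layout, the sample config text, and the function names `drop_gpu_entry` and `run_with_restarts` are assumptions for illustration, not part of XMR-Stak.

```python
import re
import subprocess


def drop_gpu_entry(config_text, gpu_index):
    """Remove every line that declares `"index" : gpu_index`."""
    # \b keeps index 1 from also matching 10, 11, and so on.
    pattern = re.compile(r'"index"\s*:\s*%d\b' % gpu_index)
    kept = [line for line in config_text.splitlines(keepends=True)
            if not pattern.search(line)]
    return "".join(kept)


def run_with_restarts(command, max_restarts=5):
    """Run command; after a non-zero exit, start it again up to max_restarts times.

    Returns the exit status of the last run.
    """
    status = None
    for _ in range(max_restarts + 1):
        status = subprocess.run(command).returncode
        if status == 0:
            break
    return status


# Hypothetical config fragment with one GPU entry per line.
sample = (
    '"gpu_threads_conf" :\n'
    '[\n'
    '  { "index" : 1, "threads" : 8, "blocks" : 16 },\n'
    '  { "index" : 10, "threads" : 8, "blocks" : 16 },\n'
    '],\n'
)
cleaned = drop_gpu_entry(sample, 1)
print(cleaned)
```

A wrapper would write `cleaned` back to the config file before calling `run_with_restarts` with the miner's command line. Deciding which index to drop still requires knowing which GPU failed, for example from the Xid line in the kernel log.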