Closed kio3i0j9024vkoenio closed 5 years ago
That's like making a car that can teleport past collisions. It would be an excellent solution, but it can't work.
Once the PCIe bus is confused, almost anything can happen. Also, the driver doesn't support a "driver instance per GPU" model, so the PCIe error likely crashed the other GPUs' shared memory inside the driver as well, and there is nothing a user application could do about it. Essentially, CUDA was crashed by the one bad apple that spoiled everything. The driver can't "sever" a bad GPU from the rest, especially at runtime; it just isn't designed for that (hotplug/failout).
Okay. Thanks for the explanation.
We are working on a safe shutdown. When the miner closes (or, as it currently does, crashes), it should be easy to write a bash script around it that removes the line from the config and restarts the miner automatically.
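A minimal sketch of the "remove the line from the config" step of such a wrapper, written in Python rather than bash. It assumes each GPU entry in the NVIDIA config file is a flat `{ ... }` block with an `"index" : N` field and no nested braces. The name `drop_gpu_entry` and the sample text are hypothetical, and the restart step is left out.

```python
import re

# One flat "{ ... }" block plus an optional trailing comma and newline.
# Assumption: GPU entries contain no nested braces.
_ENTRY = re.compile(r'[ \t]*\{[^{}]*\}[ \t]*,?[ \t]*\n?')
_INDEX = re.compile(r'"index"\s*:\s*(\d+)')


def drop_gpu_entry(config_text, gpu_index):
    """Return config_text without the entry whose "index" equals gpu_index."""
    def keep_or_drop(match):
        found = _INDEX.search(match.group(0))
        if found and int(found.group(1)) == gpu_index:
            return ""              # drop the failing GPU's entry
        return match.group(0)      # leave every other block untouched
    return _ENTRY.sub(keep_or_drop, config_text)


# Hypothetical, shortened config text used only for illustration.
sample = '''"gpu_threads_conf" :
[
  { "index" : 0, "threads" : 32, "blocks" : 15 },
  { "index" : 1, "threads" : 32, "blocks" : 15 },
  { "index" : 10, "threads" : 32, "blocks" : 15 },
],
'''
trimmed = drop_gpu_entry(sample, 1)
print(trimmed)
```

A wrapper would write `trimmed` back to the config file before starting the miner again. Knowing which index to drop is a separate problem, since the crash log itself lists every GPU.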
Basic information
Issue with the execution
Did you compile the miner on your own? Yes
run
./xmr-stak --version-long
and add the output here Version: xmr-stak/2.5.2/752fd1e/master/lin/nvidia-cpu/0
Stability issue
The problem: when one GPU gets an error, the XMR-Stak miner terminates, whereas it would be better if XMR-Stak just stopped mining on the problematic GPU. It would also be nice if XMR-Stak reported which index number maps to which PCIe bus number. That way the offending GPU could be identified by index number, and I could log in remotely and remove it from the nvidia.txt config file.
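Until the miner reports this itself, one way to get an index-to-bus mapping is `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`. The sketch below parses that CSV output into a lookup table. The sample output is made up for illustration. The indices nvidia-smi reports only line up with CUDA device numbering when `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set, and whether they match XMR-Stak's own indices is an assumption that needs checking.

```python
def normalize_bus_id(bus_id):
    """Turn 'domain:bus:device[.function]' into a (domain, bus, device) tuple of ints."""
    domain, bus, device = bus_id.strip().split(":")
    device = device.split(".")[0]      # drop the '.0' function suffix if present
    return (int(domain, 16), int(bus, 16), int(device, 16))


def parse_gpu_map(csv_text):
    """Map normalized PCI bus IDs to the GPU index reported by nvidia-smi."""
    mapping = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue                   # skip blank lines
        index, bus_id = (field.strip() for field in line.split(","))
        mapping[normalize_bus_id(bus_id)] = int(index)
    return mapping


# Hypothetical output of:
#   nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
sample = """0, 00000000:01:00.0
1, 00000000:0B:00.0
"""
gpu_map = parse_gpu_map(sample)

# The kernel log names the device in a shorter form, e.g. 'PCI:0000:0b:00'.
failing_index = gpu_map[normalize_bus_id("0000:0b:00")]
print(failing_index)
```

Normalizing both spellings to integers avoids mismatches from letter case and from the different domain widths (8 hex digits in nvidia-smi, 4 in the kernel log).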
More details on this problem follow:
After running for many hours, one of the eleven GTX 750 GPUs threw these two errors on my Ubuntu monitor:
[12146.826127] NVRM Xid (PCI:0000:0b:00) : 32, Channel ID 00000009 intr 00008000
[12146.827566] NVRM Xid (PCI:0000:0b:00) : 32, Channel ID 00000009 intr 00008000
After digging I found that Xid error 32 is "Invalid or corrupted push buffer stream"
https://docs.nvidia.com/deploy/xid-errors/index.html
XID 32: PBDMA Error
This event is logged when a fault is reported by the DMA controller which manages the communication stream between the NVIDIA driver and the GPU over the PCI-E bus. These failures primarily involve quality issues on PCI, and are generally not caused by user application actions.
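For scripted monitoring, Xid lines like the two quoted earlier can be picked out of the kernel log with a regular expression. This sketch assumes the line layout seen in this report (it also tolerates a colon after `NVRM`). `find_xid_events` is a hypothetical helper, not part of XMR-Stak or the NVIDIA driver.

```python
import re

# Captures the PCI address and the Xid number from an NVRM kernel log line.
XID_LINE = re.compile(r'NVRM:?\s+Xid\s+\(PCI:([0-9A-Fa-f:.]+)\)\s*:\s*(\d+)')


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_number) pairs found in log_text."""
    return [(m.group(1).lower(), int(m.group(2)))
            for m in XID_LINE.finditer(log_text)]


# The two lines from this report.
log = (
    "[12146.826127] NVRM Xid (PCI:0000:0b:00) : 32, Channel ID 00000009 intr 00008000\n"
    "[12146.827566] NVRM Xid (PCI:0000:0b:00) : 32, Channel ID 00000009 intr 00008000\n"
)
events = find_xid_events(log)
print(sorted(set(events)))
```

The PCI address extracted this way identifies the failing device independently of the miner's crash output, which lists every GPU.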
As I understand it, this is a communication error between the PCIe bus and the GPU, which means that the PCIe riser or the GPU is the problem. No biggie, this happens, but it would be nice if XMR-Stak didn't throw CUDA errors (cuda_extra.cu, line 417) on all the other 10 GPUs and terminate.
[CUDA] Error gpu 7: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 3: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 4: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 0: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 5: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 2: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 9: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 1: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 8: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 6: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
[CUDA] Error gpu 10: </home/miner/xmr-stak/xmrstak/backend/nvidia/nvcc_code/cuda_extra.cu>:417
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
Aborted (core dumped)