Funkomancer opened this issue 6 years ago
Is there a possibility the issue lies with your motherboard? Have you tried bitcoin mining or other workloads with all 12 GPUs? Some motherboards have multiple PCIe slots but only support up to a certain number of GPUs (usually 6 to 9).
Another issue might be that your CPU has only 16 PCIe lanes available for GPUs. If the first and second PCIe slots run at x4, you may only be able to assign 8 more GPUs to your motherboard, for a maximum of 10 GPUs. Or, if the first slot runs at x8 and the second at x4, you can add only 4 more, plus 4 more x1 slots through an M.2 slot.
I have the same issue on a 16 GPU machine... still no fix?
[0] Tesla V100-SXM3-32GB | 50'C, 87 % | 354 / 32510 MB | fahclient(339M)
[1] Tesla V100-SXM3-32GB | 51'C, 98 % | 2376 / 32510 MB | fahclient(323M) fahclient(339M) fahclient(341M) fahclient(339M) fahclient(341M) fahclient(339M) fahclient(339M)
[2] Tesla V100-SXM3-32GB | 64'C, 87 % | 354 / 32510 MB | fahclient(339M)
[3] Tesla V100-SXM3-32GB | 66'C, 88 % | 354 / 32510 MB | fahclient(339M)
[4] Tesla V100-SXM3-32GB | 46'C, 88 % | 354 / 32510 MB | fahclient(339M)
[5] Tesla V100-SXM3-32GB | 62'C, 87 % | 354 / 32510 MB | fahclient(339M)
[6] Tesla V100-SXM3-32GB | 49'C, 90 % | 356 / 32510 MB | fahclient(341M)
[7] Tesla V100-SXM3-32GB | 68'C, 90 % | 356 / 32510 MB | fahclient(341M)
[8] Tesla V100-SXM3-32GB | 50'C, 88 % | 354 / 32510 MB | fahclient(339M)
[9] Tesla V100-SXM3-32GB | 51'C, 93 % | 354 / 32510 MB | fahclient(339M)
[10] Tesla V100-SXM3-32GB | 47'C, 0 % | 13 / 32510 MB |
[11] Tesla V100-SXM3-32GB | 48'C, 0 % | 13 / 32510 MB |
[12] Tesla V100-SXM3-32GB | 32'C, 0 % | 13 / 32510 MB |
[13] Tesla V100-SXM3-32GB | 30'C, 0 % | 13 / 32510 MB |
[14] Tesla V100-SXM3-32GB | 34'C, 0 % | 13 / 32510 MB |
[15] Tesla V100-SXM3-32GB | 35'C, 0 % | 13 / 32510 MB |
Also reported at https://foldingforum.org/viewtopic.php?p=316900#p316900
I can confirm the issue. I have been running a 17 GPU configuration on a dual Xeon server motherboard for 2 years. No issues with 3D rendering (e.g. Octane Render) or GPU mining, so no motherboard or driver issues.
If I configure config.xml to use gpu-index 0-9, the slots get assigned correctly. gpu-index 10-16 all get assigned to GPU 1.
It is very likely that gpu-index only uses the first digit: 10, 11, 12, 13... all result in index 1. This is my personal assumption, without any deeper knowledge of the code.
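The first-digit hypothesis is easy to illustrate. Here is a minimal Python sketch (purely hypothetical; I have not looked at the actual FAHClient code, and the function names are my own) of how a parser that reads only the leading character would collapse indices 10-16 onto GPU 1:

```python
def parse_gpu_index_suspected(value: str) -> int:
    # Suspected buggy behavior: only the first character of the
    # configured index is consumed, so "10".."16" all become 1.
    return int(value[0])

def parse_gpu_index_expected(value: str) -> int:
    # Expected behavior: parse the whole string.
    return int(value)

for v in ["9", "10", "11", "12"]:
    print(v, "->", parse_gpu_index_suspected(v), "vs", parse_gpu_index_expected(v))
```

This reproduces exactly the pattern seen above: indices 0-9 work, and every two-digit index lands on the GPU matching its first digit.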
This might be a rather simple fix, and it would unlock some additional folding power given the trend toward larger systems (e.g. an Nvidia DGX server with 16 GPUs).
Try using a containerized version and running one client per GPU. Assign a single device to each container; FAHClient won't know there is more than one GPU.
Presumed defect on GPU count >9
I'm attempting to fold on a rig containing 13 GPUs (gpu/cuda/opencl indices 0-12 in config.xml) and have noticed an issue. WUs assigned to GPUs indexed 1, 10, 11, and 12 all appear to actually run on the GPU with index 1. My guess is that at some point these indices are either parsed incorrectly or truncated, so that only the first digit is kept. This issue exists on both Linux and Windows. Hopefully this output from nvidia-smi on Linux helps illustrate the issue:
As you can see, by the process list and utilization, there are four processes running on GPU 1 and none on 10-12. Here is the slot configuration from my config.xml file:
I've also tried manually assigning GPUs to slots with 'gpu-index' tags and rearranging the indices so that different GPUs ended up in different slots. The results are the same: WUs assigned to GPUs indexed 1 and 10-12 all run on GPU 1.
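For clarity, the kind of slot configuration I mean looks roughly like this (a hypothetical fragment; the slot ids and index values are illustrative, not copied from my actual file):

```xml
<!-- Hypothetical config.xml fragment; ids/indices are illustrative -->
<slot id="1" type="GPU">
  <gpu-index v="1"/>
</slot>
<slot id="10" type="GPU">
  <gpu-index v="10"/>
</slot>
```

With slots like these, the slot given gpu-index 10 still ends up running on GPU 1.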
If further details or testing are required, please let me know. Thanks.
For reference, I was directed here from https://foldingforum.org/viewtopic.php?f=106&t=30913