FoldingAtHome / fah-issues


Trouble with Indices and >10 GPUs #1245

Open Funkomancer opened 6 years ago

Funkomancer commented 6 years ago

I'm attempting to fold on a rig containing 13 GPUs (gpu/cuda/opencl indices 0-12 in config.xml) and have noticed an issue: WUs assigned to the GPUs indexed 1, 10, 11, and 12 all end up actually running on the GPU at index 1. My guess is that at some point these indices are parsed incorrectly or truncated so that only the first digit is kept. The issue exists on both Linux and Windows. Hopefully this output from nvidia-smi on Linux helps illustrate it:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67                 Driver Version: 390.67                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:06:00.0 Off |                  N/A |
| 82%   82C    P2    88W / 151W |    123MiB /  8119MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:07:00.0 Off |                  N/A |
| 69%   77C    P2   124W / 151W |    440MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:08:00.0 Off |                  N/A |
| 65%   76C    P2   145W / 151W |    123MiB /  8119MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1070    Off  | 00000000:09:00.0 Off |                  N/A |
| 82%   82C    P2   109W / 151W |    131MiB /  8119MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1070    Off  | 00000000:0A:00.0 Off |                  N/A |
| 82%   82C    P2   117W / 151W |    131MiB /  8119MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 1070    Off  | 00000000:0D:00.0 Off |                  N/A |
| 82%   82C    P2   152W / 151W |    123MiB /  8119MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 1070    Off  | 00000000:0E:00.0 Off |                  N/A |
| 82%   82C    P2    90W / 151W |    131MiB /  8119MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 1070    Off  | 00000000:0F:00.0 Off |                  N/A |
| 74%   78C    P2   150W / 151W |    131MiB /  8119MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   8  GeForce GTX 1070    Off  | 00000000:10:00.0 Off |                  N/A |
| 82%   82C    P2   123W / 151W |    131MiB /  8119MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   9  GeForce GTX 1070    Off  | 00000000:11:00.0 Off |                  N/A |
| 82%   82C    P2    84W / 151W |    113MiB /  8119MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|  10  GeForce GTX 1070    Off  | 00000000:12:00.0 Off |                  N/A |
|  0%   44C    P8     9W / 151W |     10MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  GeForce GTX 1070    Off  | 00000000:13:00.0 Off |                  N/A |
|  0%   43C    P8     9W / 151W |     10MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  GeForce GTX 1070    Off  | 00000000:14:00.0 Off |                  N/A |
|  0%   42C    P8     9W / 151W |     10MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3053      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   113MiB |
|    1      3076      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   121MiB |
|    1      3104      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   103MiB |
|    1      3126      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   103MiB |
|    1      3418      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   103MiB |
|    2      3450      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   113MiB |
|    3      3090      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   121MiB |
|    4      3060      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   121MiB |
|    5      3111      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   113MiB |
|    6      3097      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   121MiB |
|    7      3083      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   121MiB |
|    8      3069      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   121MiB |
|    9      3118      C   ...D64/NVIDIA/Fermi/Core_21.fah/FahCore_21   103MiB |
+-----------------------------------------------------------------------------+

As the process list and utilization figures show, there are four processes running on GPU 1 and none on GPUs 10-12. Here is the slot configuration from my config.xml file:

  <!-- Folding Slots -->
  <slot id='0' type='GPU'/>
  <slot id='1' type='GPU'/>
  <slot id='2' type='GPU'/>
  <slot id='3' type='GPU'/>
  <slot id='4' type='GPU'/>
  <slot id='5' type='GPU'/>
  <slot id='6' type='GPU'/>
  <slot id='7' type='GPU'/>
  <slot id='8' type='GPU'/>
  <slot id='9' type='GPU'/>
  <slot id='10' type='GPU'/>
  <slot id='11' type='GPU'/>
  <slot id='12' type='GPU'/>

I've also tried manually assigning GPUs to slots with 'gpu-index' tags and rearranging the indices so that different GPUs ended up in different slots. The results are the same: WUs assigned to GPUs indexed 1 and 10-12 all run on GPU 1.
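For readers unfamiliar with the manual mapping being described, a sketch of what such a configuration might look like follows. The exact gpu-index syntax is an assumption here (it is modeled on how other FAHClient slot options are written); consult the FAHClient configuration documentation for the authoritative form:

```xml
  <!-- Hypothetical sketch: pinning slots to specific GPUs with gpu-index.
       The v='...' attribute form is assumed, not confirmed from this thread. -->
  <slot id='10' type='GPU'>
    <gpu-index v='12'/>
  </slot>
  <slot id='11' type='GPU'>
    <gpu-index v='10'/>
  </slot>
```

Per the report above, rearranging the mapping this way did not change which physical GPU the work units landed on.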

If further details or testing are required, please let me know. Thanks.

For reference, I was directed here from https://foldingforum.org/viewtopic.php?f=106&t=30913

ProDigit commented 5 years ago

Is there a possibility the issue lies with your motherboard? Have you tried bitcoin mining or other workloads with all of the GPUs? Some motherboards have many PCIe slots but only support up to a certain number of GPUs (usually 6 to 9).

Another issue might be that your CPU has only 16 PCIe lanes available for GPUs. If the first and second PCIe slots run at x4, you may only be able to attach 8 more GPUs, for a maximum of 10. Or, if the first slot runs at x8 and the second at x4, you can add only four more, plus four more x1 connections through an M.2 adapter.

hra0031 commented 4 years ago

I have the same issue on a 16-GPU machine... still no fix?

[0] Tesla V100-SXM3-32GB | 50'C,  87 % |   354 / 32510 MB | fahclient(339M)
[1] Tesla V100-SXM3-32GB | 51'C,  98 % |  2376 / 32510 MB | fahclient(323M) fahclient(339M) fahclient(341M) fahclient(339M) fahclient(341M) fahclient(339M) fahclient(339M)
[2] Tesla V100-SXM3-32GB | 64'C,  87 % |   354 / 32510 MB | fahclient(339M)
[3] Tesla V100-SXM3-32GB | 66'C,  88 % |   354 / 32510 MB | fahclient(339M)
[4] Tesla V100-SXM3-32GB | 46'C,  88 % |   354 / 32510 MB | fahclient(339M)
[5] Tesla V100-SXM3-32GB | 62'C,  87 % |   354 / 32510 MB | fahclient(339M)
[6] Tesla V100-SXM3-32GB | 49'C,  90 % |   356 / 32510 MB | fahclient(341M)
[7] Tesla V100-SXM3-32GB | 68'C,  90 % |   356 / 32510 MB | fahclient(341M)
[8] Tesla V100-SXM3-32GB | 50'C,  88 % |   354 / 32510 MB | fahclient(339M)
[9] Tesla V100-SXM3-32GB | 51'C,  93 % |   354 / 32510 MB | fahclient(339M)
[10] Tesla V100-SXM3-32GB | 47'C,   0 % |    13 / 32510 MB |
[11] Tesla V100-SXM3-32GB | 48'C,   0 % |    13 / 32510 MB |
[12] Tesla V100-SXM3-32GB | 32'C,   0 % |    13 / 32510 MB |
[13] Tesla V100-SXM3-32GB | 30'C,   0 % |    13 / 32510 MB |
[14] Tesla V100-SXM3-32GB | 34'C,   0 % |    13 / 32510 MB |
[15] Tesla V100-SXM3-32GB | 35'C,   0 % |    13 / 32510 MB |

anand-bhat commented 4 years ago

Also reported at https://foldingforum.org/viewtopic.php?p=316900#p316900

macdaffy commented 4 years ago

I can confirm the issue. I have been running a 17-GPU configuration on a dual-Xeon server motherboard for two years. No issues with 3D rendering (e.g. Octane Render) or GPU mining, and no motherboard or driver issues.

If I configure config.xml to use gpu-index 0-9, the GPUs get assigned correctly. GPUs with gpu-index 10-16 all get assigned to GPU 1.

It is very likely that gpu-index only uses the first digit: 10, 11, 12, 13, etc. all result in index 1. This is my personal assumption, without any deeper knowledge of the code.

This might be a rather simple fix, and it would unlock additional folding power given the trend toward larger systems (e.g. an NVIDIA DGX server with 16 GPUs).
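The first-digit hypothesis above is easy to illustrate. The following is a minimal sketch, not the actual FAHClient code (which is C++), showing how reading only the first character of an index string would reproduce exactly the behavior reported in this thread:

```python
def parse_index_truncated(s: str) -> int:
    """Buggy parse: converts only the first character of the index string."""
    return int(s[0])

def parse_index_correct(s: str) -> int:
    """Correct parse: converts the whole string."""
    return int(s)

# Indices 0-9 parse identically either way, but 10, 11, and 12 all
# collapse onto index 1 under the truncated parse -- matching the
# nvidia-smi output above, where GPU 1 runs four work units.
for raw in ["1", "9", "10", "11", "12"]:
    print(raw, "->", parse_index_truncated(raw))
```

Whether the real client truncates during parsing, storage, or string formatting cannot be determined from this thread; the sketch only shows that a single-character read is consistent with every observation reported here.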

dmc5179 commented 4 years ago

Try using a containerized version and running one FAHClient per GPU. Assign one device to each container; FAHClient won't know there's more than one.
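A sketch of that workaround, assuming Docker with the NVIDIA Container Toolkit (which provides the `--gpus` flag); the image name is a placeholder, not a real published image:

```python
# Generate one `docker run` command per physical GPU, so each FAHClient
# instance sees exactly one device (which appears as index 0 inside its
# container, sidestepping any multi-digit index handling in the client).
GPU_COUNT = 13  # the rig from the original report

def fah_container_cmd(gpu: int) -> str:
    return (
        f"docker run -d --name fah{gpu} "
        f"--gpus '\"device={gpu}\"' "   # NVIDIA Container Toolkit syntax
        "fah-client-image"              # placeholder image name
    )

for gpu in range(GPU_COUNT):
    print(fah_container_cmd(gpu))
```

Each container would also need its own persistent work directory and client configuration; those details are omitted here.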

shorttack commented 4 years ago

Presumed defect on GPU count >9