FoldingAtHome / fah-issues


Core doesn't pull in two-digit GPU indices (Linux) #1337

Closed mark-liqid closed 4 years ago

mark-liqid commented 4 years ago

On systems with a high GPU count, the FAH core seems to ignore the second digit of a two-digit GPU index. Each time a GPU with an index greater than 10 is assigned a WU, FAH spins that WU up on GPU 1.

ie:
16:29:07:WU02:FS02:Starting
16:29:07:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 705 -lifeline 10853 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 1 -cuda-device 1 -gpu 1
16:29:07:WU02:FS02:Started FahCore on PID 14491
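To compare the device the client requested against where the work actually landed, it can help to pull the device indices out of the "Running FahCore" log line. The following is a minimal Python sketch and not part of FAH: the helper name `device_args` and the regular expression are illustrative assumptions, and only the device flags visible in the command line above are handled. The wrapper and core paths are elided in the sample because they do not matter for the parse.

```python
import re

# Flags that carry a device index in the FahCore command line quoted above.
DEVICE_FLAGS = ("-opencl-platform", "-opencl-device", "-cuda-device", "-gpu")


def device_args(log_line: str) -> dict:
    """Return {flag: index} for each device flag found in a log line."""
    found = {}
    for flag in DEVICE_FLAGS:
        # Require the flag to be a whole token followed by a whole number,
        # so "-gpu" does not match inside "-gpu-vendor".
        pattern = r"(?<!\S)" + re.escape(flag) + r"\s+(\d+)(?!\S)"
        match = re.search(pattern, log_line)
        if match:
            found[flag] = int(match.group(1))
    return found


line = (
    "16:29:07:WU02:FS02:Running FahCore: ... -gpu-vendor nvidia "
    "-opencl-platform 0 -opencl-device 1 -cuda-device 1 -gpu 1"
)
print(device_args(line))
```

If the values printed here already show 1 for a slot that is configured for a two-digit GPU, the index was lost before the core was launched; if they show the full index, the loss would have to happen later.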

nvidia-smi

Wed Mar 25 10:38:27 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:87:00.0 Off |                  Off |
| 33%   37C    P8    10W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:88:00.0 Off |                  Off |
| 53%   74C    P2   255W / 260W |   1937MiB / 48601MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     On   | 00000000:89:00.0 Off |                  Off |
| 51%   73C    P2   265W / 260W |    343MiB / 48601MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 8000     On   | 00000000:8A:00.0 Off |                  Off |
| 33%   33C    P8    12W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 8000     On   | 00000000:8B:00.0 Off |                  Off |
| 33%   35C    P8    11W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Quadro RTX 8000     On   | 00000000:8C:00.0 Off |                  Off |
| 57%   77C    P2   258W / 260W |    237MiB / 48601MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Quadro RTX 8000     On   | 00000000:8D:00.0 Off |                  Off |
| 33%   36C    P8     7W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Quadro RTX 8000     On   | 00000000:8E:00.0 Off |                  Off |
| 33%   35C    P8    18W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Quadro RTX 8000     On   | 00000000:8F:00.0 Off |                  Off |
| 56%   76C    P2   258W / 260W |    335MiB / 48601MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Quadro RTX 8000     On   | 00000000:90:00.0 Off |                  Off |
| 52%   73C    P2   259W / 260W |    285MiB / 48601MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Quadro RTX 8000     On   | 00000000:C6:00.0 Off |                  Off |
| 33%   34C    P8     9W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Quadro RTX 8000     On   | 00000000:C7:00.0 Off |                  Off |
| 33%   34C    P8    13W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Quadro RTX 8000     On   | 00000000:C8:00.0 Off |                  Off |
| 33%   34C    P8     7W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  Quadro RTX 8000     On   | 00000000:C9:00.0 Off |                  Off |
| 33%   35C    P8    23W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  Quadro RTX 8000     On   | 00000000:CA:00.0 Off |                  Off |
| 33%   35C    P8    12W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  Quadro RTX 8000     On   | 00000000:CB:00.0 Off |                  Off |
| 33%   33C    P8    16W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  16  Quadro RTX 8000     On   | 00000000:CC:00.0 Off |                  Off |
| 33%   35C    P8    19W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  17  Quadro RTX 8000     On   | 00000000:CD:00.0 Off |                  Off |
| 33%   32C    P8     7W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  18  Quadro RTX 8000     On   | 00000000:CE:00.0 Off |                  Off |
| 33%   34C    P8     7W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  19  Quadro RTX 8000     On   | 00000000:CF:00.0 Off |                  Off |
| 33%   33C    P8     9W / 260W |     12MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     10872      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   225MiB |
|    1     11286      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   331MiB |
|    1     12087      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   323MiB |
|    1     12888      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   273MiB |
|    1     13296      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   225MiB |
|    1     14495      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   225MiB |
|    1     14940      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   323MiB |
|    2     14090      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   331MiB |
|    5     12487      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   225MiB |
|    8     13289      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   323MiB |
|    9     11687      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   273MiB |
+-----------------------------------------------------------------------------+
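The pile-up on GPU 1 is easier to see once the process table is tallied per GPU. Below is a small Python sketch that counts process rows per GPU index in pasted nvidia-smi text. The names `ROW` and `processes_per_gpu` are illustrative, and the regular expression assumes the row layout of the 440-series output shown above; other driver versions may print the table differently.

```python
import re
from collections import Counter

# One data row of the "Processes" table: | GPU  PID  Type  name  <n>MiB |
ROW = re.compile(r"\|\s*(\d+)\s+(\d+)\s+([CG+]+)\s+(\S+)\s+(\d+)MiB\s*\|")


def processes_per_gpu(text: str) -> Counter:
    """Count process rows per GPU index in pasted nvidia-smi output."""
    return Counter(int(m.group(1)) for m in ROW.finditer(text))


# Process rows copied from the report above.
sample = """
|    1     10872      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   225MiB |
|    1     11286      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   331MiB |
|    1     12087      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   323MiB |
|    1     12888      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   273MiB |
|    1     13296      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   225MiB |
|    1     14495      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   225MiB |
|    1     14940      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   323MiB |
|    2     14090      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   331MiB |
|    5     12487      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   225MiB |
|    8     13289      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   323MiB |
|    9     11687      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   273MiB |
"""

counts = processes_per_gpu(sample)
print(sorted(counts.items()))  # [(1, 7), (2, 1), (5, 1), (8, 1), (9, 1)]
```

Seven of the eleven FahCore_22 processes sit on GPU 1, and no process appears on any GPU numbered 10 to 19, which matches the behaviour described in the report.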

mark-liqid commented 4 years ago

Looks like I copy pasta'd the wrong log entry. When the core calls for GPU 15, the work gets assigned to GPU 1.
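The cause is not identified in this thread. Purely as an illustration of how the reported symptom (GPU 15 ending up on GPU 1) could arise, the Python sketch below contrasts a hypothetical parse that reads only the first character of the index with one that reads the whole number. Neither function is taken from FAH source code; `first_digit_only` and `full_index` are made-up names for this example.

```python
def first_digit_only(index_text: str) -> int:
    """Hypothetical buggy parse: keeps one character, drops later digits."""
    return int(index_text[0])


def full_index(index_text: str) -> int:
    """Parse the whole decimal index."""
    return int(index_text)


# On a 20-GPU system, list the indices that the single-character parse
# would send to the wrong device.
misrouted = [
    (str(i), first_digit_only(str(i)))
    for i in range(20)
    if first_digit_only(str(i)) != full_index(str(i))
]
print(misrouted)
```

Under this assumption every index from 10 to 19 collapses onto GPU 1, while indices 0 to 9 are unaffected, which is consistent with the observation above but does not prove that this is what the core does.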

mark-liqid commented 4 years ago

There is another bug report matching this.