Closed by jkosciessa, 4 weeks ago
This may have something to do with the change to CUDA compute capability 8 (cudacap 8). Switching back to cudacap 5 runs jobs without CUDA errors, and may also fix the occurrence of NaNs during acoustic simulations (https://github.com/Donders-Institute/PRESTUS/issues/48).
By chance, the following cudacap-8 GPU gave me errors even during water simulations. Perhaps it is broken? I don't know why DeviceAvailable=False, given that the GPU was part of the job.
Now regularly getting this when trying to run with qsub (mentat004):
Error using gpuDevice
Failed to initialize graphics driver for computation. The CUDA error was:
CUDA_ERROR_UNKNOWN
Error in single_subject_pipeline (line 101)
gpuDevice()
Error in tpd34a1b53_42b4_4c29_86f6_ec4a8168db94 (line 1)
load /project/2423053.01/amythal_sim/amythal_sim/data/tussim/CTX500-026-010_79.6mm_60W/sub-001/batch_job_logs/tpc9087f44_187b_4137_8db4_4e91b438656c.mat; cd /project/2423053.01/amythal_sim/amythal_sim/tools/PRESTUS; single_subject_pipeline(subject_id, parameters); delete /project/2423053.01/amythal_sim/amythal_sim/data/tussim/CTX500-026-010_79.6mm_60W/sub-001/batch_job_logs/tpc9087f44_187b_4137_8db4_4e91b438656c.mat; delete /project/2423053.01/amythal_sim/amythal_sim/data/tussim/CTX500-026-010_79.6mm_60W/sub-001/batch_job_logs/tpd34a1b53_42b4_4c29_86f6_ec4a8168db94.m;
SLURM (mentat005) works fine, but apparently restricts users to two concurrent GPU jobs.
On mentat002 and mentat004, in an interactive job with min. cudacap >=5.0 and MATLAB R2024a (default):
Error using gpuDevice (line 26)
Graphics driver is out of date. Download and install the latest graphics driver for your GPU from NVIDIA.
On mentat002, in an interactive job with min. cudacap >=8.0 and MATLAB R2022b (PRESTUS recommendation):
qsub -I -l 'nodes=1:gpus=1,feature=cuda,walltime=05:00:00,mem=24gb,reqattr=cudacap>=8.0'
Error using gpuDevice
Failed to initialize graphics driver for computation. The CUDA error was: CUDA_ERROR_UNKNOWN
On mentat004 I ran gpuDevice in MATLAB R2023b repeatedly with success:
CUDADevice with properties:
Name: 'Tesla P100-PCIE-16GB'
Index: 1
ComputeCapability: '6.0'
SupportsDouble: 1
GraphicsDriverVersion: '460.73.01'
DriverModel: 'N/A'
ToolkitVersion: 11.8000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152 (49.15 KB)
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 17071734784 (17.07 GB)
AvailableMemory: 16663707648 (16.66 GB)
CachePolicy: 'maximum'
MultiprocessorCount: 56
ClockRateKHz: 1328500
ComputeMode: 'Exclusive process'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceAvailable: 1
DeviceSelected: 1
The same on mentat002 with MATLAB R2023b is not graced with success:
Error using gpuDevice
Failed to initialize graphics driver for computation. The CUDA error was: CUDA_ERROR_UNKNOWN
@jkosciessa based on the error messages in the comment above, e.g.
Error using gpuDevice (line 26)
Graphics driver is out of date. Download and install the latest graphics driver for your GPU from NVIDIA.
It could be a driver issue, or you are loading a too-new CUDA library. Making a GPU application work is a complex compatibility match between the kernel driver, the CUDA library, and the user application.
We haven't upgraded the NVIDIA Linux kernel driver since the nodes were installed, and newly installed GPU nodes might have a newer kernel driver. This is why I was asking for the compute node on which you run into this issue. mentat004 and mentat002 are just access nodes from which you start an interactive job. The interactive job runs on a compute node on which the GPU is allocated. You should be able to tell the compute node by running hostname in the interactive job.
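For example, a minimal sketch of what one could put at the top of a job script to record where the job actually lands (a sketch only; nvidia-smi may be absent on some hosts):

```shell
# Log the compute node and driver at job start, so failures can be traced
# back to a specific host later (nvidia-smi is absent on non-GPU nodes).
echo "Running on: $(hostname)"
nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
  || echo "nvidia-smi not available here"
```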
@hurngchunlee The hostname in this particular case is dccn-c052.dccn.nl.
It would indeed explain the observed patterns if some jobs get launched on compute nodes with varying drivers etc., some of which MATLAB takes offense at.
Here are some nodes on which jobs successfully loaded a GPU, and some that crashed with gpuDevice errors:
Success: dccn-c047 (compute capability 6.0), dccn-c077 (compute capability 8.0)
Fail: dccn-c078
@hurngchunlee Looking at multiple output logs, the above hosts very reliably either detect the GPU or crash. In most crashes, dccn-c078 becomes immediately reallocated to the next scheduled job, which explains the high failure rate of our GPU jobs.
@hurngchunlee Here is a script that an admin could use to check for systematic differences between those nodes. Can someone directly access the nodes without scheduling them first? Or directly address the desired node in a job?
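The attached script itself is not reproduced in this thread; a minimal sketch of a per-node check that would produce the fields seen in the output further below (an admin would presumably run it via ssh on each node) might look like:

```shell
# Sketch of a node check (an assumption; the actual attached script is not
# shown). check_host prints kernel, NVIDIA driver, and nvcc information
# for the machine it runs on.
check_host() {
  echo "Node: $(hostname -f 2>/dev/null || hostname)"
  echo "Kernel Version: $(uname -r)"
  # The loaded kernel driver version, if the NVIDIA module is present.
  if [ -r /proc/driver/nvidia/version ]; then
    echo "NVIDIA Driver Version: $(awk '/NVRM/ {print $8}' /proc/driver/nvidia/version)"
  fi
  if command -v nvcc >/dev/null; then
    nvcc --version | tail -n 1
  else
    echo "CUDA not installed or nvcc not available"
  fi
  echo "---"
}
check_host
```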
weird ... the hardware and configuration on dccn-c078 is identical to dccn-c077 ... so I don't see the reason why it runs fine on dccn-c077 but failed on dccn-c078.
Do you know which CUDA library you are using? On those nodes (and Torque GPU nodes in general), the kernel driver supports only up to CUDA 11.2. If you happen to use a newer version, you will get a CUDA error.
@hurngchunlee Interesting, according to gpuDevice(), my current job uses ToolkitVersion: 11.8000 (see also this output).
But this is also the case for the successful deployments.
@hurngchunlee Here is a script that an admin could use to check for systematic differences between those nodes. Can someone directly access the nodes without scheduling them first? Or directly address the desired node in a job?
here after is the result:
Checking dccn-c047.dccn.nl...
Node: dccn-c047.dccn.nl
Kernel Version: 4.19.94-300.el7.x86_64
NVIDIA Driver Version: 460.73.01
CUDA not installed or nvcc not available
---
Checking dccn-c077.dccn.nl...
Node: dccn-c077.dccn.nl
Kernel Version: 4.19.94-300.el7.x86_64
NVIDIA Driver Version: 460.73.01
460.73.01
CUDA not installed or nvcc not available
---
Checking dccn-c078.dccn.nl...
Node: dccn-c078.dccn.nl
Kernel Version: 4.19.94-300.el7.x86_64
NVIDIA Driver Version: 460.73.01
460.73.01
CUDA not installed or nvcc not available
---
In my environment, I don't load CUDA by default; therefore nvcc is not available.
@hurngchunlee Here is an updated file that loads the cuda module. I suppose the CUDA version would be illuminating?
The DriverVersion is always missing (N/A). Perhaps this gets closer to the issue, as ideally it should be at least as large as ToolkitVersion. If ToolkitVersion is already 11.8, this may lead to compatibility issues with a DriverVersion <= 11.2?
I don't know what DriverVersion it refers to. If it is the kernel driver version, it is something like 460.73.01. You can find this information in the first line of the nvidia-smi output:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:21:00.0 Off | 0 |
| N/A 41C P0 38W / 250W | 0MiB / 40536MiB | 33% E. Process |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-PCIE-40GB Off | 00000000:E2:00.0 Off | 0 |
| N/A 38C P0 58W / 250W | 27650MiB / 40536MiB | 0% E. Process |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 1 N/A N/A 28886 C ...R2023a/bin/glnxa64/MATLAB 27647MiB |
+-----------------------------------------------------------------------------+
The CUDA Version: 11.2 indicates the highest CUDA library version the driver supports.
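Based on this, a job script could refuse to start a GPU run when the driver's CUDA ceiling is below what a given MATLAB release ships; a hedged sketch (the 11.2 threshold and the parsing of the nvidia-smi header are assumptions based on the output above):

```shell
# Sketch: compare the driver's CUDA ceiling (from the nvidia-smi header line
# shown above) against an assumed requirement of 11.2 before launching MATLAB.
version_ge() {  # true if $1 >= $2 (dotted version compare via sort -V)
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null; then
  supported=$(nvidia-smi | grep -oE 'CUDA Version: [0-9.]+' | grep -oE '[0-9.]+')
  if version_ge "$supported" "11.2"; then
    echo "OK: driver supports CUDA $supported (>= 11.2)"
  else
    echo "Driver ceiling is CUDA $supported; pick a MATLAB whose toolkit is <= $supported"
  fi
else
  echo "nvidia-smi not available on this host"
fi
```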
@hurngchunlee Ok, some clarity: it seems that MATLAB R2024a dropped support for the currently installed ToolkitVersion 11.8, as per the release notes. I suppose this means that R2024a will not be usable for GPU jobs on the HPC (despite being the current default).
Still doesn't solve the mystery of dccn-c078 for me, but good to know...
MATLAB R2022a should then be the most robust version for use with GPUs, with driver and toolkit (i.e., supplementary tools) at 11.2:
CUDADevice with properties:
Name: 'Tesla P100-PCIE-16GB'
Index: 1
ComputeCapability: '6.0'
SupportsDouble: 1
DriverVersion: 11.2000
ToolkitVersion: 11.2000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.7072e+10
AvailableMemory: 1.6672e+10
MultiprocessorCount: 56
ClockRateKHz: 1328500
ComputeMode: 'Exclusive process'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceAvailable: 1
DeviceSelected: 1
It does make sense to read Release Notes 😅.
Any new setup will be applied to Slurm. The Torque cluster is managed as-is until we migrate the full cluster to Slurm. There is no plan to perform any upgrades on the current Torque cluster because the OS running on it is already EOL.
On Slurm, the OS is new and the driver supports up to 12.2.
I would encourage using Slurm. You mentioned that there is a limit of 2 GPUs per user; how many concurrent GPUs do you need? We could temporarily increase it to 4 for you, but I think there should be a limit to avoid one user blocking all GPUs (especially since in Slurm we have only 11 GPUs at the moment).
MATLAB R2022b should then be the most robust version for use with GPUs with drivers and toolkit (i.e., supplementary tools) at 11.2: (CUDADevice output as above)
Could be ... but I cannot guarantee it. I have never run any GPU programs through MATLAB.
@hurngchunlee That's good to hear... so MATLAB R2024a is supported on Slurm then. That explains why our jobs all run fine there.
I wouldn't create special rules for me right now. There will soon be multiple people running GPU-dependent simulations, hence a shared scheduling bottleneck. That's why we need to create some recommendations on what combinations work. I am fully in favor of migrating GPUs toward Slurm.
We can for now try to specify module load matlab/R2022b as the default for our jobs and see whether that does the trick on PBS. @MaCuinea @sirmrmarty
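A hypothetical sketch of that default at the top of a PBS job script (the -batch entry point run_pipeline is a placeholder, not the actual PRESTUS call):

```shell
# Pin the MATLAB release before anything else in the job script; on the Torque
# nodes the driver ceiling is CUDA 11.2, which matches R2022b's bundled toolkit.
matlab_module="matlab/R2022b"
module load "$matlab_module" 2>/dev/null || echo "module command not available outside the cluster"
# Placeholder launch line; the real jobs call single_subject_pipeline via -batch.
launch="matlab -singleCompThread -nodisplay -batch run_pipeline"
echo "$launch"
```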
I would encourage using Slurm. You mentioned that there is a limit of 2 GPUs per user, how many concurrent GPUs do you need? We could temporarily increase it to 4 for you; but I think it should be a limit there to avoid one user blocking all GPUs (despite that in Slurm we have only 11 GPUs at the moment).
If you need us to increase it, please send a ticket to helpdesk@donders.ru.nl with me in c.c.
@hurngchunlee On R2022b (CUDA driver spec 11.2), GPU jobs run fine except on node dccn-c078. There, gpuDevice reliably fails. Is there a way to exclude this node from a job call? It would also be great if you could check the driver versions on that node again.
@jkosciessa thanks for pinning it down to this particular node. I also checked it with a GPU sample program from NVIDIA. The same CUDA executable that runs fine on dccn-c077 fails on dccn-c078. This is consistent with your finding. We have set the node offline for investigation.
@jkosciessa after restarting the server dccn-c078, my test program runs OK with CUDA 11.2. I will bring it online again. Could you make a quick test using MATLAB?
@hurngchunlee That's good to hear. Is there a way to specifically schedule this server? If not we can look for it in new jobs. @sirmrmarty could you look out for this server in new jobs?
Is there a way to specifically schedule this server?
You could add nodes=dccn-c078.dccn.nl as part of the resource requirement (i.e. the value of the -l option).
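For illustration, the full call could then look like this (walltime and mem values are placeholders; only the nodes= part comes from the suggestion above):

```shell
# Build a qsub call that pins the job to a specific compute node.
node="dccn-c078.dccn.nl"
resources="nodes=${node}:gpus=1,walltime=01:00:00,mem=8gb"
echo "qsub -I -l '${resources}'"
```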
Hey @hurngchunlee and @jkosciessa,
I ran a few things on the cluster and specified node=dccn-c078.dccn.nl. The transducer positioning scripts didn't run successfully (error below). The simulations seem to run for now (I will give an update if that changes). No CUDA error reported so far.
Here is an output file of a transducer positioning run; I assume the failure is caused by the script, not by CUDA:
----------------------------------------
Begin PBS Prologue Wed Oct 9 13:16:38 CEST 2024 1728472598
Job ID: 54309434.dccn-l029.dccn.nl
Username: marwim
Group: neuromod
Asked resources: nodes=1:gpus=1,mem=8gb,walltime=01:00:00,neednodes=1:gpus=1
Queue: short
Nodes: dccn-c078.dccn.nl
----------------------------------------
Limiting memory+swap to 9126805504 bytes ...
End PBS Prologue Wed Oct 9 13:16:38 CEST 2024 1728472598
----------------------------------------
Starting matlab/R2022b
Executing /opt/matlab/R2022b/bin/matlab -singleCompThread -nodisplay -batch tp39b22298_7ae2_4afd_b003_e44c04711a39
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/functions/../functions
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/functions/../toolboxes and subfolders
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/toolboxes/k-wave/k-Wave
Current target: left_PUL
status =
0
result =
0x0 empty char array
simnibs_coords =
-16.2782 -25.9278 7.4805
target =
91 97 155
ans =
Columns 1 through 7
181.9321 -5.9618 236.1018 31.4229 172.0989 8.2863 226.2686
Column 8
45.6710
----------------------------------------
Begin PBS Epilogue Wed Oct 9 13:17:08 CEST 2024 1728472628
Job ID: 54309434.dccn-l029.dccn.nl
Job Exit Code: 1
Username: marwim
Group: neuromod
Job Name: tusim_tp_sub-002
Session: 11348
Asked resources: nodes=1:gpus=1,mem=8gb,walltime=01:00:00,neednodes=1:gpus=1
Used resources: cput=00:00:19,walltime=00:00:26,mem=2975182848b
Queue: short
Nodes: dccn-c078.dccn.nl
End PBS Epilogue Wed Oct 9 13:17:08 CEST 2024 1728472628
----------------------------------------
Here the error log
Caught "std::exception" Exception message is:
merge_sort: failed to synchronize
The full simulation appears to have completed without error; perhaps the problem above arises from rerunning the script over existing outputs?
I will close this issue for now. With the updated documentation and the node debugging, I am optimistic that jobs should now consistently run with qsub again (as well as with SLURM).
I encounter occasional CUDA errors during acoustic simulations. I find this error hard to debug because a comparable simulation in CPU mode appears to run without problems.