Donders-Institute / PRESTUS

PREprocessing & Simulations for Transcranial Ultrasound Stimulation package
GNU General Public License v3.0

CUDA errors #50

Closed jkosciessa closed 4 weeks ago

jkosciessa commented 3 months ago

I encounter occasional CUDA errors during acoustic simulations. I find this error hard to debug, because a comparable simulation in CPU mode appears to run without problems.

Error using .*
Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS

Error in kspaceFirstOrder3D (line 958)
                source_mat = real(ifftn(source_kappa .* fftn(source_mat)));

Error in run_simulations (line 59)
       sensor_data = kspaceFirstOrder3D(kgrid, medium, source, sensor, input_args_cell{:});

Error in single_subject_pipeline (line 272)
        sensor_data = run_simulations(kgrid, kwave_medium, source, sensor, kwave_input_args, parameters);

Error in tp50cc77e7_df67_41ac_b927_e85810643c23 (line 1)
load /project/2424103.01/thalstim_simulations/thalstim_sim/data/tussim/CTX500-026-010_79.6mm_pCT_60W/sub-002/batch_job_logs/tp55f45bc7_f0a5_4e23_9e69_49e5aece910b.mat; cd /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS; single_subject_pipeline(subject_id, parameters); delete /project/2424103.01/thalstim_simulations/thalstim_sim/data/tussim/CTX500-026-010_79.6mm_pCT_60W/sub-002/batch_job_logs/tp55f45bc7_f0a5_4e23_9e69_49e5aece910b.mat; delete /project/2424103.01/thalstim_simulations/thalstim_sim/data/tussim/CTX500-026-010_79.6mm_pCT_60W/sub-002/batch_job_logs/tp50cc77e7_df67_41ac_b927_e85810643c23.m;
jkosciessa commented 2 months ago

This may have something to do with the change to cudacap 8.0. Switching back to cudacap 5 runs jobs without CUDA errors, and the occurrence of NaNs during acoustic simulations may also be fixed (https://github.com/Donders-Institute/PRESTUS/issues/48).
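
For reference, an interactive request with the lower compute-capability constraint looks roughly like this (a minimal sketch; walltime and memory values are illustrative):

qsub -I -l 'nodes=1:gpus=1,feature=cuda,walltime=05:00:00,mem=24gb,reqattr=cudacap>=5.0'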

Incidentally, the following cudacap 8.0 GPU gave me errors even during water simulations. Perhaps it is broken? I don't know why DeviceAvailable=False, given that the GPU was part of the job.

[Screenshot attached: 2024-08-13, 10:53]
jkosciessa commented 1 month ago

Now regularly getting this when trying to run with qsub (mentat004):

Error using gpuDevice
Failed to initialize graphics driver for computation. The CUDA error was:
CUDA_ERROR_UNKNOWN

Error in single_subject_pipeline (line 101)
        gpuDevice()

Error in tpd34a1b53_42b4_4c29_86f6_ec4a8168db94 (line 1)
load /project/2423053.01/amythal_sim/amythal_sim/data/tussim/CTX500-026-010_79.6mm_60W/sub-001/batch_job_logs/tpc9087f44_187b_4137_8db4_4e91b438656c.mat; cd /project/2423053.01/amythal_sim/amythal_sim/tools/PRESTUS; single_subject_pipeline(subject_id, parameters); delete /project/2423053.01/amythal_sim/amythal_sim/data/tussim/CTX500-026-010_79.6mm_60W/sub-001/batch_job_logs/tpc9087f44_187b_4137_8db4_4e91b438656c.mat; delete /project/2423053.01/amythal_sim/amythal_sim/data/tussim/CTX500-026-010_79.6mm_60W/sub-001/batch_job_logs/tpd34a1b53_42b4_4c29_86f6_ec4a8168db94.m;

SLURM (mentat005) works fine, but apparently restricts users to two concurrent GPU jobs.

jkosciessa commented 1 month ago

On mentat002 and mentat004 in an interactive job with min. cudacap >=5.0; MATLAB R2024a (default):

Error using gpuDevice (line 26)
Graphics driver is out of date. Download and install the latest graphics driver for your GPU from NVIDIA.

On mentat002 in an interactive job with min. cudacap >=8.0; MATLAB R2022b (PRESTUS recommendation):

qsub -I -l 'nodes=1:gpus=1,feature=cuda,walltime=05:00:00,mem=24gb,reqattr=cudacap>=8.0'

Error using gpuDevice
Failed to initialize graphics driver for computation. The CUDA error was: CUDA_ERROR_UNKNOWN

On mentat004, I ran gpuDevice in MATLAB R2023b repeatedly with success:

 CUDADevice with properties:

                      Name: 'Tesla P100-PCIE-16GB'
                     Index: 1
         ComputeCapability: '6.0'
            SupportsDouble: 1
     GraphicsDriverVersion: '460.73.01'
               DriverModel: 'N/A'
            ToolkitVersion: 11.8000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 17071734784 (17.07 GB)
           AvailableMemory: 16663707648 (16.66 GB)
               CachePolicy: 'maximum'
       MultiprocessorCount: 56
              ClockRateKHz: 1328500
               ComputeMode: 'Exclusive process'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

The same on mentat002 with MATLAB R2023b fails:

Error using gpuDevice
Failed to initialize graphics driver for computation. The CUDA error was: CUDA_ERROR_UNKNOWN
hurngchunlee commented 1 month ago

@jkosciessa based on the error messages in the comment above, e.g.

Error using gpuDevice (line 26)
Graphics driver is out of date. Download and install the latest graphics driver for your GPU from NVIDIA.

It could be a driver issue, or you may be loading a CUDA library that is too new. Making a GPU application work requires a complex compatibility match between the kernel driver, the CUDA library, and the user application.

We haven't upgraded the NVIDIA Linux kernel driver since the nodes were first installed, and newly installed GPU nodes might have a newer kernel driver. This is why I was asking on which compute node you run into this issue. mentat004 and mentat002 are just access nodes from which you start an interactive job; the interactive job itself runs on a compute node on which the GPU is allocated.

You should be able to tell the compute node by running hostname in the interactive job.
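
For example, a minimal check from inside the interactive job (the nvidia-smi query is optional):

hostname                      # e.g. dccn-cXXX.dccn.nl
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader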

jkosciessa commented 1 month ago

@hurngchunlee The hostname in this particular case is dccn-c052.dccn.nl.

It would indeed explain the observed patterns if some jobs get launched on compute nodes with varying drivers etc., some of which MATLAB objects to.

Here are some nodes on which jobs successfully loaded a GPU, and some that crashed with gpuDevice errors:

Success: dccn-c047 (6.0), dccn-c077 (8.0)

Fail: dccn-c078

jkosciessa commented 1 month ago

@hurngchunlee Looking at multiple output logs, the hosts above very reliably either detect the GPU or crash. After most crashes, dccn-c078 is immediately reallocated to the next scheduled job, which explains the high failure rate of our GPU jobs.

jkosciessa commented 1 month ago

@hurngchunlee Here is a script that an admin could use to check for systematic differences between those nodes. Can someone directly access the nodes without scheduling them first? Or directly address the desired node in a job?

check_node_gpu.txt
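
For readers without access to the attachment, a minimal sketch of this kind of per-node check might look as follows (the node list, ssh-based access, and exact commands here are illustrative, not necessarily what the attached script does):

#!/bin/bash
# Sketch of a per-node GPU check (hypothetical; adjust node list and access method)
for node in dccn-c047.dccn.nl dccn-c077.dccn.nl dccn-c078.dccn.nl; do
    echo "Checking ${node}..."
    ssh "${node}" '
        echo "Node: $(hostname)"
        echo "Kernel Version: $(uname -r)"
        if command -v nvidia-smi >/dev/null; then
            echo "NVIDIA Driver Version: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
        else
            echo "nvidia-smi not available"
        fi
        if command -v nvcc >/dev/null; then
            nvcc --version | tail -n1
        else
            echo "CUDA not installed or nvcc not available"
        fi
    '
    echo "---"
done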

hurngchunlee commented 1 month ago

Weird... the hardware and configuration on dccn-c078 are identical to dccn-c077, so I don't see why it runs fine on dccn-c077 but fails on dccn-c078.

Do you know which CUDA library you are using? On those nodes (and on Torque GPU nodes in general), the kernel driver supports only up to CUDA 11.2. If you happen to use a newer version, you will get a CUDA error.

jkosciessa commented 1 month ago

@hurngchunlee Interesting, according to gpuDevice(), my current job uses ToolkitVersion: 11.8000 (see also this output).

But this is also the case for the successful deployments.

hurngchunlee commented 1 month ago

@hurngchunlee Here is a script that an admin could use to check for systematic differences between those nodes. Can someone directly access the nodes without scheduling them first? Or directly address the desired node in a job?

check_node_gpu.txt

Here is the result:

Checking dccn-c047.dccn.nl...
Node: dccn-c047.dccn.nl
Kernel Version: 4.19.94-300.el7.x86_64
NVIDIA Driver Version: 460.73.01
CUDA not installed or nvcc not available
---
Checking dccn-c077.dccn.nl...
Node: dccn-c077.dccn.nl
Kernel Version: 4.19.94-300.el7.x86_64
NVIDIA Driver Version: 460.73.01
460.73.01
CUDA not installed or nvcc not available
---
Checking dccn-c078.dccn.nl...
Node: dccn-c078.dccn.nl
Kernel Version: 4.19.94-300.el7.x86_64
NVIDIA Driver Version: 460.73.01
460.73.01
CUDA not installed or nvcc not available
---

In my environment, I don't load CUDA by default; therefore nvcc is not available.

jkosciessa commented 1 month ago

@hurngchunlee Here is an updated file that loads the cuda module. I suppose the CUDA version would be illuminating?

check_node_gpu.txt

jkosciessa commented 1 month ago

The DriverVersion is always missing (N/A). Perhaps this gets closer to the issue, as ideally it should be at least as high as the ToolkitVersion. If the ToolkitVersion is already 11.8, this may cause compatibility issues with a DriverVersion <= 11.2?

hurngchunlee commented 1 month ago

I don't know which DriverVersion this refers to. If it is the kernel driver version, it is something like 460.73.01. You can find this information in the first line of the nvidia-smi output:

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:21:00.0 Off |                    0 |
| N/A   41C    P0    38W / 250W |      0MiB / 40536MiB |     33%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      Off  | 00000000:E2:00.0 Off |                    0 |
| N/A   38C    P0    58W / 250W |  27650MiB / 40536MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     28886      C   ...R2023a/bin/glnxa64/MATLAB    27647MiB |
+-----------------------------------------------------------------------------+
hurngchunlee commented 1 month ago

The CUDA Version: 11.2 indicates the highest CUDA library version the driver supports.

jkosciessa commented 1 month ago

@hurngchunlee Ok, some clarity: it seems that MATLAB R2024a dropped support for the currently installed ToolkitVersion 11.8, as per the release notes. I suppose this means that R2024a will not be usable for GPU jobs on the HPC (despite being the current default).

Still doesn't solve the mystery of dccn-c078 for me, but good to know...

jkosciessa commented 1 month ago

MATLAB R2022b should then be the most robust version for use with GPUs whose driver and toolkit are at 11.2:

  CUDADevice with properties:

                      Name: 'Tesla P100-PCIE-16GB'
                     Index: 1
         ComputeCapability: '6.0'
            SupportsDouble: 1
             DriverVersion: 11.2000
            ToolkitVersion: 11.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 1.7072e+10
           AvailableMemory: 1.6672e+10
       MultiprocessorCount: 56
              ClockRateKHz: 1328500
               ComputeMode: 'Exclusive process'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

It does make sense to read Release Notes 😅.
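
For a quick sanity check of which driver/toolkit pairing a given MATLAB release sees, something along these lines should do (a minimal sketch; the module name follows the matlab/R2022b convention used on this cluster):

# load a MATLAB release whose GPU toolkit matches the node's CUDA 11.2 driver
module load matlab/R2022b
matlab -nodisplay -batch "disp(gpuDevice())"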

hurngchunlee commented 1 month ago

Any new setup will be applied to Slurm. The Torque cluster is kept as it is until we migrate the full cluster to Slurm. There is no plan to perform any upgrades on the current Torque cluster because the OS running on it is already EOL.

On Slurm, the OS is new and the driver supports up to 12.2.

I would encourage using Slurm. You mentioned that there is a limit of 2 GPUs per user; how many concurrent GPUs do you need? We could temporarily increase it to 4 for you, but I think there should be a limit to avoid one user blocking all GPUs (note that in Slurm we have only 11 GPUs at the moment).

hurngchunlee commented 1 month ago

MATLAB R2022b should then be the most robust version for use with GPUs whose driver and toolkit are at 11.2:

  CUDADevice with properties:

                      Name: 'Tesla P100-PCIE-16GB'
                     Index: 1
         ComputeCapability: '6.0'
            SupportsDouble: 1
             DriverVersion: 11.2000
            ToolkitVersion: 11.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 1.7072e+10
           AvailableMemory: 1.6672e+10
       MultiprocessorCount: 56
              ClockRateKHz: 1328500
               ComputeMode: 'Exclusive process'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

Could be... but I cannot guarantee it. I have never run any GPU programs through MATLAB.

jkosciessa commented 1 month ago

@hurngchunlee That's good to hear... so MATLAB R2024a is supported on SLURM then. That explains why our jobs all run fine there.

I wouldn't create special rules just for me right now. There will soon be multiple people running GPU-dependent simulations, hence a shared scheduling bottleneck. That's why we need to create some recommendations on which combinations work. I am fully in favor of migrating GPUs toward SLURM.

For now, we can try to specify module load matlab/R2022b as a default for our jobs and see whether that does the trick on PBS; a sketch of such a job script follows below. @MaCuinea @sirmrmarty
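
A hypothetical PBS job script along these lines (resource values are illustrative, and my_prestus_job stands in for whatever wrapper actually calls single_subject_pipeline):

#!/bin/bash
#PBS -l nodes=1:gpus=1,feature=cuda,walltime=05:00:00,mem=24gb,reqattr=cudacap>=8.0
# pin MATLAB R2022b, whose GPU toolkit (CUDA 11.2) matches the Torque node drivers
module load matlab/R2022b
matlab -singleCompThread -nodisplay -batch my_prestus_job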

hurngchunlee commented 1 month ago

I would encourage using Slurm. You mentioned that there is a limit of 2 GPUs per user; how many concurrent GPUs do you need? We could temporarily increase it to 4 for you, but I think there should be a limit to avoid one user blocking all GPUs (note that in Slurm we have only 11 GPUs at the moment).

If you need us to increase it, please send a ticket to helpdesk@donders.ru.nl with me in c.c.

jkosciessa commented 1 month ago

@hurngchunlee On R2022b (CUDA 11.2), GPU jobs run fine except on node dccn-c078, where gpuDevice reliably fails. Is there a way to exclude this node from a job call? It would also be great if you could check the driver versions on that node again.

hurngchunlee commented 1 month ago

@jkosciessa thanks for pinning it down to this particular node. I also checked it with a GPU sample program from NVIDIA. The same CUDA executable that runs fine on dccn-c077 fails on dccn-c078, which is consistent with your finding. We have set the node offline for investigation.

hurngchunlee commented 1 month ago

@jkosciessa after restarting the server dccn-c078, my test program runs OK with CUDA 11.2. I will bring it online again. Could you run a quick test using MATLAB?

jkosciessa commented 1 month ago

@hurngchunlee That's good to hear. Is there a way to specifically schedule this server? If not, we can look for it in new jobs. @sirmrmarty, could you look out for this server in new jobs?

hurngchunlee commented 1 month ago

Is there a way to specifically schedule this server?

You could add nodes=dccn-c078.dccn.nl as part of the resource requirement (i.e., the value of the -l option).
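
For example (a sketch; walltime, memory, and the gpus attribute are illustrative and may need adjusting):

qsub -I -l 'nodes=dccn-c078.dccn.nl:gpus=1,walltime=01:00:00,mem=8gb'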

sirmrmarty commented 1 month ago

Hey @hurngchunlee and @jkosciessa ,

I ran a few things on the cluster and specified nodes=dccn-c078.dccn.nl. The transducer positioning scripts didn't run successfully (error below). The simulations seem to run for now (I will give an update if that changes). No CUDA errors reported so far.

Here is the output of a transducer positioning run; I assume the error is due to the script, not CUDA:

----------------------------------------
Begin PBS Prologue Wed Oct  9 13:16:38 CEST 2024 1728472598
Job ID:        54309434.dccn-l029.dccn.nl
Username:      marwim
Group:         neuromod
Asked resources:   nodes=1:gpus=1,mem=8gb,walltime=01:00:00,neednodes=1:gpus=1
Queue:         short
Nodes:         dccn-c078.dccn.nl
----------------------------------------
Limiting memory+swap to 9126805504 bytes ...
End PBS Prologue Wed Oct  9 13:16:38 CEST 2024 1728472598
----------------------------------------
Starting matlab/R2022b
Executing /opt/matlab/R2022b/bin/matlab -singleCompThread -nodisplay -batch tp39b22298_7ae2_4afd_b003_e44c04711a39
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/functions/../functions
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/functions/../toolboxes and subfolders
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/toolboxes/k-wave/k-Wave
Current target: left_PUL

status =

     0

result =

  0x0 empty char array

simnibs_coords =

  -16.2782  -25.9278    7.4805

target =

    91    97   155

ans =

  Columns 1 through 7

  181.9321   -5.9618  236.1018   31.4229  172.0989    8.2863  226.2686

  Column 8

   45.6710

----------------------------------------
Begin PBS Epilogue Wed Oct  9 13:17:08 CEST 2024 1728472628
Job ID:        54309434.dccn-l029.dccn.nl
Job Exit Code:     1
Username:      marwim
Group:         neuromod
Job Name:      tusim_tp_sub-002
Session:       11348
Asked resources:   nodes=1:gpus=1,mem=8gb,walltime=01:00:00,neednodes=1:gpus=1
Used resources:    cput=00:00:19,walltime=00:00:26,mem=2975182848b
Queue:         short
Nodes:         dccn-c078.dccn.nl
End PBS Epilogue Wed Oct  9 13:17:08 CEST 2024 1728472628
----------------------------------------

Here is the error log:

Caught "std::exception" Exception message is:
merge_sort: failed to synchronize
jkosciessa commented 4 weeks ago

The full simulation appears to have completed without error; perhaps the problem above arises from rerunning the script over existing outputs?

I will close this issue for now. With the updated documentation and the node debugging, I am optimistic that jobs should now consistently run with qsub again (as well as with SLURM).