Issues running on GPUs on Perlmutter

zhichen3 commented 1 year ago

I'm trying to run some problems on GPUs on Perlmutter.

However, according to the output file, the run gets stuck at "Initializing rate table" and is eventually cancelled when the job hits its time limit. The output file looks something like this:

Initializing CUDA...
CUDA initialized with 1 GPU per MPI rank; 64 GPU(s) used in total
MPI initialized with 64 MPI processes
MPI initialized with thread support level 3
AMReX (22.09-18-g3e5cc7780280) initialized

Starting run at 05:31:02 UTC on 2022-10-29.
Successfully read inputs file ... 

Castro git describe: 22.09-3-g41697201c
AMReX git describe: 22.09-18-g3e5cc7780
Microphysics git describe: 22.09

reading extern runtime parameters ...

 Initializing rate table
slurmstepd: error: *** STEP 3515755.0 ON nid001025 CANCELLED AT 2022-10-29T07:31:00 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 3515755 ON nid001025 CANCELLED AT 2022-10-29T07:31:00 DUE TO TIME LIMIT ***

I then ran the executable under cuda-memcheck and got the following warning/error message: ========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuPointerGetAttribute. However, the code does appear to be doing calculations under cuda-memcheck, because I can see it initializing grids and other things.

I also ran cuda-memcheck on the Sedov problem executable and got the same CUDA_ERROR_INVALID_VALUE message. However, when I run the Sedov executable directly, without cuda-memcheck, it runs fine according to the output file.

zhichen3 commented 1 year ago

I also tried setting network.use_tables = 0, but that still doesn't seem to work.