AMReX-Astro / Castro

Castro (Compressible Astrophysics): An adaptive mesh, astrophysical compressible (radiation-, magneto-) hydrodynamics simulation code for massively parallel CPU and GPU architectures.
http://amrex-astro.github.io/Castro

Issues with running things on GPU in Perlmutter #2288

Closed zhichen3 closed 1 year ago

zhichen3 commented 1 year ago

I'm trying to run some problems on the GPUs on Perlmutter.

However, the output file shows the run getting stuck at "Initializing rate table", and the job is eventually cancelled when it hits the time limit. The output file looks like this:

Initializing CUDA...
CUDA initialized with 1 GPU per MPI rank; 64 GPU(s) used in total
MPI initialized with 64 MPI processes
MPI initialized with thread support level 3
AMReX (22.09-18-g3e5cc7780280) initialized

Starting run at 05:31:02 UTC on 2022-10-29.
Successfully read inputs file ... 

Castro git describe: 22.09-3-g41697201c
AMReX git describe: 22.09-18-g3e5cc7780
Microphysics git describe: 22.09

reading extern runtime parameters ...

 Initializing rate table
slurmstepd: error: *** STEP 3515755.0 ON nid001025 CANCELLED AT 2022-10-29T07:31:00 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 3515755 ON nid001025 CANCELLED AT 2022-10-29T07:31:00 DUE TO TIME LIMIT ***

Then I ran the executable under cuda-memcheck and got a warning/error message: ========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuPointerGetAttribute. However, that run was actually doing calculations, because I could see it initializing grids and so on.

I then ran cuda-memcheck on the Sedov problem executable and got the same CUDA_ERROR_INVALID_VALUE message. However, when I run that executable directly, it runs fine according to the output file.

zhichen3 commented 1 year ago

I also tried setting network.use_tables = 0, but it still doesn't work.

zingale commented 1 year ago

can you update to the latest versions of AMReX, Castro, and Microphysics and try again?

zingale commented 1 year ago

also, what is the name of the inputs file you are using?

zhichen3 commented 1 year ago

I'll try again. I'm using inputs_He/inputs.He.1000Hz

zingale commented 1 year ago

I can reproduce this.

If I turn off the rate tables, it hangs on

reading extern runtime parameters ...

So I think that's actually where the problem lies.

zingale commented 1 year ago

Note: the problem is present as far back as version 22.06 (I haven't tried anything earlier).

zingale commented 1 year ago

The problem seems to be in eos_init()

If I add prints in Castro_setup.cpp, I get past read_params(), extern_init(), init_prob_parameters(), network_init()

zingale commented 1 year ago

The issue is the broadcast of the helmeos table:

Commenting out:

    amrex::ParallelDescriptor::Bcast(&f[0][0][0],    9 * imax * jmax);
    amrex::ParallelDescriptor::Bcast(&dpdf[0][0][0], 4 * imax * jmax);
    amrex::ParallelDescriptor::Bcast(&ef[0][0][0],   4 * imax * jmax);
    amrex::ParallelDescriptor::Bcast(&xf[0][0][0],   4 * imax * jmax);

and doing the read on all procs makes things work.

Going to close this issue and open one in Microphysics to handle a proper fix.