How to use Cuda? - Githubissues

pgrete commented 1 year ago

I'm in the process of adding Ascent support to Parthenon (a performance portable [through Kokkos] adaptive mesh refinement framework), see https://github.com/parthenon-hpc-lab/parthenon/pull/810

Things seem to run fine on the host, but when trying to run on GPUs (only tried Cuda so far), I get the following error at runtime: Kokkos::View ERROR: attempt to access inaccessible memory space (label="advected")

Is there anything special I need to do to "use" cuda?

Based on the Ascent debug output from the Parthenon config, it looks like my Ascent build should support Cuda:

-- Conduit was built with HDF5 Support
-- Looking for HDF5 at: /usr/local/hdf5/parallel
-- Found HDF5: /usr/local/hdf5/parallel/lib/libhdf5.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.12.2")  
-- HDF5_DIR_REAL=/usr/local/hdf5/parallel
-- Checking that found HDF5_INCLUDE_DIRS are in HDF5_DIR
-- HDF5_INCLUDE_DIRS=/usr/local/hdf5/parallel/include
--  /usr/local/hdf5/parallel/include includes HDF5_DIR (/usr/local/hdf5/parallel)
--  /usr/local/hdf5/parallel/include includes HDF5_REAL_DIR ()     
-- HDF5 is parallel:  TRUE                                                                     
-- Found Conduit: /usr/local/conduit-v0.8.6 (found version 0.8.6)
-- CONDUIT_VERSION             = 0.8.6
-- CONDUIT_INSTALL_PREFIX      = /usr/local/conduit-v0.8.6
-- CONDUIT_IMPORT_ROOT         = /usr/local/conduit-v0.8.6                  
-- CONDUIT_USE_CXX11           = TRUE                                                          
-- CONDUIT_USE_FMT             = TRUE                                                          
-- CONDUIT_INCLUDE_DIRS        = /usr/local/conduit-v0.8.6/include/conduit
-- CONDUIT_FORTRAN_ENABLED     = FALSE
-- CONDUIT_PYTHON_ENABLED      =                                                               
-- CONDUIT_PYTHON_EXECUTABLE   = /usr/bin/python                     
-- CONDUIT_PYTHON_MODULE_DIR   = /usr/local/conduit-v0.8.6/python-modules/
-- Conduit Relay features:
--  CONDUIT_RELAY_WEBSERVER_ENABLED = TRUE                                                     
--  CONDUIT_RELAY_HDF5_ENABLED      = TRUE   
--  CONDUIT_HDF5_DIR                = /usr/local/hdf5/parallel
--  CONDUIT_RELAY_ADIOS_ENABLED     = FALSE
--  CONDUIT_ADIOS_DIR               = 
--  CONDUIT_RELAY_SILO_ENABLED      = FALSE
--  CONDUIT_SILO_DIR                = 
--  CONDUIT_RELAY_MPI_ENABLED       = TRUE
-- Conduit imported targets: conduit::conduit conduit::conduit_mpi
-- The CUDA compiler identification is NVIDIA 11.6.124
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- ASCENT_VERSION             = 0.9.0
-- ASCENT_INSTALL_PREFIX      = /usr/local/ascent-develop
-- ASCENT_INCLUDE_DIRS        = /usr/local/ascent-develop/include/ascent
-- ASCENT_FORTRAN_ENABLED     = OFF
-- ASCENT_PYTHON_ENABLED      = 
-- ASCENT_PYTHON_EXECUTABLE   = /usr/bin/python 
-- ASCENT_SERIAL_ENABLED      = ON
-- ASCENT_MPI_ENABLED         = ON
-- ASCENT_CUDA_ENABLED        = ON
-- ASCENT_HIP_ENABLED         = OFF
-- ASCENT_OPENMP_ENABLED      = OFF
-- ASCENT_VTKH_ENABLED        = ON
-- ASCENT_CAMP_ENABLED        = TRUE
-- ASCENT_UMPIRE_ENABLED      = TRUE
-- ASCENT_RAJA_ENABLED        = 1
-- ASCENT_DRAY_ENABLED        = ON
-- ASCENT_APCOMP_ENABLED      = ON
-- ASCENT_OCCA_ENABLED        = 
-- ASCENT_BABELFLOW_ENABLED   = 
-- ASCENT_FIDES_ENABLED       = 
-- ASCENT_MFEM_ENABLED        = TRUE
-- ASCENT_MFEM_MPI_ENABLED    = FALSE
-- Ascent imported targets: ascent::ascent ascent::ascent_mpi

cyrush commented 1 year ago

@pgrete What kind of memory is being used to pass the mesh data?

If they are device-only pointers (instead of managed memory), that could cause an issue.

For fields, we can pass GPU pointers directly to Ascent, but other parts of the mesh need to be transformed on the GPU.

pgrete commented 1 year ago

advected is the data field to be plotted and it's a device pointer. So that should work as expected? @BenWibking suggested that an off-by-one error might also be causing the issue visible in #1100. I'll investigate (though it looks like I should first add nestsets info).

cyrush commented 1 year ago

Device to device for field data should work.

I am not sure how Kokkos annotates memory, but the error includes detailed info (label="advected"):

Kokkos::View ERROR: attempt to access inaccessible memory space (label="advected")

This leads me to believe the error happens while preparing the data to hand to Ascent.

pgrete commented 1 year ago

I found the issue:

mesh["fields/" + varname + "/values"].set_external(&data(icomp, 0, 0, 0), ncells);

would try to access the element at those indices from the host (even though data is a device View).

We now extract a subview first (which works on the host) and then extract the data pointer from the subview.

cyrush commented 1 year ago

@pgrete thanks for the updated info

Alpine-DAV / ascent

How to use Cuda? #1098