NEURON integrated tests failing with CUDA Unified Memory enabled

Describe the issue Some of the NEURON test are failing on GPU when CUDA Unified Memory is enabled in CoreNEURON. More precisely the tests that fail are:

The following tests FAILED:
         18 - coreneuron_modtests::direct_py (Failed)
         19 - coreneuron_modtests::direct_hoc (Failed)
         20 - coreneuron_modtests::spikes_py (Failed)
         21 - coreneuron_modtests::spikes_file_mode_py (Failed)
         22 - coreneuron_modtests::datareturn_py (Failed)
         25 - coreneuron_modtests::spikes_mpi_py (Failed)
         26 - coreneuron_modtests::spikes_mpi_file_mode_py (Failed)
         41 - testcorenrn_patstim::coreneuron_gpu_offline (Failed)
         45 - testcorenrn_patstim::compare_results (Failed)
         99 - testcorenrn_netstimdirect::direct (Failed)
        100 - testcorenrn_netstimdirect::compare_results (Failed)

To Reproduce Steps to reproduce the behavior:

git clone git@github.com:neuronsimulator/nrn.git
cd nrn
mkdir build_unified && cd build_unified
cmake .. -DCMAKE_INSTALL_PREFIX=./install -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_RX3D=OFF -DNRN_ENABLE_CORENEURON=ON -DNRN_ENABLE_TESTS=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_CU
DA_UNIFIED_MEMORY=ON -DCORENRN_ENABLE_OPENMP=OFF
make -j16
ctest --output-on-failure

Expected behavior GPU tests should be passing with Unified Memory as well.

Logs An example of a failing test (coreneuron_modtests::direct_py) when run with cuda-memcheck has the following output:

========= Invalid __global__ read of size 8
=========     at 0x00000730 in /gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x05121860 is out of bounds
=========     Device Frame:/gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) (coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) : 0x730)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so (cuLaunchKernel + 0x34e) [0x2efa6e]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch3 + 0x1d94) [0x1ca43]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so [0x1d7a5]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch + 0x13d) [0x1d8e4]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libacchost.so (__pgi_uacc_launch + 0x1f7) [0x463c0]
=========     Host Frame:./x86_64/special (_ZN10coreneuron16_nrn_cur__IClampEPNS_9NrnThreadEPNS_9Memb_listEi + 0x89b) [0x5702b]
=========     Host Frame:./x86_64/special [0x17f3bb]
=========     Host Frame:./x86_64/special (_ZN10coreneuron25setup_tree_matrix_minimalEPNS_9NrnThreadE + 0xe) [0x1814ae]
Failing in Thread:1
call to cuLaunchKernel returned error 719: Launch failed (often invalid pointer dereference)

The corresponding line that fails in the stim.cpp:

409:      #pragma acc atomic update
410:      _nt->nrn_fast_imem->nrn_sav_rhs[_nd_idx] += _rhs;
411:      #pragma acc atomic update
412:      _nt->nrn_fast_imem->nrn_sav_d[_nd_idx] -= _g;

System (please complete the following information)

OS: RedHat
Compiler: NVHPC 21.2
Version: master branch
Backend: GPU

BlueBrain / CoreNeuron

NEURON integrated tests failing with CUDA Unified Memory enabled #594