AMReX-Astro / Castro

Castro (Compressible Astrophysics): An adaptive mesh, astrophysical compressible (radiation-, magneto-) hydrodynamics simulation code for massively parallel CPU and GPU architectures.
http://amrex-astro.github.io/Castro

illegal memory access error subchandra on CUDA #2818

Closed zhichen3 closed 2 months ago

zhichen3 commented 3 months ago

I'm getting CUDA errors on the very first step of the subchandra problem:

CUDA error 700 in file /home/zhi/github/amrex/Src/Base/AMReX_GpuDevice.cpp line 614: an illegal memory access was encountered

To reproduce, compile subchandra with:

make -f GNUmakefile.nse_net USE_CUDA=TRUE USE_SIMPLIFIED_SDC=TRUE NETWORK_DIR=subch_base

Backtrace:

1: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x8970da]
    _ZN5amrex11BLBackTrace7handlerEi
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_BLBackTrace.cpp:99:7

 2: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x69f12c]
    _ZN5amrex18ParallelDescriptor5AbortEib
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_ParallelDescriptor.cpp:219:21

 3: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x627918]
    _ZN5amrex10Error_hostEPKcS1_
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX.cpp:243:1

 4: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x627857]
    _ZN5amrex5AbortEPKc inlined at /global/homes/z/zhichen/Github/amrex/Src/Base/AMReX.cpp:214:6 in _ZN5amrex5AbortERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX.H:156:1
_ZN5amrex5AbortERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX.cpp:214:6

 5: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x6fbec8]
    _ZN5amrex3Gpu6Device17streamSynchronizeEv
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_GpuDevice.cpp:613:464

 6: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x430792]
    _ZN5amrex3Gpu17streamSynchronizeEv
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_GpuDevice.H:242:1

 7: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x805178]
    _ZN5amrex6MFIter8FinalizeEv
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_MFIter.cpp:240:1

 8: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x8050e0]
    _ZN5amrex6MFIterD2Ev
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_MFIter.cpp:213:1

 9: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x6200d1]
    _ZN6Castro11react_stateEdd
/global/homes/z/zhichen/Github/Castro/Source/reactions/Castro_react.cpp:816:39

10: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x61ee11]
    _ZN6Castro16do_new_reactionsEdd
/global/homes/z/zhichen/Github/Castro/Source/reactions/Castro_react.cpp:74:33

11: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x5b5d41]
    _ZN6Castro22post_advance_operatorsEdd
/global/homes/z/zhichen/Github/Castro/Source/sources/Castro_sources.cpp:618:44

12: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x497d36]
    _ZN6Castro14do_advance_ctuEdd
/global/homes/z/zhichen/Github/Castro/Source/driver/Castro_advance_ctu.cpp:122:68

13: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x498b60]
    _ZN6Castro20subcycle_advance_ctuEddii
/global/homes/z/zhichen/Github/Castro/Source/driver/Castro_advance_ctu.cpp:391:60

14: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x490bb0]
    _ZN6Castro7advanceEddii
/global/homes/z/zhichen/Github/Castro/Source/driver/Castro_advance.cpp:69:53

15: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0xa15604]
    _ZN5amrex3Amr8timeStepEidiid
/global/homes/z/zhichen/Github/amrex/Src/Amr/AMReX_Amr.cpp:2022:44

16: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0xa15ddb]
    _ZN5amrex3Amr14coarseTimeStepEd
/global/homes/z/zhichen/Github/amrex/Src/Amr/AMReX_Amr.cpp:2133:26

17: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x4cd009]
    main
/global/homes/z/zhichen/Github/Castro/Source/driver/main.cpp:165:29

18: /lib64/libc.so.6(__libc_start_main+0xef) [0x7fe40ea3e24d]

19: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x40d12a]
    _start
../sysdeps/x86_64/start.S:122

zhichen3 commented 3 months ago

In cuda-gdb, I get "CUDA Exception: Lane User Stack Overflow" when evaluating *(d_num_failed.copyToHost()) at line 814 of Castro_react.cpp.

zingale commented 3 months ago

When I link, I get this message:

Stack size for entry function '_ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpMinEEE4evalINS_10ReduceDataIJNS_10ValLocPairIdNS_7IntVectEEEEEEZNS4_4evalINS_8FabArrayINS_9FArrayBoxEEESA_ZN6Castro13estdt_burningEiEUliiiiE_EENSt9enable_ifIXsr5amrex10IsFabArrayIT_vEE5valueEvE4typeERKSI_RKS8_RT0_OT1_EUliiiE_EEvRKNS_3BoxERSI_RKSP_EUlvE_EEvimP11CUstream_stRKT0_EUlvE_EEvS13_' cannot be statically determined

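So the compiler is telling us something is off in that function: the mangled name points at the reduction lambda in Castro::estdt_burning. This warning typically shows up when the kernel's call graph contains recursion or an indirect call, so the per-thread stack size can't be bounded at link time.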
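
The mangled names in the backtrace and in the link message are easier to read once demangled. A minimal sketch, assuming a GCC or Clang toolchain that provides the Itanium ABI header cxxabi.h; the helper name demangle is made up for illustration and is not part of Castro or AMReX:

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Hypothetical helper (not from Castro/AMReX): turn an Itanium-ABI mangled
// symbol into a readable name. Falls back to the input if demangling fails.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer, so release it with std::free.
    std::unique_ptr<char, void (*)(void*)> out(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
    return (status == 0 && out) ? std::string(out.get())
                                : std::string(mangled);
}
```

For example, demangle("_ZN6Castro11react_stateEdd") gives Castro::react_state(double, double), which is frame 9 of the backtrace. The c++filt command-line tool does the same job for one-off lookups.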

yut23 commented 3 months ago

I'm able to reproduce this on my workstation with inputs.N14.coarse (I don't have enough memory for the others).

zingale commented 2 months ago

Fixed by eliminating the recursion.
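
For readers hitting the same symptom, here is a generic sketch of what "eliminating recursion" can look like. This is not the actual Castro change: the table lookup and the names find_recursive and find_iterative are hypothetical and chosen only to illustrate the pattern. A self-recursive function called from device code keeps the toolchain from bounding the per-thread stack, while the loop version uses a fixed amount of stack.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical recursive lookup: index i in [lo, hi) such that
// table[i] <= x < table[i+1], clamped to the ends of the range.
// Called from a GPU kernel, the self-call is what makes the stack
// size impossible to determine statically.
inline std::size_t find_recursive(const double* table, std::size_t lo,
                                  std::size_t hi, double x) {
    if (hi - lo <= 1) return lo;
    const std::size_t mid = lo + (hi - lo) / 2;
    return (table[mid] <= x) ? find_recursive(table, mid, hi, x)
                             : find_recursive(table, lo, mid, x);
}

// Same search written as a loop: no self-call, constant stack usage.
inline std::size_t find_iterative(const double* table, std::size_t n,
                                  double x) {
    assert(n > 0);
    std::size_t lo = 0;
    std::size_t hi = n;
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (table[mid] <= x) {
            lo = mid;  // answer is at mid or to its right
        } else {
            hi = mid;  // answer is strictly left of mid
        }
    }
    return lo;
}
```

Both versions return the same index for a sorted table. In an AMReX code, the iterative one could additionally be marked with the usual host/device qualifiers.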