ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

CUDA Error "Illegal instructions" #4548

Closed weqoll closed 1 year ago

weqoll commented 1 year ago

Hello everyone!

My program, based on PIConGPU 0.5.0, sometimes fails during a run with an error message like this:

Warning: using existing folder on user-request [-f]
Running program...
==> Error: Spec 'picongpu@0.5.0%gcc@7.3.0~adios+hdf5~isaac+png backend=cuda cudacxx=nvcc arch=linux-ubuntu20.04-skylake_avx512 ^autoconf@2.69%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^automake@1.16.3%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^berkeley-db@18.1.40%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^boost@1.70.0%gcc@7.3.0+atomic+chrono~clanglibcpp~container~context~coroutine+date_time~debug+exception~fiber+filesystem+graph~icu+iostreams+locale+log+math~mpi+multithreaded~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout+wave cxxstd=11 visibility=hidden arch=linux-ubuntu20.04-skylake_avx512 ^bzip2@1.0.8%gcc@7.3.0+shared arch=linux-ubuntu20.04-skylake_avx512 ^cmake@3.19.2%gcc@7.3.0~doc+ncurses+openssl+ownlibs~qt arch=linux-ubuntu20.04-skylake_avx512 ^cuda@10.2.89%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^diffutils@3.7%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^freetype@2.10.1%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^gdbm@1.18.1%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^hdf5@1.10.7%gcc@7.3.0~cxx~debug~fortran~hl~java+mpi+pic+shared~szip~threadsafe api=none arch=linux-ubuntu20.04-skylake_avx512 ^hwloc@2.4.0%gcc@7.3.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared arch=linux-ubuntu20.04-skylake_avx512 ^libevent@2.1.12%gcc@7.3.0+openssl arch=linux-ubuntu20.04-skylake_avx512 ^libiconv@1.16%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^libpciaccess@0.16%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^libpng@1.6.37%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^libsigsegv@2.12%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^libsplash@1.7.0%gcc@7.3.0~ipo+mpi build_type=RelWithDebInfo patches=669608721dfce0ada7cef1ac84344352791a8916b7bb98ca8a0d4e6d4670e744 arch=linux-ubuntu20.04-skylake_avx512 ^libtool@2.4.6%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^libxml2@2.9.10%gcc@7.3.0~python 
arch=linux-ubuntu20.04-skylake_avx512 ^lz4@1.9.2%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^m4@1.4.18%gcc@7.3.0+sigsegv patches=3877ab548f88597ab2327a2230ee048d2d07ace1062efe81fc92e91b7f39cd00,fc9b61654a3ba1a8d6cd78ce087e7c96366c290bc8d2c299f09828d793b853c8 arch=linux-ubuntu20.04-skylake_avx512 ^ncurses@6.2%gcc@7.3.0~symlinks+termlib arch=linux-ubuntu20.04-skylake_avx512 ^numactl@2.0.14%gcc@7.3.0 patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94 arch=linux-ubuntu20.04-skylake_avx512 ^openmpi@4.0.5%gcc@7.3.0~atomics~cuda~cxx~cxx_exceptions+gpfs~java~legacylaunchers~lustre~memchecker~pmi~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath fabrics=none schedulers=none arch=linux-ubuntu20.04-skylake_avx512 ^openssl@1.1.1i%gcc@7.3.0+systemcerts arch=linux-ubuntu20.04-skylake_avx512 ^perl@5.32.0%gcc@7.3.0+cpanm+shared+threads arch=linux-ubuntu20.04-skylake_avx512 ^pkgconf@1.7.3%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^pngwriter@0.7.0%gcc@7.3.0~ipo build_type=RelWithDebInfo arch=linux-ubuntu20.04-skylake_avx512 ^popt@1.16%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^readline@8.0%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^rsync@3.2.2%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^util-macros@1.19.1%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^xxhash@0.7.4%gcc@7.3.0 arch=linux-ubuntu20.04-skylake_avx512 ^xz@5.2.5%gcc@7.3.0~pic arch=linux-ubuntu20.04-skylake_avx512 ^zlib@1.2.11%gcc@7.3.0+optimize+pic+shared arch=linux-ubuntu20.04-skylake_avx512 ^zstd@1.4.5%gcc@7.3.0+pic arch=linux-ubuntu20.04-skylake_avx512' matches no installed packages.
PIConGPU: 0.5.0
  Build-Type: Release

Third party:
  OS:         Linux-5.15.0-69-generic
  arch:       x86_64
  CXX:        GNU (7.3.0)
  CMake:      3.19.2
  CUDA:       10.2.89
  mallocMC:   2.3.1
  Boost:      1.70.0
  MPI:        
    standard: 3.1
    flavor:   OpenMPI (4.0.5)
  PNGwriter:  0.7.0
  libSplash:  1.7.0 (Format 4.0)
  ADIOS:      NOTFOUND
Dimension y: Local grid size is not a multiple of supercell size. Auto adjust from 1500 to 1504
Dimension y: Local grid size is not a multiple of supercell size. Auto adjust from 1500 to 1504
Dimension y: Local grid size is not a multiple of supercell size. Auto adjust from 1500 to 1504
Dimension y: Local grid size is not a multiple of supercell size. Auto adjust from 1500 to 1504
 new grid size (global|local|offset): {6000,6016}|{6000,1504}|{0,1504}
Dimension y: Invalid global grid size. Auto adjust from 6000 to 6016
 new grid size (global|local|offset): {6000,6016}|{6000,1504}|{0,0}
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
 new grid size (global|local|offset): {6000,6016}|{6000,1504}|{0,3008}
 new grid size (global|local|offset): {6000,6016}|{6000,1504}|{0,4512}
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider2XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.17933 ? 1
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
   Estimates are based on DensityRatio to BASE_DENSITY of each species
   (see: density.param, speciesDefinition.param).
   It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0292432
PIConGPUVerbose PHYSICS(1) | species i: omega_p * dt <= 0.1 ? 0.000108403
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 78
PIConGPUVerbose PHYSICS(1) | macro particles per device: 36096000
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 1679.38
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 2.99792e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 1.52981e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 2.69065e-16
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.70451e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 56856.3
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 1.37492e-10
initialization time: 37sec 995msec = 37 sec
  0 % =        0 | time elapsed:                   69msec | avg time per step:   0msec
  5 % =      600 | time elapsed:             7sec 492msec | avg time per step:  12msec
 10 % =     1200 | time elapsed:            14sec 912msec | avg time per step:  12msec
 15 % =     1800 | time elapsed:            22sec 328msec | avg time per step:  12msec
 20 % =     2400 | time elapsed:            29sec 754msec | avg time per step:  12msec
 25 % =     3000 | time elapsed:            37sec 173msec | avg time per step:  12msec
 30 % =     3600 | time elapsed:            44sec 597msec | avg time per step:  12msec
 35 % =     4200 | time elapsed:            52sec  26msec | avg time per step:  12msec
 40 % =     4800 | time elapsed:            59sec 452msec | avg time per step:  12msec
 45 % =     5400 | time elapsed:       1min  6sec 882msec | avg time per step:  12msec
 50 % =     6000 | time elapsed:       1min 14sec 307msec | avg time per step:  12msec
 55 % =     6600 | time elapsed:       1min 21sec 739msec | avg time per step:  12msec
 60 % =     7200 | time elapsed:       1min 29sec 179msec | avg time per step:  12msec
 65 % =     7800 | time elapsed:       1min 36sec 612msec | avg time per step:  12msec
 70 % =     8400 | time elapsed:       1min 44sec 107msec | avg time per step:  12msec
 75 % =     9000 | time elapsed:       1min 51sec 627msec | avg time per step:  12msec
 80 % =     9600 | time elapsed:       2min  6sec 177msec | avg time per step:  12msec
 85 % =    10200 | time elapsed:       2min 27sec 441msec | avg time per step:  12msec
Unhandled exception of type 'St13runtime_error' with message '/home/astashkin/src/spack/opt/spack/linux-ubuntu20.04-skylake_avx512/gcc-7.3.0/picongpu-0.5.0-amn6iosbrogl6vfaxdk3cvedhdl2lv7p/thirdParty/alpaka/include/alpaka/event/EventCudaRt.hpp(183) 'ret = cudaEventQuery( event.m_spEventImpl->m_CudaEvent)' returned error  : 'cudaErrorIllegalInstruction': 'an illegal instruction was encountered'!', terminating

The main issue is the last line of this output, which reports that an illegal instruction was encountered. The error occurs randomly; I have found nothing in common between the runs in which it appeared.

I first ran into it during long-running calculations, so I suspected a hardware problem related to the temperature of my GPUs. My attempts to debug it have not revealed the root cause. Moreover, the error now also occurs in short calculations such as the one shown above.

GPU state at the time of termination:

$ nvidia-smi
Mon May  1 21:06:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 25%   43C    P8    21W / 250W |  10671MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 18%   38C    P8    20W / 250W |  10671MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 35%   52C    P2    63W / 250W |  10671MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 24%   42C    P8    19W / 250W |  10671MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 16%   36C    P8     2W / 250W |      8MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:3F:00.0 Off |                  N/A |
| 16%   37C    P8    20W / 250W |      8MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:40:00.0 Off |                  N/A |
| 17%   36C    P8    22W / 250W |      8MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:41:00.0 Off |                  N/A |
| 16%   39C    P8     7W / 250W |      8MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     28664      C   ....curr1/input/bin/picongpu    10663MiB |
|    1   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A     28665      C   ....curr1/input/bin/picongpu    10663MiB |
|    2   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A     28666      C   ....curr1/input/bin/picongpu    10663MiB |
|    3   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A     28667      C   ....curr1/input/bin/picongpu    10663MiB |
|    4   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
|    7   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

The main thing I can think of is a possible memory overflow. However, I previously launched three identical calculations and they finished successfully. Could this be related to memory leaks?

My program calculates the interaction between a laser pulse and neutral argon gas, with ionization during pulse propagation. Ionization is implemented in the PIC code with the ADKLin model. The PIConGPU version is 0.5.0.
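To put a number on the memory question, here is a minimal sketch that computes the per-device headroom from the figures in the nvidia-smi table above. The helper name `headroom` is hypothetical and the values are simply copied from the table; nothing is queried from the driver.

```python
from typing import Tuple

# Memory figures copied from the nvidia-smi table above (MiB).
TOTAL_MIB = 11019
USED_MIB = 10671  # reported for each of GPUs 0-3, one picongpu process per GPU


def headroom(used_mib: int, total_mib: int) -> Tuple[int, float]:
    """Return (free MiB, used fraction) for one device."""
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    return total_mib - used_mib, used_mib / total_mib


free_mib, used_frac = headroom(USED_MIB, TOTAL_MIB)
print(f"free: {free_mib} MiB, used: {used_frac:.1%}")
# prints "free: 348 MiB, used: 96.8%"
```

As far as I know, PIConGPU allocates most of the device memory up front and leaves a fixed reserve (I believe this is `reservedGpuMemorySize` in `memory.param`, about 350 MiB by default, but please treat that as an assumption). If that holds, a nearly full device in nvidia-smi would be expected and would not by itself indicate a leak.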

Could you help me with this one? Maybe you have some experience with debugging such issues. Thanks for your help in advance!

Best regards, Egor Astashkin

psychocoderHPC commented 1 year ago

Dear @weqoll, 0.5.0 is an ancient version; I suggest testing at least 0.6.0 or the current dev branch. The dev branch will soon be released as 0.7.0. Switching versions mostly requires only minor param changes. Note that at the moment we do not support compiling PIConGPU with spack; you can still use it to manage your dependencies, but we are currently not updating our spack recipe.

One possible reason for this error is that you compiled for the wrong compute architecture. A driver issue is possible too. Running out of memory is also possible, because the X server is running on all your GPUs. If the X server is active, all kernels running longer than 13 seconds under Linux will be killed, which crashes the simulation. I suggest disabling the X server and using this machine only via a terminal, without a GUI. The X server is the most likely root of your issue and could explain why some simulations passed and others crashed.
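To check which GPUs share a graphics process with a compute job, a rough sketch like the following could be run on saved nvidia-smi output. It assumes the "Processes" table layout shown in the report above (driver 470); the regular expression and the function name `processes_by_type` are illustrative, not part of any NVIDIA or PIConGPU tool.

```python
import re
from collections import defaultdict
from typing import Dict, List

# A few rows copied from the "Processes" table in the nvidia-smi output above.
SAMPLE = """\
|    0   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     28664      C   ....curr1/input/bin/picongpu    10663MiB |
|    4   N/A  N/A      1958      G   /usr/lib/xorg/Xorg                  4MiB |
"""

# Assumed row layout: GPU, GI, CI, PID, Type, Process name, Memory.
ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+(?P<type>[CG+]+)\s+"
    r"(?P<name>\S+)\s+(?P<mem>\d+)MiB\s+\|\s*$"
)


def processes_by_type(text: str, wanted: str) -> Dict[int, List[str]]:
    """Map GPU index -> process names whose type contains `wanted` ("G" or "C")."""
    found = defaultdict(list)
    for line in text.splitlines():
        m = ROW.match(line)
        if m and wanted in m.group("type"):
            found[int(m.group("gpu"))].append(m.group("name"))
    return dict(found)


graphics = processes_by_type(SAMPLE, "G")
compute = processes_by_type(SAMPLE, "C")
# GPUs where a graphics process (e.g. Xorg) and a compute process coexist.
overlap = sorted(set(graphics) & set(compute))
print("GPUs with both graphics and compute processes:", overlap)
```

For the three sample rows this reports GPU 0 only. Applied to the full table from the report, GPUs 0 to 3 would show up, since each of them lists both Xorg and a picongpu process.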

BrianMarre commented 1 year ago

@weqoll any status updates? Did you encounter further issues? Did it work? If the suggestions by @psychocoderHPC resolved your problems, please remember to close the issue ;), thanks!

weqoll commented 1 year ago

Sorry for leaving this issue in this state; I wasn't able to work on it for some time. I'll reopen the issue if necessary.