ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

Out of bounds error in hybrid PIC #5398

Open kli-jfp opened 6 days ago

kli-jfp commented 6 days ago

I have been running the hybrid PIC code, and recently I have started to get an out-of-bounds error. I haven't been able to diagnose it fully yet, so maybe someone can help, or perhaps someone else is experiencing the same issue.

This is the error I am getting:

(2147483647,2147483647,2147483647,0) is out of bound (-3:98,-3:99,-3:67,0:0)
amrex::Abort::0::CUDA error 700 in file /home/.../src/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 652: an illegal memory access was encountered
!!! SIGABRT

In the Backtrace:

===== TinyProfilers ======
main()
REG::WarpX::Evolve()
WarpX::Evolve()
WarpX::Evolve::step
WarpX::HybridPICEvolveFields()
WarpXParticleContainer::DepositCurrent::CurrentDeposition

Some more info:

  1. I have tried checking out an older WarpX commit (roughly 2 weeks old) and the same error occurs.
  2. I have removed the embedded boundary and the same error occurs.
  3. I have checked all particle fields and initial grid fields for NaN/Inf values, so the error seems to occur during the simulation. The fields and particles look normal and the energies are stable until the error occurs, seemingly at random. Perhaps a particle leaves the domain and something goes wrong in the scraping?
aeriforme commented 5 days ago

Hello! Have you tried reducing the time step? If the time step is too large, there might be particles that travel a distance larger than one cell size in one time step, which would give out-of-bounds errors.
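
For example, if you set a fixed time step in the text input deck, you could shrink it with something along these lines (I am assuming warpx.const_dt here; the value is only a placeholder):

# try a (much) smaller fixed time step; placeholder value, scale it to your grid and fields
warpx.const_dt = 1.0e-12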

roelof-groenewald commented 5 days ago

Hey @kli-jfp. Thanks for raising this issue. Would you mind rebuilding WarpX in debug mode (using CMAKE_BUILD_TYPE=Debug) and rerunning? That should hopefully give a clearer error message.
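
For example, a debug rebuild could look roughly like this (the build directory and the CUDA/3D options are assumptions based on your error message; adjust to your setup):

# reconfigure with debug symbols and assertions enabled, then rebuild
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -DWarpX_COMPUTE=CUDA -DWarpX_DIMS=3
cmake --build build -j 8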

@aeriforme's suggestion is a good one, but I would recommend just increasing the number of substeps used in the solver. That would help to suppress growth of Whistler waves without slowing down the simulation too much.
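
In the text input deck that would be something like the following (the value here is only an illustration):

# take more sub-steps of the B-field / Ohm's law update per particle push
hybrid_pic_model.substeps = 32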

kli-jfp commented 4 days ago

Thanks for getting back to me @aeriforme and @roelof-groenewald.

Things I have now tried:

  1. Reducing the time step (dramatically). Does not work.
  2. Increasing sub-steps. Does not work.
  3. Compiling and running in Debug mode. The output is below:

(2147483647,2147483647,2147483647,0) is out of bound (-3:98,-3:99,-3:67,0:0)
/src/WarpX-p/build/_deps/fetchedamrex-src/Src/Base/AMReX.H:156: void amrex::Abort(const char *): block: [93,0,0], thread: [199,0,0] Assertion 0 failed.
amrex::Abort::0::CUDA error 710 in file /src/WarpX-p/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 652: device-side assert triggered !!! SIGABRT

===== TinyProfilers ======
main()
REG::WarpX::Evolve()
WarpX::Evolve()
WarpX::Evolve::step
WarpX::HybridPICEvolveFields()
WarpXParticleContainer::DepositCurrent::CurrentDeposition

Not much more information to go off of. I could compile with more AMReX debug flags, but the code becomes incredibly slow.

The simulation has no initial grid fields: B and E are zero initially. I only inject particles at t=0 with velocity, position, and weight. I have checked the files and, as far as I can tell, all particles have correct values initially in the h5 file.

One interesting thing I have noticed when running the simulation is that the B field does not evolve: it stays zero throughout the simulation, while the E field does evolve. Perhaps the B field is being updated in the simulation but a zero B field is written to the h5 files. I have checked out the newest dev version of WarpX as of today (2024-10-22).

aeriforme commented 4 days ago

Still very out of bounds!

Could you attach the input file and the backtraces, please?

roelof-groenewald commented 4 days ago

@kli-jfp thanks for trying with the higher time resolution. I believe the error is due to field values that become NaN, which happens in the B-field advance due to unstable Whistler waves. Here is a small snippet describing the time-step constraint in Ohm's law from another code (not WarpX):

[attached image: time-step constraint for the Ohm's law solver from another code]

How do you treat the resistivity? Could you try using a higher resistivity value? And maybe adding a finite hyper-resistivity?
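
Something along these lines in the input deck is what I have in mind (the values are placeholders, just to show the parameters):

# constant plasma resistivity (can also be written as a function of rho and J)
hybrid_pic_model.plasma_resistivity(rho,J) = 1.0e-6
# small hyper-resistivity to damp grid-scale whistler noise
hybrid_pic_model.plasma_hyper_resistivity = 1.0e-8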

It is weird that the B-field is not being updated. Just to confirm: you get field output from a step during the simulation (after step 1) but before the crash, and it shows zero B-field values in all components?

kli-jfp commented 4 days ago

Yes @aeriforme, here are the files (input & backtrace): Backtrace.txt, input.txt. The protons.h5 file is a Maxwellian plasma blob in the center where all particles have T = 1000 eV (so that the error occurs faster; if the temperature is lower the error still occurs, but at later time steps).

The same error can be produced with protons.injection_style = NUniformPerCell instead of read-from-file: input_alt.txt

hybrid_pic_model.elec_temp = 1000
hybrid_pic_model.n0_ref = 0.5e20
hybrid_pic_model.n_floor = 0.5e19

protons.charge = q_e
protons.mass = m_p
protons.injection_style = NUniformPerCell
protons.num_particles_per_cell_each_dim = 2 2 2
protons.momentum_distribution_type = "gaussian"
protons.ux_th = 0.001032370536575359
protons.uy_th = 0.001032370536575359
protons.uz_th = 0.001032370536575359
protons.profile = parse_density_function
protons.density_function(x,y,z) = "if((x**2 + y**2 + z**2)**0.5 < 0.3, 0.5e20, 0.0)"

@roelof-groenewald Yes, to clarify: if I don't set any initial B field (so it starts at zero), all of the h5 files have a zero B field (all components) throughout the simulation. For some reason WarpX is either not evolving the B field, or the write process is writing a zero B field.

This error seems to occur for all types of parameters. I have set the resistivity/hyper-resistivity, etc., but it always occurs. I have tried reinstalling the WarpX environment many times... but perhaps there is some version mismatch going on.

Just to make sure, are these conda commands correct?

conda create -n warpx-dev -c conda-forge blaspp boost ccache cmake compilers git "heffte=*=mpi_mpich*" lapackpp "openpmd-api=*=mpi_mpich*" openpmd-viewer python make numpy pandas scipy yt "fftw=*=mpi_mpich*" pkg-config matplotlib mamba mpich mpi4py ninja pip virtualenv
conda install -c conda-forge libgomp
conda install -c nvidia -c conda-forge cuda cuda-nvtx-dev cupy

Edit: I have now installed everything fresh on a separate computer and the error is reproduced.

kli-jfp commented 2 days ago

I have now tested compiling with double precision and the "out of bounds" error has disappeared (so far). The reduced diagnostic of the magnetic energy seems to have a bug: the magnetic energy it reports is negative, and not equal to the actual energy when I check the B-field values manually.
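
For reference, this is the kind of reduced diagnostic I mean (the name and interval here are only illustrative):

# reduced diagnostic that sums the field energy over the domain every N steps
warpx.reduced_diags_names = field_energy
field_energy.type = FieldEnergy
field_energy.intervals = 100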

roelof-groenewald commented 1 day ago

That's interesting that the issue seems to be related to the precision. Were you using single precision for both particles and fields before? I wonder if the problem still occurs when using double precision for the fields but single precision for the particles?
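
If I remember the build options correctly, that combination can be selected at configure time with something like:

# double-precision fields with single-precision particles (sketch; adjust other options to your setup)
cmake -S . -B build -DWarpX_PRECISION=DOUBLE -DWarpX_PARTICLE_PRECISION=SINGLE
cmake --build build -j 8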

I'll take a look at the reduced diagnostic for the magnetic energy to see if I spot anything that looks like it would be problematic.