weipengyao opened this issue 3 years ago
Dear @weipengyao,
I am studying your issue and so far I could not reproduce your problem.
If you use a supercomputer, can you show me your launch script (the one you use to launch the simulation) or the exact configuration that you use (number of MPI tasks, OpenMP threads...)?
Thank you
Dear @xxirii,
Thanks for your time and reply.
I checked again with the attached namelist and found that this error occurred (at timestep 200) with 160 cores, but not with 40 cores (where it might still happen later).
I am running this on the Niagara supercomputer, and I launch with smilei.sh 160 test.py on the debug cluster, with 4 nodes (40 cores per node).
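In case it helps, a minimal Slurm script equivalent to this launch might look like the sketch below (the SBATCH directives, module list, and executable path are my assumptions, not the actual content of smilei.sh):
#!/bin/bash
#SBATCH --nodes=4               # 4 Niagara nodes, 40 cores each
#SBATCH --ntasks-per-node=40    # one MPI task per core, 160 tasks total
#SBATCH --cpus-per-task=1       # pure-MPI run, 1 OpenMP thread per task
#SBATCH --time=01:00:00
module load intel/2019u3 intelmpi/2019u3 hdf5/1.10.5 python/3.6.8
export OMP_NUM_THREADS=1
mpirun -np 160 ./smilei test.py   # run the namelist with 160 MPI tasks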
From the attached a.out.txt, you may notice that I just use:
...
Initializing MPI
--------------------------------------------------------------------------------
MPI_THREAD_MULTIPLE enabled
Number of MPI process : 160
Number of patches :
dimension 0 - number_of_patches : 128
dimension 1 - number_of_patches : 128
Patch size :
dimension 0 - n_space : 20 cells.
dimension 1 - n_space : 20 cells.
Dynamic load balancing: never
OpenMP
--------------------------------------------------------------------------------
Number of thread per MPI process : 1
...
Let me know if you need anything else.
Best, Yao
Thank you. Do you use any particular OMP environment variables, like a specific schedule or thread placement?
I don't think I do.
Here's the script I use to compile Smilei on Niagara (hope it can help anyone else using Smilei on Niagara): compile_smilei_niagara.sh.txt
To save you the time of downloading it, it reads:
module purge
module load NiaEnv/2019b
module load intel/2019u3
module load intelmpi/2019u3
module load hdf5/1.10.5
module load python/3.6.8
export HDF5_ROOT_DIR=/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5
export PYTHONEXE=python3
export OMP_NUM_THREADS=1
export OMP_SCHEDULE=dynamic
export OMP_PROC_BIND=true
export OMPI_MCA_btl_portals4_use_rdma=0
# For MPI-tags:
export MPIR_CVAR_CH4_OFI_TAG_BITS=26
export MPIR_CVAR_CH4_OFI_RANK_BITS=13
I only have something 'special' for MPI-tag related issues (#307).
I checked my ~/.bashrc, and I don't have anything related there.
Do you think I need to check any other possible places?
Thanks!
Thank you,
I have managed to reproduce the bug using exactly your configuration. It does not appear when you use a hybrid mode with more than 1 OpenMP thread per MPI task. I will investigate, but you should be able to run your case in hybrid mode if you need the results soon for science.
Moreover, in my case, I have an HDF5 issue when I use the variable debug_every in the Collisions block. So if you have the same issue, you can comment it out.
For instance, using 16 MPI tasks and 10 OpenMP threads per task, I am at iteration 3700 after 8 minutes.
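As a sketch, a hybrid launch along these lines should work (the exact mpirun/srun flags depend on your MPI stack and scheduler, so adjust as needed):
export OMP_NUM_THREADS=10       # 10 OpenMP threads per MPI task
export OMP_SCHEDULE=dynamic     # same OpenMP settings as in your compile script
export OMP_PROC_BIND=true       # keep threads pinned to their cores
mpirun -np 16 ./smilei test.py  # 16 MPI tasks x 10 threads = 160 cores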
Dear @xxirii,
Thanks for the timely reply.
For me, I need to use ten times as many cores, i.e. 1600, with more particles, like ppc=256 (in order to suppress the noise).
It seems that this crash appears when the core number is above a certain value (which would explain why the 16x10 scheme works).
About the debug_every related HDF5 issue, I don't have it in my case, for now. But I remember that when I tried to use multiple species in the Collisions block a long time ago, there was a problem (see #307).
I hope it helps.
Right, it's surprising to see that it works with 159 MPI tasks and segfaults with 160. Very strange.
Note that the bug only occurs when I use exactly 160 cores. When I use more, it seems to work. Have you tried a case with more ppc and more MPI tasks that crashes?
Yes, I have. Please see this output file for example.
HEB2D_dep2_Inj128_Z10_T100_np1_Th1k_FixIon_SBC_Collee.py-4673320.out.txt
Description
I am using the injection module for hot electron transport in a solid target.
When the temperature of the injected electron beam is high, like Te=100 keV, the code runs hundreds of steps and then crashes with the error
double free or corruption (out): 0x0000000003b5f6f0 ***
whereas when I reduce the temperature, e.g. Te=50 eV, the code runs fine (at least within the simulation time). Please find the related output files here:
a.out.txt test.py.txt a.err.txt
Steps to reproduce the problem
To reproduce the problem, just use the namelist above and compare the two cases with different temperatures.
And this information about iterator validity might be helpful.
Parameters
make env gives: