SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Simulation does not run #291

Closed juliencelia closed 3 years ago

juliencelia commented 4 years ago

Dear Smilei experts,

Hope all of you are fine!

I have a simulation that does not start. I am afraid the simulation may be too big (25000 * 20000 cells in 2D), but I am not sure. The error message is:

Invalid knl_memoryside_cache header, expected "version: 1".
[irene3354][[26206,0],315][btl_portals4_component.c:1115] mca_btl_portals4_component_progress_event() ERROR 0: PTL_EVENT_ACK with ni_fail_type 10 (PTL_NI_TARGET_INVALID) with target (nid=508,pid=73) and initator (nid=507,pid=73) found
Stack trace (most recent call last):

14 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in

13 Object "./smileiKNL", at 0x458568, in

12 Object "/lib64/libc.so.6", at 0x2b3e9d86f544, in __libc_start_main

11 Object "./smileiKNL", at 0x8f379f, in main

10 Object "./smileiKNL", at 0x6e93ab, in Params::Params(SmileiMPI*, std::vector<std::string, std::allocator >)

9 Object "/opt/selfie-1.0.2/lib64/selfie.so", at 0x2b3e9b907ab7, in MPI_Barrier

8 Object "/ccc/products/openmpi-2.0.4/intel--17.0.6.256/default/lib/libmpi.so.20", at 0x2b3e9ccdaea0, in MPI_Barrier

7 Object "/ccc/products/openmpi-2.0.4/intel--17.0.6.256/default/lib/libmpi.so.20", at 0x2b3e9cd15a82, in ompi_coll_base_barrier_intra_bruck

6 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_pml_ob1.so", at 0x2b3ea7b527a6, in mca_pml_ob1_send

5 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/libopen-pal.so.20", at 0x2b3e9ff69330, in opal_progress

4 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_btl_portals4.so", at 0x2b3ea5fd384d, in mca_btl_portals4_component_progress

3 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_btl_portals4.so", at 0x2b3ea5fd3a59, in mca_btl_portals4_component_progress_event

2 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/libmca_common_portals4.so.20", at 0x2b3ea61defd8, in common_ptl4_printf_error

1 Object "/lib64/libc.so.6", at 0x2b3e9d884a67, in abort

0 Object "/lib64/libc.so.6", at 0x2b3e9d883377, in gsignal

Aborted (Signal sent by tkill() 150381 35221)

The simulation stops at:
HDF5 version 1.8.20
Python version 2.7.14
Parsing pyinit.py
Parsing v4.4-706-gb5c12a5a-master
Parsing pyprofiles.py
Parsing BNH2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist

The version of Smilei is : v4.4-706-gb5c12a5a-master

Thanks for your help. Here is the input:

BNH2d.txt
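
For scale, the grid size quoted in this post (25000 * 20000 cells) can be turned into a rough memory figure. The sketch below is plain arithmetic and assumes 8 bytes per value (double precision); how many such arrays Smilei or the namelist actually holds per process is not stated in this thread.

```python
# Rough memory footprint of ONE full-grid array of doubles.
# Assumption: 8 bytes per value (float64); the number of arrays held is unknown.
nx, ny = 25000, 20000      # cell counts quoted in the post
bytes_per_value = 8        # assumption: double precision

cells = nx * ny
gib_per_array = cells * bytes_per_value / 2**30

print(f"{cells:,} cells, about {gib_per_array:.2f} GiB per full-grid array")
```

Any array of that size built on every MPI process, without being distributed, costs this much memory per process.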

jderouillat commented 3 years ago

This environment does not try to load a libmpi.so.20 but a libmpi.so.40.
Can you post the result of ldd smilei ?

juliencelia commented 3 years ago

/ccc/work/cont003/ra5390/bonvalej/Smilei/Smilei_hub/smileirome: error while loading shared libraries: libhdf5.so.10: cannot open shared object file: No such file or directory

jderouillat commented 3 years ago

I can't access your directory (ask the hotline to add my login to your project if you want), and even if I could, the result depends on the environment set when the command is executed.
Could you answer the question? If the question is not clear, tell me.

juliencelia commented 3 years ago

Here is what I get for the binary generated by the hotline:

linux-vdso.so.1 =>  (0x00007ffe715cc000)
libhdf5.so.10 => not found
libpython3.7m.so.1.0 => not found
libm.so.6 => /lib64/libm.so.6 (0x00002b989a4cd000)
libmpi_cxx.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi_cxx.so.40 (0x00002b989a7cf000)
libmpi.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi.so.40 (0x00002b989a9eb000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00002b989ad26000)
libiomp5.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libiomp5.so (0x00002b989b02d000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b989b422000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b989b638000)
libc.so.6 => /lib64/libc.so.6 (0x00002b989b854000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b989bc22000)
/lib64/ld-linux-x86-64.so.2 (0x00002b989a2a9000)
libopen-rte.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-rte.so.40 (0x00002b989be26000)
libopen-pal.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-pal.so.40 (0x00002b989c0eb000)
librt.so.1 => /lib64/librt.so.1 (0x00002b989c3b0000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b989c5b8000)
libz.so.1 => /lib64/libz.so.1 (0x00002b989c7bb000)
libhwloc.so.15 => /ccc/products/hwloc-2.0.4/system/default/lib/libhwloc.so.15 (0x00002b989c9d1000)
libudev.so.1 => /lib64/libudev.so.1 (0x00002b989cc1c000)
libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00002b989ce32000)
libxml2.so.2 => /lib64/libxml2.so.2 (0x00002b989d03c000)
libevent-2.0.so.5 => /lib64/libevent-2.0.so.5 (0x00002b989d3a6000)
libevent_pthreads-2.0.so.5 => /lib64/libevent_pthreads-2.0.so.5 (0x00002b989d5ee000)
libimf.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libimf.so (0x00002b989d7f1000)
libirng.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libirng.so (0x00002b989de76000)
libcilkrts.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libcilkrts.so.5 (0x00002b989e1e1000)
libintlc.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libintlc.so.5 (0x00002b989e41e000)
libsvml.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libsvml.so (0x00002b989e690000)
libcap.so.2 => /lib64/libcap.so.2 (0x00002b98a011c000)
libdw.so.1 => /lib64/libdw.so.1 (0x00002b98a0321000)
liblzma.so.5 => /lib64/liblzma.so.5 (0x00002b98a0572000)
libattr.so.1 => /lib64/libattr.so.1 (0x00002b98a0798000)
libelf.so.1 => /lib64/libelf.so.1 (0x00002b98a099d000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00002b98a0bb5000)

And for my version: ldd smileirome

linux-vdso.so.1 => (0x00007ffeb3483000)
libhdf5.so.10 => not found
libpython2.7.so.1.0 => /lib64/libpython2.7.so.1.0 (0x00002b77bbd5e000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b77bc12a000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b77bc346000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b77bc54a000)
libm.so.6 => /lib64/libm.so.6 (0x00002b77bc74d000)
libmpi_cxx.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi_cxx.so.40 (0x00002b77bca4f000)
libmpi.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi.so.40 (0x00002b77bcc6b000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00002b77bcfa6000)
libiomp5.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libiomp5.so (0x00002b77bd2ad000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b77bd6a2000)
libc.so.6 => /lib64/libc.so.6 (0x00002b77bd8b8000)
/lib64/ld-linux-x86-64.so.2 (0x00002b77bbb3a000)
libopen-rte.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-rte.so.40 (0x00002b77bdc86000)
libopen-pal.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-pal.so.40 (0x00002b77bdf4b000)
librt.so.1 => /lib64/librt.so.1 (0x00002b77be210000)
libz.so.1 => /lib64/libz.so.1 (0x00002b77be418000)
libhwloc.so.15 => /ccc/products/hwloc-2.0.4/system/default/lib/libhwloc.so.15 (0x00002b77be62e000)
libudev.so.1 => /lib64/libudev.so.1 (0x00002b77be879000)
libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00002b77bea8f000)
libxml2.so.2 => /lib64/libxml2.so.2 (0x00002b77bec99000)
libevent-2.0.so.5 => /lib64/libevent-2.0.so.5 (0x00002b77bf003000)
libevent_pthreads-2.0.so.5 => /lib64/libevent_pthreads-2.0.so.5 (0x00002b77bf24b000)
libimf.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libimf.so (0x00002b77bf44e000)
libirng.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libirng.so (0x00002b77bfad3000)
libcilkrts.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libcilkrts.so.5 (0x00002b77bfe3e000)
libintlc.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libintlc.so.5 (0x00002b77c007b000)
libsvml.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libsvml.so (0x00002b77c02ed000)
libcap.so.2 => /lib64/libcap.so.2 (0x00002b77c1d79000)
libdw.so.1 => /lib64/libdw.so.1 (0x00002b77c1f7e000)
liblzma.so.5 => /lib64/liblzma.so.5 (0x00002b77c21cf000)
libattr.so.1 => /lib64/libattr.so.1 (0x00002b77c23f5000)
libelf.so.1 => /lib64/libelf.so.1 (0x00002b77c25fa000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00002b77c2812000)
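
Long ldd listings like these are easy to misread, so here is a minimal Python sketch that extracts only the entries reported as "not found". The helper names `missing_libraries` and `run_ldd` are hypothetical, and the sample text is copied from the first lines of the listing for the hotline binary.

```python
import subprocess

def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=>" in line and "not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing

def run_ldd(binary_path):
    """Run ldd on a binary and return its text output (Linux only)."""
    result = subprocess.run(["ldd", binary_path], capture_output=True, text=True)
    return result.stdout

# Sample copied from the listing for the hotline binary.
sample = """\
linux-vdso.so.1 =>  (0x00007ffe715cc000)
libhdf5.so.10 => not found
libpython3.7m.so.1.0 => not found
libm.so.6 => /lib64/libm.so.6 (0x00002b989a4cd000)
"""

print(missing_libraries(sample))
```

As noted earlier in the thread, the result depends on the environment (loaded modules, LD_LIBRARY_PATH) in which ldd is executed, so `run_ldd` should be called in the same environment as the job.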

jderouillat commented 3 years ago

You need to reinstall mpi4py in the targeted environment.
(You can also try to do without it; the problem that you observed on KNL could be less critical on a more conventional architecture.)

juliencelia commented 3 years ago

It seems to work now ;) I am happy! Just a general question: what is Smilei doing during "parsing input.py"?

mccoys commented 3 years ago

Just a general question: what is Smilei doing during "parsing input.py"?

It reads the namelist!

jderouillat commented 3 years ago

Great!

More precisely, it runs the namelist as a Python script. In your case, it reads the hydro file and interpolates the quantities it reads.
This can take time.
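
Since the namelist is executed as an ordinary Python script, any top-level work in it runs during the parsing stage. A minimal sketch of how such work could be timed from inside the namelist follows; the `timed` helper and `slow_profile_setup` are hypothetical names, and the latter is only a stand-in for reading and interpolating a hydro file.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print start/end messages with the elapsed time; flush so they reach the job output promptly."""
    start = time.perf_counter()
    print(f"[namelist] start {label}", flush=True)
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"[namelist] done  {label} in {elapsed:.2f} s", flush=True)

def slow_profile_setup(n):
    # Hypothetical stand-in for reading a hydro file and interpolating it.
    return sum(i * i for i in range(n))

with timed("profile setup"):
    result = slow_profile_setup(100_000)
```

Messages printed this way make it possible to tell a slow namelist apart from a deadlock when the output seems stuck.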

juliencelia commented 3 years ago

Yes, it seems to take long. Wait and see. Actually, Smilei works on IRENE. The environment used is:

module purge
module load intel/19.0.5.281
module load mpi/openmpi/4.0.2
module load flavor/hdf5/parallel hdf5/1.8.20
export HDF5_ROOT_DIR=${HDF5_ROOT}
export PYTHONEXE=${PYTHON3_EXEDIR}
module load python3/3.7.5

To compile, I set the no_mpi_tm config option as you advised.

To use SciPy, the hotline added the following before ccc_mprun: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PYTHON3_ROOT/lib

juliencelia commented 3 years ago

I am afraid I have a problem again: my output has been stuck at "python preprocess function does not exist" for an hour... Could that be normal?

jderouillat commented 3 years ago

You could try the binary I compiled, with a script inspired by mine, especially regarding the modules used (with an mpi4py installed in this environment).
Both are available in /ccc/work/cont003/smilei/derouilj/Issue291.

In this directory you will also find a namelist derived from yours and the outputs of a 10-minute single-node run (the idea was to check that the Python functions are executed). A few interpolations are not performed during these 10 minutes (52 interp_prof are printed while 56 are expected), but they operate on a very large grid (25600 x 20480) without being distributed. To speed things up, you can do the first interpolation with one process while another MPI process does another interpolation ...

But first, can you confirm whether this test goes further than yours?
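
The suggestion to let different MPI processes handle different interpolations could look roughly like the sketch below. It is only an illustration under assumptions: `tasks_for_rank` and `interpolate` are hypothetical names, the count of 56 comes from the run described above, and mpi4py is assumed to be built against the same MPI library as the Smilei binary (the sketch falls back to a serial run if the import fails).

```python
def tasks_for_rank(n_tasks, rank, size):
    """Round-robin split: rank r handles tasks r, r + size, r + 2 * size, ..."""
    return list(range(rank, n_tasks, size))

try:
    # Assumption: mpi4py is installed and linked against the MPI used by the binary.
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    comm, rank, size = None, 0, 1  # serial fallback

n_interpolations = 56  # number of interpolations expected in the test run
my_tasks = tasks_for_rank(n_interpolations, rank, size)

def interpolate(task_id):
    # Hypothetical placeholder for one interpolation of hydro data onto the grid.
    return task_id * 2

local_results = {task: interpolate(task) for task in my_tasks}

if comm is not None:
    # Exchange results so that every rank ends up with all of them.
    results = {}
    for part in comm.allgather(local_results):
        results.update(part)
else:
    results = local_results

print(f"rank {rank}/{size} computed {len(my_tasks)} of {n_interpolations} interpolations", flush=True)
```

The lowercase `allgather` pickles Python objects, which suits small results; full-grid arrays of the size discussed here would likely need a different exchange strategy, which this sketch does not cover.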

juliencelia commented 3 years ago

I copied your folder and your binary, and added the hydro.txt file to the folder.

I still have this issue: ImportError: libmpi.so.20: cannot open shared object file: No such file or directory

   linux-vdso.so.1 =>  (0x00007ffc8b7ad000)
    /opt/selfie-1.0.2/lib64/selfie.so (0x00002ac55ce98000)
    libhdf5.so.10 => /ccc/products/hdf5-1.8.20/intel--19.0.5.281__openmpi--4.0.1/parallel/lib/libhdf5.so.10 (0x00002ac55d11f000)
    libpython2.7.so.1.0 => /ccc/products/python-2.7.14/intel--17.0.4.196__openmpi--2.0.2/default/lib/libpython2.7.so.1.0 (0x00002ac55d6de000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac55dda0000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002ac55dfbc000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002ac55e1c0000)
    libm.so.6 => /lib64/libm.so.6 (0x00002ac55e3c3000)
    libmpi_cxx.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi_cxx.so.40 (0x00002ac55e6c5000)
    libmpi.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi.so.40 (0x00002ac55e8e1000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00002ac55ec1c000)
    libiomp5.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libiomp5.so (0x00002ac55ef23000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ac55f318000)
    libc.so.6 => /lib64/libc.so.6 (0x00002ac55f52e000)
    libyaml-0.so.2 => /lib64/libyaml-0.so.2 (0x00002ac55f8fc000)
    libz.so.1 => /ccc/products/python-2.7.14/intel--17.0.4.196__openmpi--2.0.2/default/lib/libz.so.1 (0x00002ac55fb1c000)
    libimf.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libimf.so (0x00002ac55fe4b000)
    libsvml.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libsvml.so (0x00002ac5604d0000)
    libirng.so => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libirng.so (0x00002ac561f5c000)
    libintlc.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libintlc.so.5 (0x00002ac5622c7000)
    libirc.so => /ccc/products2/ifort-17.0.4.196/Atos_7__x86_64/system/default/lib/intel64/libirc.so (0x00002ac562539000)
    /lib64/ld-linux-x86-64.so.2 (0x00002ac55cc74000)
    libopen-rte.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-rte.so.40 (0x00002ac5627a3000)
    libopen-pal.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libopen-pal.so.40 (0x00002ac562a68000)
    librt.so.1 => /lib64/librt.so.1 (0x00002ac562d2d000)
    libhwloc.so.15 => /ccc/products/hwloc-2.0.4/system/default/lib/libhwloc.so.15 (0x00002ac562f35000)
    libudev.so.1 => /lib64/libudev.so.1 (0x00002ac563180000)
    libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00002ac563396000)
    libxml2.so.2 => /ccc/products/python-2.7.14/intel--17.0.4.196__openmpi--2.0.2/default/lib/libxml2.so.2 (0x00002ac5635a0000)
    libevent-2.0.so.5 => /lib64/libevent-2.0.so.5 (0x00002ac563c60000)
    libevent_pthreads-2.0.so.5 => /lib64/libevent_pthreads-2.0.so.5 (0x00002ac563ea8000)
    libcilkrts.so.5 => /ccc/products/ifort-19.0.5.281/system/default/19.0.5.281/lib/intel64/libcilkrts.so.5 (0x00002ac5640ab000)
    libcap.so.2 => /lib64/libcap.so.2 (0x00002ac5642e8000)
    libdw.so.1 => /lib64/libdw.so.1 (0x00002ac5644ed000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x00002ac56473e000)
    libattr.so.1 => /lib64/libattr.so.1 (0x00002ac564964000)
    libelf.so.1 => /lib64/libelf.so.1 (0x00002ac564b69000)
    libbz2.so.1 => /lib64/libbz2.so.1 (0x00002ac564d81000)
Version : v4.4-784-gc3f8cc81-master

Reading the simulation parameters

HDF5 version 1.8.20
Python version 2.7.14
Parsing pyinit.py
Parsing v4.4-784-gc3f8cc81-master
Parsing pyprofiles.py
Parsing BNH2d.py
On rank 12 [Python] ImportError: libmpi.so.20: cannot open shared object file: No such file or directory
ERROR src/Params/Params.cpp:1283 (runScript) error parsing BNH2d.py
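
The ImportError above asks for libmpi.so.20, while the ldd listing of the binary resolves libmpi.so.40. A small Python sketch can make that mismatch explicit. The helper names are hypothetical, and the file pattern `MPI*.so` for the compiled mpi4py module is an assumption about how mpi4py is installed.

```python
import glob
import importlib.util
import os
import re

def mpi_sonames(text):
    """Collect the distinct libmpi.so.<N> names found in ldd output or an error message."""
    return sorted(set(re.findall(r"libmpi\.so\.\d+", text)))

def find_mpi4py_extension():
    """Locate mpi4py's compiled MPI module without importing it (None if not found).

    Assumption: the extension file is named MPI*.so inside the mpi4py package directory.
    """
    spec = importlib.util.find_spec("mpi4py")
    if spec is None or not spec.submodule_search_locations:
        return None
    for directory in spec.submodule_search_locations:
        matches = glob.glob(os.path.join(directory, "MPI*.so"))
        if matches:
            return matches[0]
    return None

# Lines copied from this thread.
binary_ldd = "libmpi.so.40 => /ccc/products/openmpi-4.0.2/intel--19.0.5.281/default/lib/libmpi.so.40 (0x00002ac55e8e1000)"
python_error = "ImportError: libmpi.so.20: cannot open shared object file: No such file or directory"

if mpi_sonames(binary_ldd) != mpi_sonames(python_error):
    print("MPI mismatch:", mpi_sonames(binary_ldd), "vs", mpi_sonames(python_error))
```

Running ldd on the file returned by `find_mpi4py_extension()` in the job environment would show which libmpi the installed mpi4py expects.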

jderouillat commented 3 years ago

This morning you were using another Python environment. Can you confirm that you reinstalled the mpi4py module in this environment?

juliencelia commented 3 years ago

No, I did not. It is the hotline that compiled Smilei with the openmpi environment and python/3.7.

I tried a simple run with a simulation of a 2D gaussian laser in an empty box. Smilei works with this configuration.

jderouillat commented 3 years ago

Hi Julien, I know that the situation has not completely stabilized since this issue was opened, but the problem has evolved a lot (KNL, Rome, MPI, Python, deadlocks ...) and it runs. I suggest closing this issue and, if necessary, opening a new one dedicated to any new problem.