SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Simulation does not run #291

Closed juliencelia closed 3 years ago

juliencelia commented 3 years ago

Dear Smilei experts,

Hope all of you are fine!

I have a simulation that does not start. I am afraid the simulation may be too big (25000 × 20000 cells in 2D), but I am not sure. The error message is:

Invalid knl_memoryside_cache header, expected "version: 1".
[irene3354][[26206,0],315][btl_portals4_component.c:1115] mca_btl_portals4_component_progress_event() ERROR 0: PTL_EVENT_ACK with ni_fail_type 10 (PTL_NI_TARGET_INVALID) with target (nid=508,pid=73) and initator (nid=507,pid=73) found
Stack trace (most recent call last):

14 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in

13 Object "./smileiKNL", at 0x458568, in

12 Object "/lib64/libc.so.6", at 0x2b3e9d86f544, in __libc_start_main

11 Object "./smileiKNL", at 0x8f379f, in main

10 Object "./smileiKNL", at 0x6e93ab, in Params::Params(SmileiMPI*, std::vector<std::string, std::allocator >)

9 Object "/opt/selfie-1.0.2/lib64/selfie.so", at 0x2b3e9b907ab7, in MPI_Barrier

8 Object "/ccc/products/openmpi-2.0.4/intel--17.0.6.256/default/lib/libmpi.so.20", at 0x2b3e9ccdaea0, in MPI_Barrier

7 Object "/ccc/products/openmpi-2.0.4/intel--17.0.6.256/default/lib/libmpi.so.20", at 0x2b3e9cd15a82, in ompi_coll_base_barrier_intra_bruck

6 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_pml_ob1.so", at 0x2b3ea7b527a6, in mca_pml_ob1_send

5 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/libopen-pal.so.20", at 0x2b3e9ff69330, in opal_progress

4 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_btl_portals4.so", at 0x2b3ea5fd384d, in mca_btl_portals4_component_progress

3 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/openmpi/mca_btl_portals4.so", at 0x2b3ea5fd3a59, in mca_btl_portals4_component_progress_event

2 Object "/opt/mpi/openmpi-icc/2.0.4.5.10.xcea/lib/libmca_common_portals4.so.20", at 0x2b3ea61defd8, in common_ptl4_printf_error

1 Object "/lib64/libc.so.6", at 0x2b3e9d884a67, in abort

0 Object "/lib64/libc.so.6", at 0x2b3e9d883377, in gsignal

Aborted (Signal sent by tkill() 150381 35221)

The simulation stops at:

HDF5 version 1.8.20
Python version 2.7.14
Parsing pyinit.py
Parsing v4.4-706-gb5c12a5a-master
Parsing pyprofiles.py
Parsing BNH2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist

The version of Smilei is : v4.4-706-gb5c12a5a-master

Thanks for your help. Here is the input:

BNH2d.txt

juliencelia commented 3 years ago

I launched this simulation on IRENE KNL with 200 nodes, 800 MPI ranks and 32 OpenMP threads per MPI rank.

jderouillat commented 3 years ago

Hi Julien,

Is your problem reproducible?

It's crashing during the parsing of the namelist.
The next log line should be:

         Calling python _smilei_check

Either there is a problem with one node that did not start the program correctly (the crash happened in the first MPI_Barrier of the program), or there is a problem executing the Python program simultaneously on all nodes.

Julien

juliencelia commented 3 years ago

Hi Julien

I tried 3 times and it always crashes at the same point. Before that, I tried with 2500 × 2000 cells just to check the initial conditions, and it worked...

jderouillat commented 3 years ago

Was the 2500 × 2000 configuration submitted with the same resource distribution (200 nodes, 800 MPI ranks and 32 OpenMP threads per MPI rank)?

juliencelia commented 3 years ago

No, just with 40 nodes.

jderouillat commented 3 years ago

The fact that it is reproducible points to the second hypothesis, but last year I ran simulations on up to 512 KNL nodes on Irene...

Could you send me all the log and error files?

juliencelia commented 3 years ago

BNH2d.txt

hydro.txt

err.txt out.txt job_sub.txt

juliencelia commented 3 years ago

Thanks Julien for your help;)

jderouillat commented 3 years ago

Could you check whether it goes further if you don't read the hydro file?

juliencelia commented 3 years ago

Yes, it goes further, until:

Initializing MPI

On rank 0 [Python] NameError: global name 'profilene_BN' is not defined
ERROR src/Profiles/Profile.cpp:276 (Profile) Profile nb_density eon: does not seem to return a correct value

(which is expected, since hydro.txt was not read)

I tried with 40 nodes (160 MPI ranks).

jderouillat commented 3 years ago

OK, but it was already fine on 40 nodes; the question was about 200 nodes.

juliencelia commented 3 years ago

With 200 nodes it stops at the error above, so it does go much further...

juliencelia commented 3 years ago

Hi Julien,

So do you think I should contact the TGCC hotline for help? It looks more like a problem between the machine and Python, no?

Thanks

Julien

jderouillat commented 3 years ago

Not really, it's a problem in the namelist, in which you ask all processes to read the same file simultaneously. This is known to be bad practice.
You should read the file with only one process and then broadcast it to all the others.
Replace:

x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)

With something like:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = 0
if rank==0:
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)
comm.bcast( x_prof, root = 0)
...

juliencelia commented 3 years ago

Thanks Julien, I will study that. I am quite puzzled, or rather I don't understand, because I often run this kind of simulation reading hydro files for gas targets, with simulations of 1 mm × 60 µm (20000 × 1200 cells), and it worked perfectly on 200 nodes / 800 MPI ranks and 16 OpenMP threads. For this simulation the number of cells is larger, but not the number of nodes...

jderouillat commented 3 years ago

The more data there is to read, the more likely this problem is to appear.

During our last internal Smilei meeting (a few hours before this issue was created!), we discussed how to provide a benchmark for this kind of problem, which lies at the boundary of the code itself because of the Python namelists.

jderouillat commented 3 years ago

Be careful with the mpi4py module that you use: it must be compatible with your main MPI library.
The best thing is to recompile it for your environment; you can download it from Bitbucket (git clone https://bitbucket.org/mpi4py/mpi4py.git).
Then, to compile it:

$ python setup.py build
$ python setup.py install --user
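
A quick sanity check (a sketch, not an official recipe: mpi4py.get_config() reports the compiler and MPI used at build time, and the MPI.so path below is only an example for a --user install):

$ python -c "import mpi4py; print(mpi4py.get_config())"
$ ldd ~/.local/lib/python2.7/site-packages/mpi4py/MPI.so | grep libmpi   # example path, adapt to your install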

And some corrections to the namelist:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
x_prof = np.full(6600, 0.)
y_prof = np.full(6600, 0.)
...
if rank==0 :
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof =np.loadtxt('hydro.txt', unpack=True)
x_prof = comm.bcast( x_prof, root = 0)
y_prof = comm.bcast( y_prof, root = 0)
...
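
A note on this pattern (my addition, based on standard mpi4py behaviour): the lowercase comm.bcast pickles the object and returns it, so the assignment x_prof = comm.bcast(x_prof, root=0) is what actually updates the non-root ranks; the pre-allocation only guarantees that the names exist everywhere. Alternatively, the uppercase comm.Bcast fills a pre-allocated NumPy buffer in place. A minimal sketch (the 6600 comes from the example above, the usecols read is hypothetical):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Pre-allocate the receive buffer on every rank (6600 points, as above).
x_prof = np.empty(6600, dtype='float64')
if rank == 0:
    # Hypothetical: read only the first column of hydro.txt on rank 0.
    x_prof[:] = np.loadtxt('hydro.txt', usecols=(0,))
comm.Bcast(x_prof, root=0)   # buffer-based broadcast: fills x_prof in place on all ranks
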
juliencelia commented 3 years ago

I tried this morning for several timesteps at low resolution and it works with:

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank==0:
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)
else:
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.empty(6600,dtype='float64')

comm.bcast( x_prof, root = 0)
comm.bcast( y_prof, root = 0)
comm.bcast( Te_prof, root = 0)
comm.bcast( Ti_prof, root = 0)
comm.bcast( ne_prof, root = 0)
comm.bcast( vx0_prof, root = 0)
comm.bcast( vy0_prof, root = 0)

Do you think I have to recompile anyway?

jderouillat commented 3 years ago

If it works, no. The mpi4py module should be compatible with the MPI library; otherwise, it would crash.

juliencelia commented 3 years ago

I am sorry Julien, but the simulation still crashes at the same point, even with the broadcast... I will decrease the number of nodes and run some tests... I will come back to tell you what I find.

jderouillat commented 3 years ago

Now you can contact the hotline.
Do not hesitate to put me in cc.

jderouillat commented 3 years ago

I have just remembered a Smilei case which deadlocked on Irene KNL with large Python arrays.
It was worked around by using smaller patches, but we never really solved it.

While waiting for a better solution, a workaround could consist of splitting hydro.txt into several files, one per array. I will do some tests on my own.
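
A minimal sketch of that workaround (my own illustration, assuming hydro.txt contains the seven columns used above; the per-array file names are hypothetical):

import numpy as np

# One-off preprocessing, run outside the simulation: split hydro.txt column by column.
names = ['x_prof', 'y_prof', 'Te_prof', 'Ti_prof', 'ne_prof', 'vx0_prof', 'vy0_prof']
for name, column in zip(names, np.loadtxt('hydro.txt', unpack=True)):
    np.savetxt('hydro_%s.txt' % name, column)   # e.g. hydro_x_prof.txt

# In the namelist, each array is then read from its own, much smaller, file:
x_prof = np.loadtxt('hydro_x_prof.txt')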

juliencelia commented 3 years ago

It's quite strange! I now have this error message in my out file:

On rank 145 [Python] ValueError: too many values to unpack
ERROR src/Params/Params.cpp:1283 (runScript) error parsing BNH2d.py

and this in the err file:

Stack trace (most recent call last):

5 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in

4 Object "./smileiKNL", at 0x458568, in

3 Object "/lib64/libc.so.6", at 0x2ab2999d0544, in __libc_start_main

2 Object "./smileiKNL", at 0x8f379f, in main

1 Object "./smileiKNL", at 0x6fd5c7, in Params::Params(SmileiMPI*, std::vec$

0 Object "/lib64/libpthread.so.0", at 0x2ac8b93cb4fb, in raise

Segmentation fault (Signal sent by tkill() [0x89950001b20e])

I have already run simulations with hydro txt files (without broadcast) on 200 KNL nodes, and those txt files were more than 1 MB. Here it is 500 kB... The only difference was that the resolution was lower (15000 × 600).

When you say many files, do you have an idea of how many?

Thanks again Julien

mccoys commented 3 years ago

This error is pure Python. It means you are assigning more values than there are variables, which is not allowed; for instance, x, y = a, b, c.
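
A minimal illustration of the mismatch (my own example, unrelated to the namelist itself):

values = (1, 2, 3)
x, y, z = values   # OK: three names for three values
x, y = values      # ValueError: too many values to unpack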

juliencelia commented 3 years ago

Thanks Fred! I found my mistake. I relaunch the simulation...

jderouillat commented 3 years ago

I tried to run your case in a new environment (OpenMPI 4) directly on KNL, with 1600 MPI ranks on 400 nodes. It crashes during the interpolations done in the namelist.
It should complete 9600 interpolations; it crashes after 7168 (I print a message after each interp_prof).

slurmstepd-irene3002: error: Detected 2 oom-kill event(s) in step 5326988.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

You could maybe try 1 MPI rank per node × 128 OpenMP threads per node, as in the sketch below.
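
For reference, the OpenMP side of that layout is usually set through environment variables in the batch script (a sketch only; the binary name is the one used above, and the MPI-rank-per-node placement itself comes from your ccc_msub resource request, which I leave out here):

export OMP_NUM_THREADS=128
export OMP_SCHEDULE=dynamic
ccc_mprun ./smileiKNL BNH2d.py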

I will not run such large simulations without access to your GENCI allocation; I only have hours for some developments.

juliencelia commented 3 years ago

Thanks Julien. I will try with 1 MPI rank per node × 128 OpenMP threads. I will let you know what happens.

juliencelia commented 3 years ago

Hi Julien, I tried with your configuration and I got this error:

On rank 263 [Python] QhullError: QH6019 qhull input error: can not scale last$ or cospherical. Use option 'Qz' to add a point at infinity.

While executing: | qhull d Qt Qbb Qz Qc Q12
Options selected for Qhull 2015.2.r 2016/01/18: run-id 83609952 delaunay Qtriangulate Qbbound-last Qz-infinity-point Qcoplanar-keep Q12-no-wide-dup _pre-merge _zero-centrum Qinterior-keep Pgood

    [ERROR](0) src/Params/Params.cpp:1283 (runScript) error parsing BNH2d.py

And in my err file:

Stack trace (most recent call last):

5 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in

4 Object "./smileiKNL", at 0x458568, in

3 Object "/lib64/libc.so.6", at 0x2b33c96ec544, in __libc_start_main

2 Object "./smileiKNL", at 0x8f379f, in main

1 Object "./smileiKNL", at 0x6fd5c7, in Params::Params(SmileiMPI*, std::vec$

0 Object "/lib64/libpthread.so.0", at 0x2ab01b7ec4fb, in raise

Segmentation fault (Signal sent by tkill() [0x89950003f30d])

I can give you access to a future project if we obtain the hours (PRACE). The answer should arrive soon; I asked for 60M hours on IRENE ROME.

mccoys commented 3 years ago

It looks like this is a Python error due to some meshing or interpolation scheme. I believe there is a problem in your griddata call.

juliencelia commented 3 years ago

Thanks mccoys. I simplified the input by using the "nearest" method rather than "linear". That avoids having to fill in values where the data is unknown... It runs on a few nodes. I relaunched on 500 nodes. I keep hoping ;)
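
For the record, a minimal sketch of that change (my own illustration: x_prof, y_prof and ne_prof stand for the arrays loaded above, and the target grid is hypothetical):

import numpy as np
from scipy.interpolate import griddata

# Scattered hydro points and the density defined on them.
points = np.column_stack((x_prof, y_prof))

# Hypothetical target grid on which the namelist needs the profile.
xi, yi = np.meshgrid(np.linspace(x_prof.min(), x_prof.max(), 200),
                     np.linspace(y_prof.min(), y_prof.max(), 200))

# 'nearest' never produces NaN outside the convex hull, unlike 'linear'.
ne_grid = griddata(points, ne_prof, (xi, yi), method='nearest')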

And it seems that in Python:

x_prof = np.empty(3080,dtype='float64')
y_prof = np.empty(3080,dtype='float64')
...
if rank==0:
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)

is different from:

if rank==0:
    x_prof,y_prof,Te_prof,Ti_prof,ne_prof,vx0_prof,vy0_prof = np.loadtxt('hydro.txt', unpack=True)
else:
    x_prof = np.empty(3080,dtype='float64')
    y_prof = np.empty(3080,dtype='float64')
    ...

juliencelia commented 3 years ago

Hi,

The simulation stopped at the Smilei ASCII banner (Version: v4.4-706-gb5c12a5a-master):

Reading the simulation parameters

HDF5 version 1.8.20
Python version 2.7.14
Parsing pyinit.py
Parsing v4.4-706-gb5c12a5a-master
Parsing pyprofiles.py
Parsing BNH2d.py
Parsing pycontrol.py
Check for function preprocess()
python preprocess function does not exist

There is no error message. This was on 500 nodes (500 MPI ranks × 64 OpenMP threads). I am waiting for the hours from the PRACE project.

Thanks.

juliencelia commented 3 years ago

Hi Smilei Team and @jderouillat ,

I succeeded in obtaining 60M CPU hours on a PRACE21 project!! ;) That represents lots of huge SMILEI runs!

So Julien, if you have time and are still OK to look at this big simulation, I can give you access to this project to do some tests (you can send me your TGCC login so I can add you to the project).

Julien

juliencelia commented 3 years ago

Dear Smilei team,

Hope you are well.

I come back to this big simulation. I tried to launch it on the ROME AMD partition and got this error message:

On rank 0 [Python] ImportError: libmpi.so.20: cannot open shared object file:$
ERROR src/Params/Params.cpp:1283 (runScript) error parsing BNH2d.py

So I assumed it was an issue with mpi4py, as @jderouillat told me above. I tried to compile it after downloading it with git clone https://bitbucket.org/mpi4py/mpi4py.git.

When I ran "python setup.py build" I got this error:

Download error on https://pypi.python.org/simple/Cython/: [Errno 101] Network is unreachable -- Some packages may not be found!
Couldn't find index page for 'Cython' (maybe misspelled?)

I know this is not a Smilei issue, but let me know if you have an idea. Thanks in advance.

Julien

jderouillat commented 3 years ago

Hi Julien,
Did you set your Rome environment? Without it I get the same error; with a "module load python", it runs.
"Basic" Python modules are provided by the computing center.

juliencelia commented 3 years ago

Hi Julien,

Here are my job submission and env_rome files: job_sub.txt env_smilei_rome.txt

It seems everything is OK. Just one thing: the Smilei binary and the env files are all in my home directory on Irene, not on the Rome partition, but I can still access these files from the Rome environment, no?

When I do, from the Rome partition:

source /ccc/cont003/home/gen6129/bonvalej/.env_smilei_rome

I get an error on the intel module load...

module dfldatadir/own (Data Directory) cannot be unloaded
module ccc/1.0 (CCC User Environment) cannot be unloaded
ERROR: Unable to locate a modulefile for 'intel/18.0.3.222'
load module flavor/buildmpi/intelmpi/2018 (MPI build flavor)
load module flavor/buildcompiler/intel/19 (Compiler build flavor)
load module licsrv/intel (License service)
load module c/intel/19.0.5.281 (Intel C Compiler)
load module c++/intel/19.0.5.281 (Intel C++ Compiler)
load module fortran/intel/19.0.5.281 (Intel Fortran compiler)
load module feature/mkl/lp64 (MKL feature)
load module feature/mkl/sequential (MKL feature)
load module feature/mkl/single_node (MKL without mpi interface)
load module feature/mkl/vector/avx2 (MKL vectorization feature)
load module mkl/19.0.5.281 (Intel MKL LP64 Sequential without mpi interfaces)
load module intel/19.0.5.281 (Intel Compiler Suite)
load module feature/intelmpi/mpi_compiler/intel (MPI Compiler feature)
load module feature/intelmpi/net/ib/ofa (MPI Network backend feature)
load module mpi/intelmpi/2018.0.3.222 (Intel MPI)
load module python/2.7.14 (Python)

jderouillat commented 3 years ago

There are indeed fewer Intel compilers on the Rome partition than on Skylake.
Looking at old runs on Rome, they were present before.
You can ask the hotline whether this is a bug or not (the old IntelMPI libraries are still present!).

I'll propose as soon as possible an environment for the current default compiler (19.0.5.281) on Rome.

juliencelia commented 3 years ago

I agree with you. I ran this kind of simulation 6 months ago with a preparatory access and it worked. I will contact the hotline.

Thanks again Julien

juliencelia commented 3 years ago

Hi @jderouillat

With the hotline, we tried lots of things... up to using OpenMPI. They had me change CXXFLAGS to:

CXXFLAGS += -O2 -axCORE-AVX2,AVX,CORE-AVX512,MIC-AVX512 -mavx2 -ip -inline-factor=1000 -D__INTEL_SKYLAKE_8168 -qopt-zmm-usage=high -fno-alias #-ipo

with this environment:

module purge
module load intel/19.0.5.281
module load mpi/openmpi/4.0.2
module load hdf5/1.8.20 # do you know whether your hdf5 is compiled in parallel or serial?
export HDF5_ROOT_DIR=${HDF5_ROOT}
module load python/2.7.17

But Smilei does not compile...

I am not good enough at computing to understand their advice.

Do you think you could look at the issue with them?

jderouillat commented 3 years ago

Of course, but it may not be necessary. You'll find below a protocol to define the environment (I write it here also as a memo we can point other users to).

If IntelMPI is available we still recommend using it instead of OpenMPI, so first:

$ module unload mpi/openmpi

Doing this unloads the compiler, so reload it together with the associated IntelMPI:

$ module load intel/19.0.5.281
$ module load mpi/intelmpi/2019.0.5.281

Then check whether an HDF5 library is available and compatible with your MPI environment. I'm happy to discover that it's now the case:

$ module show hdf5/1.8.20
...
    4 : module load flavor/buildcompiler/intel/19 flavor/buildmpi/intelmpi/2019 flavor/hdf5/parallel
...

Load it as recommended by the computing center:

$ module load hdf5
$ module switch flavor/hdf5/serial flavor/hdf5/parallel
$ export HDF5_ROOT_DIR=${HDF5_ROOT} # This variable is recommended in the Smilei documentation

Compile with the ad hoc machine file (the recommended flag is -march=core-avx2):

$ make -j8 machine=joliot_curie_rome

I ran a small simulation on 4 AMD nodes with the binary generated by this process.
Don't hesitate to put me in copy of your exchange with the computing center.

jderouillat commented 3 years ago

Argh! It's not so simple with an additional Python.
Compilation is OK if we add a module load python/2.7.17, but at runtime there seems to be a conflict.

juliencelia commented 3 years ago

Yes, with module load python, when I compile I get this warning message:

/ccc/products/python-2.7.14/intel--17.0.4.196openmpi--2.0.2/default/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h(84): warning #2650: attributes ignored here
  NPY_CHAR NPY_ATTR_DEPRECATE("Use NPY_STRING"),
  ^

and with python 2.7.17, the same thing:

/ccc/products/python-2.7.17/intel--19.0.5.281openmpi--4.0.1/default/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h(84): warning #2650: attributes ignored here
  NPY_CHAR NPY_ATTR_DEPRECATE("Use NPY_STRING"),
  ^

jderouillat commented 3 years ago

This is only a warning; it could be resolved by reinstalling a Python toolchain in the new environment (not sure such a burden is worth the cost).
The real problem is that python/2.7.17 depends on OpenMPI, so the simplest option is to forget IntelMPI and the THREAD_MULTIPLE feature, using config=no_mpi_tm:

$ module load hdf5
$ module switch flavor/hdf5/serial flavor/hdf5/parallel
$ export HDF5_ROOT_DIR=${HDF5_ROOT}
$ module load python
$ make clean;make -j8 machine=irene_rome config=no_mpi_tm

I ran the same small simulation on a single AMD node in this environment (few resources are available now).

juliencelia commented 3 years ago

I have been trying since this morning, but it does not compile. I have an issue with loading the hdf5 module. The hotline is running tests. I will keep you posted...

jderouillat commented 3 years ago

Is your default environment modified (did you add something to ~/.bash_profile or ~/.bashrc)? Mine is not.

juliencelia commented 3 years ago

My bashrc is nearly empty (just a few exported environment variables). When I compile I always have a problem with hdf5:

module dfldatadir/own (Data Directory) cannot be unloaded
module ccc/1.0 (CCC User Environment) cannot be unloaded
load module flavor/hdf5/parallel (HDF5 flavor)
load module feature/mkl/single_node (MKL without mpi interface)
load module flavor/buildcompiler/intel/19 (Compiler build flavor)
load module flavor/buildmpi/openmpi/4.0 (MPI build flavor)
load module feature/openmpi/mpi_compiler/intel (MPI Compiler feature)
load module flavor/openmpi/standard (Open MPI flavor)
load module feature/openmpi/net/auto (MPI Network backend feature)
load module licsrv/intel (License service)
load module c/intel/19.0.5.281 (Intel C Compiler)
load module c++/intel/19.0.5.281 (Intel C++ Compiler)
load module fortran/intel/19.0.5.281 (Intel Fortran compiler)
load module flavor/libccc_user/hwloc2 (libccc_user flavor)
load module hwloc/2.0.4 (Hwloc)
load module flavor/hcoll/standard (hcoll flavor)
load module feature/hcoll/multicast/enable (Hcoll features)
load module sharp/2.0 (Mellanox backend)
load module hcoll/4.4.2938 (Mellanox hcoll)
load module pmix/3.1.3 (Process Management Interface (PMI) for eXascale)
load module flavor/ucx/standard (ucx flavor)
load module ucx/1.7.0 (Mellanox backend)
load module feature/mkl/lp64 (MKL feature)
load module feature/mkl/sequential (MKL feature)
load module feature/mkl/vector/avx2 (MKL vectorization feature)
load module mkl/19.0.5.281 (Intel MKL LP64 Sequential without mpi interfaces)
load module intel/19.0.5.281 (Intel Compiler Suite)
load module mpi/openmpi/4.0.2 (Open MPI)
load module python/2.7.14 (Python)
Cleaning build
Creating binary char for src/Python/pyprofiles.py
Creating binary char for src/Python/pycontrol.py
Creating binary char for src/Python/pyinit.py
Checking dependencies for src/Tools/tabulatedFunctions.cpp
Checking dependencies for src/Tools/Timer.cpp
Checking dependencies for src/Tools/Timers.cpp
Checking dependencies for src/Tools/userFunctions.cpp
Checking dependencies for src/Tools/Tools.cpp
Checking dependencies for src/Tools/H5.cpp
Checking dependencies for src/Tools/backward.cpp
Checking dependencies for src/Tools/PyTools.cpp
In file included from src/Tools/H5.cpp(1):
src/Tools/H5.h(4): catastrophic error: cannot open source file "hdf5.h"
  #include <hdf5.h>
           ^

jderouillat commented 3 years ago

I see the hdf5 flavor in your list but not the main hdf5 module (nor the "export HDF5_ROOT_DIR", which is used to find the hdf5.h file).

juliencelia commented 3 years ago

That is because I tried to load hdf5 parallel directly, without the switch. With the same environment as yours I have this issue:

unload module mpi/openmpi/4.0.2 (Open MPI)
unload module intel/19.0.5.281 (Intel Compiler Suite)
unload module mkl/19.0.5.281 (Intel MKL LP64 Sequential without mpi interfaces)
unload module feature/mkl/vector/avx2 (MKL vectorization feature)
unload module feature/mkl/sequential (MKL feature)
unload module feature/mkl/lp64 (MKL feature)
unload module ucx/1.7.0 (Mellanox backend)
unload module flavor/ucx/standard (ucx flavor)
unload module pmix/3.1.3 (Process Management Interface (PMI) for eXascale)
unload module hcoll/4.4.2938 (Mellanox hcoll)
unload module sharp/2.0 (Mellanox backend)
unload module feature/hcoll/multicast/enable (Hcoll features)
unload module flavor/hcoll/standard (hcoll flavor)
unload module hwloc/2.0.4 (Hwloc)
unload module flavor/libccc_user/hwloc2 (libccc_user flavor)
unload module fortran/intel/19.0.5.281 (Intel Fortran compiler)
unload module c++/intel/19.0.5.281 (Intel C++ Compiler)
unload module c/intel/19.0.5.281 (Intel C Compiler)
unload module licsrv/intel (License service)
unload module feature/openmpi/net/auto (MPI Network backend feature)
unload module flavor/openmpi/standard (Open MPI flavor)
unload module feature/openmpi/mpi_compiler/intel (MPI Compiler feature)
unload module flavor/buildmpi/openmpi/4.0 (MPI build flavor)
unload module flavor/buildcompiler/intel/19 (Compiler build flavor)
unload module feature/mkl/single_node (MKL without mpi interface)
module dfldatadir/own (Data Directory) cannot be unloaded
module ccc/1.0 (CCC User Environment) cannot be unloaded
load module flavor/hdf5/serial (HDF5 flavor)
load module flavor/buildcompiler/intel/19 (Compiler build flavor)
load module licsrv/intel (License service)
load module c/intel/19.0.5.281 (Intel C Compiler)
load module c++/intel/19.0.5.281 (Intel C++ Compiler)
load module fortran/intel/19.0.5.281 (Intel Fortran compiler)
load module feature/mkl/lp64 (MKL feature)
load module feature/mkl/sequential (MKL feature)
load module feature/mkl/single_node (MKL without mpi interface)
load module feature/mkl/vector/avx2 (MKL vectorization feature)
load module mkl/19.0.5.281 (Intel MKL LP64 Sequential without mpi interfaces)
load module intel/19.0.5.281 (Intel Compiler Suite)
load module hdf5/1.8.20 (HDF5)
unload module hdf5/1.8.20 (HDF5)
unload module flavor/hdf5/serial (HDF5 flavor)
load module flavor/hdf5/parallel (HDF5 flavor)

Loading hdf5/1.8.20
ERROR: hdf5/1.8.20 cannot be loaded due to missing prereq.
HINT: the following module must be loaded first: mpi

Switching from flavor/hdf5/serial to flavor/hdf5/parallel
WARNING: Reload of dependent hdf5/1.8.20 failed
load module flavor/buildmpi/openmpi/4.0 (MPI build flavor)
load module feature/openmpi/mpi_compiler/intel (MPI Compiler feature)
load module flavor/openmpi/standard (Open MPI flavor)
load module feature/openmpi/net/auto (MPI Network backend feature)
load module flavor/libccc_user/hwloc2 (libccc_user flavor)
load module hwloc/2.0.4 (Hwloc)
load module flavor/hcoll/standard (hcoll flavor)
load module feature/hcoll/multicast/enable (Hcoll features)
load module sharp/2.0 (Mellanox backend)
load module hcoll/4.4.2938 (Mellanox hcoll)
load module pmix/3.1.3 (Process Management Interface (PMI) for eXascale)
load module flavor/ucx/standard (ucx flavor)
load module ucx/1.7.0 (Mellanox backend)
load module mpi/openmpi/4.0.2 (Open MPI)
load module python/2.7.14 (Python)
Cleaning build
Creating binary char for src/Python/pyprofiles.py
Creating binary char for src/Python/pycontrol.py
Creating binary char for src/Python/pyinit.py
Checking dependencies for src/Tools/tabulatedFunctions.cpp
Checking dependencies for src/Tools/Timer.cpp
Checking dependencies for src/Tools/userFunctions.cpp
Checking dependencies for src/Tools/Timers.cpp
Checking dependencies for src/Tools/Tools.cpp
Checking dependencies for src/Tools/H5.cpp
Checking dependencies for src/Tools/backward.cpp
Checking dependencies for src/Tools/PyTools.cpp
In file included from src/Tools/H5.cpp(1):
src/Tools/H5.h(4): catastrophic error: cannot open source file "hdf5.h"
  #include <hdf5.h>
           ^

I retried after adding module load mpi/openmpi/4.0.2. Smilei compiles now!

I just have to check whether scipy.interpolate can be imported with that Python configuration. I need it in many of my runs to interpolate hydrodynamics data ;)

juliencelia commented 3 years ago

Hi @jderouillat

The problem is still the use of the scipy module. I tried to compile SMILEI with Python 3, with this export in the env file, but the compilation crashes: export PYTHONEXE=$PYTHON_EXEDIR

The hotline advised me to use Smilei 4.1 via module load smilei, but there have been lots of changes since version 4.1...

I am puzzled by that.

jderouillat commented 3 years ago

By default, the python module provided by the computing center does not set the following variable, so the Python library associated with the smilei binary is the system one (check with ldd PATH_TO/smilei). Can you resubmit your job, adding to your batch script (after the module load python):

export LD_LIBRARY_PATH=$PYTHON_ROOT/lib:$LD_LIBRARY_PATH

I do not set LD_PRELOAD, and just use ccc_mprun ./smilei BNH2d.py
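
Put together, the relevant part of the batch script could look like this (a sketch only: the module load, the export and ccc_mprun come from this thread, the ldd line is just an optional check of which Python library the binary picks up):

module load python
export LD_LIBRARY_PATH=$PYTHON_ROOT/lib:$LD_LIBRARY_PATH
ldd ./smilei | grep libpython   # should now point to the module's Python, not the system one
ccc_mprun ./smilei BNH2d.py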

juliencelia commented 3 years ago

I am sorry, but the simulation still crashes at the same point:

ImportError: libmpi.so.20: cannot open shared object file:

My Rome environment for compiling is the following:

module purge
module load mpi/openmpi/4.0.2
module load hdf5
module switch flavor/hdf5/serial flavor/hdf5/parallel
export HDF5_ROOT_DIR=${HDF5_ROOT}
module load python

My compile script:

source /ccc/cont003/home/ra5390/bonvalej/.env_smilei_rome

# compile smilei
cd Smilei_hub
make clean
make -j8 machine=joliot_curie_rome config=no_mpi_tm
mv smilei smileirome
mv smilei_test smileirome_test
cd ../.

I really don't understand...