SmileiPIC / Smilei

Particle-in-cell code for plasma simulation
https://smileipic.github.io/Smilei

Problem running radiation reaction module on more than one core #24

Closed ujjwalsinhaist closed 5 years ago

ujjwalsinhaist commented 6 years ago

It is mentioned in the instructions for writing the Smilei namelist for radiation reaction and the multiphoton Breit-Wheeler process that, if the tables are not provided, the code generates tables of its own. However, this is not the case for me. When I try to run the code without the tables, it does not run. But when I copy the tables from the databases directory to the folder where I am running my simulation, the code runs fine. Is there any way to generate the files automatically, or do we have to have the radiation and multiphoton Breit-Wheeler tables beforehand?

The second issue is that when I run my simulations with the multiphoton Breit-Wheeler process, with the table in the same folder, my simulation runs perfectly. But when I run the radiation reaction module, it runs fine on a single core; as soon as I try to run on more than one core, the code crashes with the following error message:

Initializing radiation reaction
 --------------------------------------------------------------------------------
         The Monte-Carlo Compton radiation module is requested by some species.

         Factor classical raidated power: 2.0051e+03
         Threshold on the quantum parameter for radiation: 1.0000e-03
         Threshold on the quantum parameter for discontinuous radiation: 1.0000e-02
         Table path: ./

         --- Integration F/chipa table:
             Reading of the external database
             Dimension quantum parameter: 256
             Minimum particle quantum parameter chi: 1.0000e-03
             Maximum particle quantum parameter chi: 1.0000e+01
             Buffer size: 2068
         done in 1.5149e-02s
         --- Table chiphmin and xip:
[lfq00:103105] *** Process received signal ***
[lfq00:103105] Signal: Segmentation fault (11)
[lfq00:103105] Signal code: Address not mapped (1)
[lfq00:103105] Failing at address: 0x7fffa74d48d8
[lfq00:103105] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7f3a64d4c370]
[lfq00:103105] [ 1] /lib64/libc.so.6(cfree+0x1c)[0x7f3a6381538c]
[lfq00:103105] [ 2] /home/theo/ujjwal/smilei-v-3.3/smilei(_ZN15RadiationTables21read_integfochi_tableEP9SmileiMPI+0x455)[0x54d055]
[lfq00:103105] [ 3] /home/theo/ujjwal/smilei-v-3.3/smilei(_ZN15RadiationTables24compute_integfochi_tableEP9SmileiMPI+0x54)[0x54d9e4]
[lfq00:103105] [ 4] /home/theo/ujjwal/smilei-v-3.3/smilei(_ZN15RadiationTables14compute_tablesER6ParamsP9SmileiMPI+0x33)[0x5521d3]
[lfq00:103105] [ 5]              Reading of the external database
             Dimension particle chi: 128
             Dimension photon chi: 128
             Minimum particle chi: 1.0000e-03
             Maximum particle chi: 1.0000e+01
             Buffer size for MPI exchange: 132120
/home/theo/ujjwal/smilei-v-3.3/smilei(main+0x135b)[0x42ed7b]
[lfq00:103105] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f3a637b6b35]
[lfq00:103105] [ 7] /home/theo/ujjwal/smilei-v-3.3/smilei[0x42fb5f]
[lfq00:103105] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node lfq00 exited on signal 11 (Segmentation fault).
xxirii commented 6 years ago

Hi ujjwalsinhaist,

First of all, it is great that you tried to use the QED modules. It is really important to have feedback like this.

First issue: you can generate the tables without having the default ones in the same folder. You can also change the path to the tables in the input file instead of making a local copy of the databases. If the code does not detect the tables, they are generated automatically, based on the parameters given in the input file or on the defaults otherwise. You will receive an error saying that there are convergence problems; this is not really an issue, just a warning. Despite these messages, the tables will be generated. I should improve the code so that these warnings are less redundant. They come from the Bessel function computation, which has strict accuracy and convergence constraints.
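
As an illustration, a namelist fragment along these lines should do it. Take this as a sketch rather than a verbatim recipe: the exact block and parameter names (RadiationReaction, MultiphotonBreitWheeler, table_path) depend on the Smilei version you are running, so check the namelist documentation for your version.

# Namelist fragment (Smilei namelists are Python scripts), to be merged
# into a namelist such as tst1d_10_pair_electron_laser_collision.py.
# Block/parameter names below are assumptions; verify them for your version.

RadiationReaction(
    # Point to the directory holding the radiation tables, e.g. the
    # "databases" folder of the source tree, instead of copying the
    # files next to the namelist.
    table_path = "/path/to/smilei/databases"
)

MultiphotonBreitWheeler(
    # Same idea for the multiphoton Breit-Wheeler tables.
    table_path = "/path/to/smilei/databases"
)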

Second issue: this one, however, is not normal, and I would appreciate it if you could tell me which input file you use and give me your computer configuration. If you use your own input file, can you send it to me at mathieu.lobet@cea.fr? Thank you.

ujjwalsinhaist commented 6 years ago

Hi Mathieu,

Thank you very much for the reply. To begin with, I am using the benchmark input files provided in the benchmarks folder. When I use the input file tst1d_10_pair_electron_laser_collision.py without the database files, I get the following error in the radiation reaction section:

Initializing radiation reaction
 --------------------------------------------------------------------------------
         The Monte-Carlo Compton radiation module is requested by some species.

         Factor classical raidated power: 2.0051e+03
         Threshold on the quantum parameter for radiation: 1.0000e-03
         Threshold on the quantum parameter for discontinuous radiation: 1.0000e-02
         Table path: ./

         --- Integration F/chipa table:
             MPI repartition:
             Rank: 0 imin: 0 length: 32
             Rank: 1 imin: 32 length: 32
             Rank: 2 imin: 64 length: 32
             Rank: 3 imin: 96 length: 32
             Computation:
    [ERROR](0) src/Tools/userFunctions.cpp:148 (modified_bessel_IK) x too large in modified_bessel_IK; try asymptotic expansion
    [ERROR](0) src/Tools/userFunctions.cpp:148 (modified_bessel_IK) x too large in modified_bessel_IK; try asymptotic expansion
    [ERROR](0) src/Tools/userFunctions.cpp:148 (modified_bessel_IK) x too large in modified_bessel_IK; try asymptotic expansion
    [ERROR](0) src/Tools/userFunctions.cpp:148 (modified_bessel_IK) x too large in modified_bessel_IK; try asymptotic expansion
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32978,1],1]
  Exit code:    1
-----------------------------------------------------------------------------------------------------------------
For the second issue I am also using the same input file. The configuration of my cluster is:

Dell PowerEdge R620 
RAM: 256GB 
CPU: 2 x Intel Xeon E5-2670 
16 Core, 2.6 GHz 
Scientific Linux 6

On one core, the code runs fine. When I use more than one core, I get the error message mentioned in my previous post. However, this is not the case when I use only the Breit-Wheeler process.

For both the radiation reaction and multiphoton Breit-Wheeler processes, I have to either copy the database files or give the path in the input file.

Thank you,

xxirii commented 6 years ago

The few tests I have done today with this script work fine for me.

Can you show me how you run the simulation (job script)? Also, which compiler, MPI and HDF5 versions do you use? I will try to get closer to your configuration.

ujjwalsinhaist commented 6 years ago

I ran with:
-- 4 MPI ranks / 1 OMP thread per rank
-- 2 MPI ranks / 1 OMP thread per rank

I used the GNU compiler with openmpi-2.0.1 and hdf5-1.8.16 on my cluster. Please find my job script below:

#!/bin/bash
# ------------------------------------
# our name
#$ -N smilei_testrun
#
# pe request 2 slots
#$ -pe openmpi 2
#
#
#Tell the grid engine to use the current working
#directory to execute the script. The current working directory
#is the directory in which you are when you execute qsub (optional
#but I recommend it)
#$ -cwd
#
#The queue in which you want to schedule your task. This is optional
#but for short jobs you may want to set the following which will 
#schedule it in the short queue (ie. maximum job runtime on this queue
#is 48 hours!). 
#####$ -q sht.q
#
#Redirect the standard error file descriptor to this file. $JOB_ID
#is the unique id your job gets upon submission.
#$ -e stderr.$JOB_ID
#
#The same as above but for the standard output file descriptor
#$ -o stdout.$JOB_ID
#
#The mail address to which you want status notifications about your
#job to be delivered to (optional)
#$ -M ujjwal@mpi-hd.mpg.de
#
#The types of notifications you want to receive (required if you want
#notifications). The letters represent different types and can be
#arranged in any way (so "sbea" instead of "base" would have the same
#effect). The types are: "b" - beginning of job, "e" - end of job,
#"a" - abort or reschedule of your job, "s" - suspension of job
#$ -m ase
#
#
ulimit -c 80000000

echo "Got $NSLOTS."
PATH=$PATH:/usr/local/Packages/openmpi-2.0.1-gcc62-sl7/bin
LD_LIBRARY_PATH=/home/theo/ujjwal/hdf5_1_8_16/lib
export LD_LIBRARY_PATH
export OMP_NUM_THREADS=1
mpirun --mca io romio314 ~/smilei-v-3.3/smilei tst1d_10_pair_electron_laser_collision.py
xxirii commented 6 years ago

Hi,

I have discussed this issue with colleagues, and they told me they had some problems in the past with openmpi 2.x. However, I did a lot of tests with different versions of openmpi and hdf5 and I do not get your errors.

Do you think I could get access to your supercomputer with a small amount of computational hours, just to understand the bug and try to correct it? Also, can you try different MPI implementations on your supercomputer, or at least run this test on your local computer?

Thank you

ujjwalsinhaist commented 6 years ago

Hi Mathieu,

I did tests with openmpi and mpich 3.2, and I have the following conclusions:

                 openmpi 2.0    openmpi 2.0.1    mpich 3.2
no RR / no BW        yes             yes            yes
no RR / BW           yes             yes            yes
RR / no BW           No              No             yes
RR / BW              No              No             yes

So, for me Smilei works best with mpich 3.2. I have no idea why this is the case.

Regarding access to my institute's cluster, I talked to the administrator and he said it is not possible under the rules of the institute. I am sorry about that.

I would like to thank you for your suggestions and help.

Many thanks again

jderouillat commented 6 years ago

Thank you for this feedback. Can I ask you for an additional test, similar to what you already produced? It consists in executing RR/BW with OpenMPI (I think only one of the two releases is relevant) but with OpenMP disabled (make config=noopenmp). I know that you were using only 1 thread, but it changes what happens inside the MPI library itself.

ujjwalsinhaist commented 6 years ago

Hi jderouillat, I tried running the input file tst1d_10_pair_electron_laser_collision.py, which has both RR and BW, with OpenMP disabled (i.e. using make config=noopenmp). I got the following error:

/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
/home/theo/ujjwal/copy-smilei-v3.3/smilei: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/theo/ujjwal/copy-smilei-v3.3/smilei)
mccoys commented 6 years ago

@jderouillat If it does not work with openmpi 2.x, could it be related to the issue we had with that vader option?

jderouillat commented 6 years ago

The error that you reported is weird. Are you sure that you completely cleaned your previous compilation?
Or that your HDF5 is the OpenMPI-compatible build and not the mpich one?

@mccoys, I am not sure about the vader option. We were able to reproduce that one, not this one, but @ujjwalsinhaist you can test:

$ mpirun --mca btl ^vader -np 2 ~/smilei-v-3.3/smilei tst1d_10_pair_electron_laser_collision.py
mccoys commented 6 years ago

@ujjwalsinhaist

Have you been able to fix your problem? Were you able to try our suggestions? We are interested in how this issue can be solved.

ujjwalsinhaist commented 6 years ago

Hi mccoys, I'm sorry I could not respond earlier, as I was out of the EU for some time and did not see your post. I have tried the vader option, but it did not work. When I do not use RR and BW, Smilei works fine with openmpi. It is only when I use RR that the code works only with mpich 3.2.

mccoys commented 5 years ago

Seems outdated considering last year's changes. Closing.