TRIQS / cthyb

A fast and generic hybridization-expansion solver
https://triqs.github.io/cthyb

CTHYB with intel-MPI #88

Closed sabrygad closed 6 years ago

sabrygad commented 6 years ago

Hi;

I am doing LDA+DMFT. I built all the TRIQS applications using Intel MPI; however, the code is 5-10x slower than a benchmark case.

I also noticed that the time printed in my stdout file (job.out) is much smaller than what the job actually takes. For example, here it reported 889 seconds,

    [Node 0] Simulation lasted: 889 seconds
    [Node 0] Number of measures: 350000

but the actual wall time for each DMFT step is ~5,000 seconds, so about 6x slower.

So maybe the data exchange between cores is slow.

Do I have to use Open-MPI instead?

Thanks a lot; Sabry

sabrygad commented 6 years ago

... here are the steps I used to get TRIQS to "work" (they differ a little from the documentation, since the documented steps did not work for me):

    module load python/anaconda-5.0.1 cmake boost mkl intel-mpi gcc/7.2.0
    conda create -n dft python=2 numpy scipy matplotlib tornado mako jinja2 pyzmq h5py mpi4py
    source activate dft
    mkdir TRIQS
    cd TRIQS
    git clone https://github.com/TRIQS/triqs
    mkdir triqsbuild
    cd triqs
    git checkout 1.4.1
    sed -i '43s/${MKL_PATH_WITH_PREFIX}/"${MKL_PATH_WITH_PREFIX}"/' cmake/FindLapack.cmake
    cd /projects/academic/kofke/software/TRIQS/triqsbuild
    CC=gcc CXX=g++ CFLAGS=-pthread cmake -DPYTHON_LIBRARY=/user/sabrygad/.conda/envs/dft/lib/libpython2.7.so ../triqs
    make -j12
    make test
    make install

Does this have anything to do with the slow behavior I see?

Thanks; Sabry

krivenko commented 6 years ago

Hi,

    [Node 0] Simulation lasted: 889 seconds
    [Node 0] Number of measures: 350000
    but the actual timing for each DMFT step is ~5,000; so about 6x slower.

What exactly is 6 times slower: a run of the impurity solver (CTHYB), or the solution of the self-consistency equations? If the slowdown is seen in calls to components of dft_tools, you should open an issue on the corresponding issue tracker (I have almost no experience with dft_tools and therefore cannot help much in that case).

Also, I would recommend passing -DCMAKE_BUILD_TYPE=Release as part of the CMake command line when building TRIQS and its applications.
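
For example, a minimal sketch of the configure step, reusing the paths from your build recipe above (those paths are specific to your machine, so adapt as needed):

    cd /projects/academic/kofke/software/TRIQS/triqsbuild
    # Release enables compiler optimizations; without it the build may be unoptimized
    CC=gcc CXX=g++ CFLAGS=-pthread cmake \
        -DCMAKE_BUILD_TYPE=Release \
        -DPYTHON_LIBRARY=/user/sabrygad/.conda/envs/dft/lib/libpython2.7.so \
        ../triqs
    make -j12 && make test && make install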

parcollet commented 6 years ago

Check the MKL. We had a similar problem recently: MKL was threading by default, which is not a good idea when running MPI on the nodes. Try something like export OMP_NUM_THREADS=1 in the environment variables.
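
A minimal sketch, assuming a bash-like shell; MKL_NUM_THREADS is an extra MKL-specific knob that is not strictly required but can be pinned as well:

    # force single-threaded execution per MPI rank
    export OMP_NUM_THREADS=1
    export MKL_NUM_THREADS=1   # MKL-specific; an additional safeguard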

sabrygad commented 6 years ago

Thanks for your reply.

It looks like neither CTHYB nor anything else is to blame. We just learned something interesting about this problem: on our SLURM system, the code runs slowly as soon as I use more than one core per node ("--ntasks-per-node"), regardless of how many nodes ("--nodes") I request. Once I fix the number of cores per node to 1, the code runs much faster (~6x), as it should based on a benchmark we have. Setting OMP_NUM_THREADS=1 did not change the CPU time in either the slow or the fast case.

I am afraid it has to do with the mpirun call used in the WIEN2k run_lapw script that invokes CTHYB. As you know, srun is recommended with SLURM; however, it did not work for more than a total of one core. For example, with this command in the submit script:

srun  -n 2 pytriqs case.py

I get these errors:

    h5repack error: : Could not create file
    IOError: Unable to create file (unable to open file: name = 'case.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)
    h5repack error: : unable to open file
    srun: error: cpn-m26-15-01: task 0: Killed

So it seems to be a matter of how pytriqs talks to mpirun/srun; does TRIQS only work with mpirun?

Thanks; Sabry

aichhorn commented 6 years ago

This indeed looks very much like a threading problem. One comment: are you sure that setting OMP_NUM_THREADS=1 is really applied in your SLURM environment, and not just in the shell where you submit the job? These can differ; I had a similar issue on a supercomputer some time ago. There, environment variables had to be set explicitly in the submit script, since the job did not inherit them from the submission shell. If nothing works, moving to different libraries (no MKL, no Intel MPI) could be a solution.
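
A minimal sketch of such a submit script, with placeholder job name and resources, and assuming either srun or mpirun works on your site:

    #!/bin/bash
    #SBATCH --job-name=dmft            # placeholder
    #SBATCH --nodes=1                  # placeholder
    #SBATCH --ntasks-per-node=12       # placeholder

    # set the threading variables inside the job script itself,
    # so the compute nodes are guaranteed to see them
    export OMP_NUM_THREADS=1
    export MKL_NUM_THREADS=1

    srun pytriqs case.py               # or: mpirun -np $SLURM_NTASKS pytriqs case.py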

aichhorn commented 6 years ago

And do set 'repacking=False' in the SumK initialisation in your case.py script. Repacking is really not needed in your case, and disabling it should at least resolve your h5repack problem.

HugoStrand commented 6 years ago

I am closing this due to inactivity. Please reopen if needed. Best, Hugo