compas / grasp

General Relativistic Atomic Structure Package
https://compas.github.io/grasp/
MIT License

setmcp: nblock=0 issue when running rmcdhf_mpi on HPC cluster #78

Closed: momierr closed this issue 2 years ago

momierr commented 2 years ago

Hello,

I am trying to use GRASP2018 on my university's HPC cluster. I managed to compile the package correctly, and the non-MPI versions of the programs work as expected in an interactive session.

Now, say I want to submit a job to the HPC cluster (which is managed by Sun Grid Engine) consisting of a single run of rmcdhf_mpi on 16 cores, using the script below:

[Screenshot of the SGE job script (Screenshot 2021-12-07 at 17 27 23)]
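The job script itself is visible only in the screenshot; as a point of reference, a minimal SGE submission script of this kind might look like the sketch below. The job name, parallel-environment name, and the redirected input file are assumptions rather than the actual values used.

```sh
#!/bin/bash
#$ -N rmcdhf_test        # job name (assumed)
#$ -cwd                  # start in the submission directory
#$ -pe mpi 16            # request 16 slots; the PE name is site-specific (assumed)
#$ -j y                  # merge stdout and stderr

# The GRASP executables are assumed to be on PATH already.
# rmcdhf_mpi normally reads its answers interactively from stdin, so in a
# batch job the input is redirected from a prepared answer file (name assumed).
mpirun -np 16 rmcdhf_mpi < rmcdhf.inp > rmcdhf.out
```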

The executables are on the PATH, and the job is launched from the directory containing all the needed files (mcp.*, isodata, etc., as well as the 'disks' file).

What I understand from the output of SGE (text file attached) is that rmcdhf_mpi gets launched and knows where to store the temporary data according to the disks file, but it doesn't seem to know where to get the input files; it therefore sets the number of blocks to 0, which aborts the calculation.

job_output.txt

Any idea how to solve that? I am probably making very stupid mistakes but I am only a beginner. 😄 Thanks, Rodolphe

jongrumer commented 2 years ago

Hi Rodolphe,

Great that you're trying out GRASP for HPC!

It's been many years since I ran the codes through a batch system, but let's see what we can do. For starters, can you show us what your disks file looks like? The first line specifies the runtime directory (where the i/o files are stored). The MCP files are written to/read from the temporary directories specified in the subsequent lines of disks when you run rangular_mpi + rmcdhf_mpi (i.e. not in the working directory as you wrote!). Also, make sure you specify the number of processes somehow (without a batch system the call would be something like mpirun -np 20 rangular_mpi).
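To make the layout concrete, a hypothetical disks file for, say, four MPI processes could be created like this (the paths are placeholders, not a recommendation):

```sh
# sketch only: first line = runtime directory for the usual i/o files,
# subsequent lines = temporary directories for the MCP files
cat > disks <<'EOF'
/home/user/grasp/rundir/
/scratch/user/grasp_tmp/
/scratch/user/grasp_tmp/
/scratch/user/grasp_tmp/
/scratch/user/grasp_tmp/
EOF
```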

All the best, Jon

momierr commented 2 years ago

Hi, please find the disks file hereafter:

disks.txt

The first line, '/work/icb/ro0028mo/atst/test/', is the directory I run the job from; the 16 other lines are for the temporary files, as I understood from the GRASP2018 manual (I put /work/icb/ro0028mo/ here only for testing purposes, although each machine of the cluster has its own /tmp directory).

Before the job crashes, folders 000 to 015 are created by rmcdhf_mpi in the directory specified in the disks file. Note that I ran just rangular (not rangular_mpi) to obtain all the MCP files, which are therefore located in my "test" directory, along with isodata and the other required files.

As I understand it, mpiib is a command specific to our HPC cluster, used to run MPI processes over the InfiniBand network. As can be seen from the crash report, it is equivalent to mpirun -dapl -np 16 rmcdhf_mpi.

I hope my answer is useful to you; thanks for giving me such quick feedback. Best, Rodolphe

jongrumer commented 2 years ago

Alright, so you need to run rangular_mpi with the same disks file then. You can't combine serial rangular with rmcdhf_mpi.
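Concretely, the two MPI steps in the batch job would then look something like the sketch below (a generic mpirun call is shown instead of a site-specific wrapper, and the redirected input files are assumed, prepared answer files):

```sh
# Both steps use the same 'disks' file and the same number of processes.
mpirun -np 16 rangular_mpi < rangular.inp > rangular.out   # writes the MCP files to the temp dirs
mpirun -np 16 rmcdhf_mpi < rmcdhf.inp > rmcdhf.out         # reads them back from the same temp dirs
```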

Try it! :)

Cheers, Jon

momierr commented 2 years ago

I tried... and it worked! Thanks again, Jon, for such helpful and quick feedback. 😄 All the best, Rodolphe

jongrumer commented 2 years ago

No worries, Rodolphe, happy to help! Let me/us know if you have further problems or need advice on how to set up your correlation model. Which group do you work with, by the way? Just curious where GRASP is used 🧐😅

Cheers, Jon

momierr commented 2 years ago

I am a first-year PhD student working with Prof. Claude Leroy (Dijon, France), Prof. David Sarkisyan, and Prof. Aram Papoyan (Yerevan, Armenia): https://www.researchgate.net/profile/Rodolphe-Momier if you want to have a look! I am just getting familiar with GRASP, the goal being the computation of some HFS parameters for states where experimental data is missing. 😄 Cheers, Rodolphe