gdtk-uq / gdtk

The Gas Dynamics Toolkit (GDTk) is a set of software tools for simulating high speed fluid flow, maintained at The University of Queensland and the University of Southern Queensland, Australia.
https://gdtk.uqcloud.net/

HYDU_create_process execvp error e4-nk-dist permission denied on Cluster but working fine on Local machine #43

Closed hkishnani closed 8 months ago

hkishnani commented 8 months ago

Error while running on Cluster. Shows

HYDU_create_process (lib/utils/launch.c:73): execvp error on file e4-nk-dist.

hkishnani commented 8 months ago

I am trying to run Eilmer on a cluster. The test case is the SWBLI example in gdtk/src/examples. It uses the steady-state solver, which I have tested on my local machine with the same number of grid partitions, and it runs perfectly well there. I submit the job script with the command

mpirun -n 20 e4-nk-dist --job=swbli

The error that pops up is:

[proxy:0:0@falcon1.cluster.local] HYDU_create_process (lib/utils/launch.c:73): HYDU_create_process (lib/utils/launch.c:73): execvp error on file e4-nk-dist (Permission denied) execvp error on file e4-nk-dist (Permission denied)

Can anyone help?

pajacobs-ghub commented 8 months ago

This looks to be a problem local to your cluster and/or collection of files.

What happens when you run a job that tries to just start the program and get its help message? For example:

peterj@helmholtz ~ $ e4-nk-dist --help
Eilmer 4.0 compressible-flow simulation code -- using Newton-Krylov accelerator.
Revision-id: 5fe12bf3
Revision-date: Sat Dec 16 09:21:39 2023 +1000
Compiler-name: ldc2
Build-date: Mon 18 Dec 2023 05:57:51 AEST
Build-flavour: fast
Parallelism: Distributed memory with message passing (MPI), number of tasks 1
Usage: e4-nk-dist [OPTIONS]
OPTIONS include the following:
Option:                     Comment:
--job=                      file names built from this string
--verbosity=                defaults to 0
--snapshot-start=|last      defaults to 0
--threads-per-mpi-task=     defaults to 1
--max-wall-clock=           in seconds
--help                      writes this message
peterj@helmholtz ~ $

hkishnani commented 8 months ago

When I run e4-nk-dist or e4-nk-dist --help, I get Permission denied.

[himanshu@falcon1 ~]$ e4-nk-dist
-bash: /work/home/himanshu/gdtkinst/bin/e4-nk-dist: Permission denied
[himanshu@falcon1 ~]$ e4-nk-dist --help
-bash: /work/home/himanshu/gdtkinst/bin/e4-nk-dist: Permission denied

I should also mention that Eilmer is installed in my local space and I have added its path to .bashrc. A few points to note:

1.) The cluster runs CentOS 7.

2.) The make command was as follows:

make WITH_MPICH=1 DLINKFLAGS=--linker='' FLAVOUR=fast WITH_E4DEBUG=1 WITH_COMPLEX_NUMBERS=1 WITH_NK=1 install

Without DLINKFLAGS the make failed.

3.) Cluster has MPICH 4.1.3 installed.

4.) I also tried the following commands as given on Eilmer's website:

[himanshu@falcon1 ~]$ ln -s ${HOME}/gdtkinst/share/gdtk-module modules/gdtk/production

[himanshu@falcon1 ~]$ module use ${HOME}/UserPackages/modules
[himanshu@falcon1 ~]$ module load gdtk/production

after which it shows:

[himanshu@falcon1 ~]$ module list
Currently Loaded Modulefiles:
  1) oneapi/2022.3/tbb/latest           6) compilers/gcc/13.2.0
  2) oneapi/2022.3/compiler-rt/latest   7) apps/cmake/3.28
  3) oneapi/2022.3/mkl/2022.2.0         8) apps/mpich/4.1.2
  4) oneapi/2022.3/mpi/latest           9) apps/lapack/3.11
  5) apps/parmetis/4.0.3               10) gdtk/production

5.) We are using PBS for jobscript.

Thanks a ton Peter. Big fan of Eilmer ~Himanshu
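When the shell itself reports "Permission denied" for a program, a few basic file checks can narrow things down before MPI is involved at all. The following is only a minimal sketch, not something taken from the Eilmer documentation: check_exec is a hypothetical helper name, and the noexec check assumes the findmnt utility is available on the cluster. It does not cover every possible cause (for example, a badly linked binary).

```shell
# Hypothetical helper: report common reasons for "Permission denied"
# when trying to start a program file.
check_exec() {
    local prog="$1"
    if [ ! -e "$prog" ]; then
        echo "missing: $prog"
        return 1
    fi
    if [ ! -f "$prog" ]; then
        echo "not a regular file: $prog"
        return 1
    fi
    if [ ! -x "$prog" ]; then
        echo "no execute permission: $prog"
        return 1
    fi
    echo "execute bit ok: $prog"
    # Best effort: a filesystem mounted with "noexec" refuses to run
    # programs even when the execute bit is set.
    if command -v findmnt >/dev/null 2>&1; then
        findmnt -no OPTIONS --target "$prog" | grep -qw noexec \
            && echo "warning: filesystem is mounted noexec"
    fi
    return 0
}
```

It could be called as check_exec "$HOME/gdtkinst/bin/e4-nk-dist" on the login node and again from inside a job.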

uqngibbo commented 8 months ago

Hi Himanshu,

Can you check the contents of your gdtkinst/bin and post the output?

$ cd gdtkinst/bin
$ ls -la

It would also be helpful to see $ which e4shared and $ echo $PATH | tr ':' '\n'

hkishnani commented 8 months ago

Hi Nick, I would also like to highlight that, even for e4mpi and e4zmpi, I get permission denied output.

But e4shared is running perfectly well. I tested the sharp_cone_20_degrees case and everything is fine there, so why are e4mpi and e4-nk-dist having this problem?

The outputs of the requested commands follow:

1.)

$ cd gdtkinst/bin
$ ls -la

[himanshu@falcon1 bin]$ ls -la
total 234048
drwxrwxr-x 2 himanshu himanshu 4096 Dec 18 00:20 .
drwxrwxr-x 6 himanshu himanshu 4096 Dec 18 00:20 ..
-rwxrwxr-x 1 himanshu himanshu 13560 Dec 18 00:20 chemkin2eilmer
-rwxr-xr-x 1 himanshu himanshu 290880 Dec 18 00:20 dgd-lua
-rwxr-xr-x 1 himanshu himanshu 210040 Dec 18 00:20 dgd-luac
-rwxrwxr-x 1 himanshu himanshu 4143 Dec 18 00:20 e4compact
-rwxrwxr-x 1 himanshu himanshu 17634 Dec 18 00:20 e4console.tcl
-rwxrwxr-x 1 himanshu himanshu 10692 Dec 18 00:20 e4forces
-rwxrwxr-x 1 himanshu himanshu 17374032 Dec 18 00:20 e4loadbalance
-rwxrwxr-x 1 himanshu himanshu 1263536 Dec 18 00:20 e4monitor
-rwxrwxr-x 1 himanshu himanshu 19790600 Dec 18 00:20 e4mpi
-rwxrwxr-x 1 himanshu himanshu 27345616 Dec 18 00:20 e4-nk-dist
-rwxrwxr-x 1 himanshu himanshu 19861488 Dec 18 00:20 e4-nk-dist-real
-rwxrwxr-x 1 himanshu himanshu 27025240 Dec 18 00:20 e4-nk-shared
-rwxrwxr-x 1 himanshu himanshu 19543032 Dec 18 00:20 e4-nk-shared-real
-rwxrwxr-x 1 himanshu himanshu 5228 Dec 18 00:20 e4-prep-parallel
-rwxrwxr-x 1 himanshu himanshu 5862 Dec 18 00:20 e4-prep-restart
-rwxrwxr-x 1 himanshu himanshu 20297680 Dec 18 00:20 e4shared
-rwxrwxr-x 1 himanshu himanshu 29620504 Dec 18 00:20 e4shared-debug
-rwxrwxr-x 1 himanshu himanshu 27356488 Dec 18 00:20 e4zmpi
-rwxrwxr-x 1 himanshu himanshu 27949096 Dec 18 00:20 e4zshared
-rw-rw-r-- 1 himanshu himanshu 578 Dec 18 00:20 gdtk-module
-rw-rw-r-- 1 himanshu himanshu 2380 Dec 18 00:20 post.lua
-rwxrwxr-x 1 himanshu himanshu 11482 Dec 18 00:20 prep-chem
-rw-rw-r-- 1 himanshu himanshu 12416 Dec 18 00:20 prep-flow.lua
-rwxrwxr-x 1 himanshu himanshu 33776 Dec 18 00:20 prep-gas
-rw-rw-r-- 1 himanshu himanshu 4858 Dec 18 00:20 prep-grids.lua
-rwxrwxr-x 1 himanshu himanshu 5823 Dec 18 00:20 prep-kinetics
-rw-rw-r-- 1 himanshu himanshu 9678 Dec 18 00:20 prep.lua
-rwxrwxr-x 1 himanshu himanshu 4308 Dec 18 00:20 species-data-converter
-rwxrwxr-x 1 himanshu himanshu 1514456 Dec 18 00:20 ugridpartition
-rwxrwxr-x 1 himanshu himanshu 3365 Dec 18 00:20 xtdata.rb

-------------------------------------------------------------------------------------------------

2.) $ which e4shared

[himanshu@falcon1 bin]$ which e4shared
~/gdtkinst/bin/e4shared

-------------------------------------------------------------------------------------------------

3.) echo $PATH | tr ':' '\n'

/work/home/himanshu/UserPackages/ldc/bin
/export/apps/lapack/3.11
/export/apps/mpich/4.1.2/sbin
/export/apps/mpich/4.1.2/bin
/export/apps/cmake/3.28/sbin
/export/apps/cmake/3.28/bin
/export/apps/gcc/13.2/bin
/export/apps/parmetis-4.0.3/sbin
/export/apps/parmetis-4.0.3/bin
/work/intel/oneapi2022/mpi/2021.7.0/libfabric/bin
/work/intel/oneapi2022/mpi/2021.7.0/bin
/usr/lib64/qt-3.3/bin
/usr/local/bin
/usr/bin
/usr/local/sbin
/usr/sbin
/opt/ibutils/bin
/opt/pbs/bin
/export/apps/mpich/4.1.2/bin
/export/apps/mpich/4.1.2/include
/work/home/himanshu/UserPackages/ldc/bin
/work/home/himanshu/gdtkinst/bin
/work/home/himanshu/SU2/SU2-Install/bin
/work/home/himanshu/.local/bin
/work/home/himanshu/bin
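The PATH above contains both an MPICH bin directory and an Intel oneAPI MPI bin directory, so it can help to see every mpirun that is reachable and in which order. Nothing in the thread shows that this was the cause of the problem. The sketch below only illustrates the check, and list_on_path is a hypothetical helper name.

```shell
# Hypothetical helper: print every executable named "$1" found along
# PATH, in search order. The first line printed is the one the shell runs.
list_on_path() {
    local name="$1" dir
    local IFS=':'
    for dir in $PATH; do
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            echo "$dir/$name"
        fi
    done
}
```

For example, list_on_path mpirun and list_on_path e4-nk-dist show which copies are visible. A directory listed twice in PATH shows up twice in the output.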

uqngibbo commented 8 months ago

Can you find where the mpi.h file is on your machine and try

$ gcc -E -P /path/to/mpi.h > pppmpi.h

Then send me the resulting pppmpi.h file. The problem might be that the MPICH header file included with Eilmer is set up for a very specific version of MPICH, which is newer than the one you have.

hkishnani commented 8 months ago

Yes, I ran the following command: gcc -E -P /nfsroot/export/apps/mpich/4.1.2/include/mpi.h > pppmpi.h

and got the output as:

[himanshu@falcon1 ~]$ gcc -E -P /nfsroot/export/apps/mpich/4.1.2/include/mpi.h > pppmpi.h
/nfsroot/export/apps/mpich/4.1.2/include/mpi.h:985:10: fatal error: mpi_proto.h: No such file or directory
  985 | #include <mpi_proto.h>
      |          ^~~~~~~~~~~~~
compilation terminated.

The generated file is attached below: pppmpi.txt

I observed that if I switch to OpenMPI, the mpi_proto.h error does not appear. Also, the pppmpi.h file generated with the openmpi module loaded has ~1600 lines, whereas the one generated with mpich has only ~230 lines.

Thanks a lot

~ Himanshu
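The "mpi_proto.h: No such file or directory" message comes from the preprocessor: an #include written with angle brackets is searched for only in the include directories given to gcc and the system defaults, not in the directory of the header being processed. A minimal sketch of one way around it follows. It assumes mpi_proto.h sits next to mpi.h, which the thread does not confirm, and preprocess_header is a hypothetical helper name.

```shell
# Hypothetical helper: preprocess a header with its own directory added
# to the include search path, so that <...> includes of sibling headers
# can be resolved. Usage: preprocess_header /path/to/mpi.h pppmpi.h
preprocess_header() {
    local header="$1" out="$2"
    gcc -E -P -I "$(dirname "$header")" "$header" > "$out"
}
```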

uqngibbo commented 8 months ago

Okay, the short output must be because mpi_proto.h wasn't included properly.

You mention that your cluster has an openmpi module. Is it possible to build the code with that instead?

hkishnani commented 8 months ago

Yes, it is possible to build the code with it. So, I will try with openmpi and let you know.

Apart from that, it makes me think, why is Eilmer then running on my workstation with same installation flags and same MPICH version? I tried SWBLI case and successfully ran it using e4-nk-dist on my local machine...

uqngibbo commented 8 months ago

That's very surprising. It's possible that there's enough compatibility in the MPICH header file to work sometimes but not always. We don't have a lot of experience using it; we almost always prefer to use OpenMPI since it has a nice wrapper we can use.

rjgollan-on-github commented 8 months ago

Apart from that, it makes me think, why is Eilmer then running on my workstation with same installation flags and same MPICH version? I tried SWBLI case and successfully ran it using e4-nk-dist on my local machine...

Hi Himanshu, Here's my guess. On your workstation, you are not using a network communication layer. So MPICH can do its work of "message passing" via memory copies or smart ways of telling ranks where to look for the appropriate data.

On a cluster, a network layer is involved, and typically the queue manager will provide some environment to the MPICH. So I suspect the environment settings provided to MPICH differ between your workstation and what you experience on the cluster.
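One way to see such differences is to record the same few facts on the login node and inside a job, then compare the two files. This is only a sketch: env_snapshot is a hypothetical helper name, and the handful of items it records is an arbitrary choice, not a list of what MPICH actually reads.

```shell
# Hypothetical helper: print where a program and mpirun resolve to,
# plus the PATH entries, so two environments can be compared with diff.
env_snapshot() {
    local prog="$1"
    echo "host: $(uname -n)"
    echo "program: $(command -v "$prog" || echo 'not found')"
    echo "mpirun: $(command -v mpirun || echo 'not found')"
    echo "PATH entries:"
    echo "$PATH" | tr ':' '\n'
}
```

Running env_snapshot e4-nk-dist > login.txt interactively and env_snapshot e4-nk-dist > job.txt from the job script, then diff login.txt job.txt, shows whether the job sees the same executables.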

hkishnani commented 8 months ago

Oh! Now I get it. The two things are not the same. This is an insight I didn't know much about.

Also, I am currently trying OpenMPI as suggested by Nick. This could also be a job script issue, since Eilmer has been tested with MPICH as highlighted in the examples on the website. I don't have much experience writing job scripts, and I have messed some up before. I will try again today and post updates here.

Thanks a lot

hkishnani commented 8 months ago

When I run e4-nk-dist or e4-nk-dist --help, I get Permission denied.

[himanshu@falcon1 ~]$ e4-nk-dist
-bash: /work/home/himanshu/gdtkinst/bin/e4-nk-dist: Permission denied
[himanshu@falcon1 ~]$ e4-nk-dist --help
-bash: /work/home/himanshu/gdtkinst/bin/e4-nk-dist: Permission denied

At least this problem is solved now: e4-nk-dist no longer gives Permission denied, and its output is the same as on my PC. (Screenshot from 2023-12-25 11-04-37)

Also, when I run it with the command mpirun -np 8 e4-nk-dist --snapshot-start=last --job=swbli | tee -a log.txt, I get the same output as on my PC.

One thing I don't understand is how I was able to do it using MPICH. Here is the command that was used to build gdtk: make WITH_MPICH=1 WITH_E4DEBUG=1 WITH_COMPLEX_NUMBERS=1 WITH_NK=1 FLAVOUR=fast install

The reason I am surprised is that just a week ago this command gave me a linker error, so I added DLINKFLAGS=--linker='' as mentioned on Eilmer's HPC webpage. After that the linker error was gone, but e4-nk-dist gave Permission denied.

Today I tried without DLINKFLAGS; I didn't get any error, and e4-nk-dist also runs fine from the command line. Maybe the cluster is having a bit of mood swings.

hkishnani commented 8 months ago

Now I think the problem is with the job script. I will look into it and post the job script here.

Thanks a ton Nick, Rowan and Peter

hkishnani commented 8 months ago

ISSUE SOLVED!! The e4-nk-dist issue is resolved. Nick's answer that Eilmer was compatible with MPICH was consistent with what we observed on our local system as well. That made me think something was wrong with either the MPICH installation or the Eilmer installation. So I first tested MPICH as described in its installation guide, and everything worked perfectly. I then did a clean installation of gdtk, this time leaving out DLINKFLAGS. Earlier I couldn't complete the installation without DLINKFLAGS, but this time it miraculously produced no errors. I don't understand this mystery, but it happened.

After the new installation, I was able to run e4-nk-dist, which gave the same output as on my local machine, and the job script now works as intended.

The job script for the e4-nk-dist run of the SWBLI case from Eilmer's website is attached below. I have also attached a link to a folder with all the scripts for anyone who wants to run it.

Job_script3.txt

Drive Link to e4-nk-dist

I also urge the maintainers of the package to include an example of running e4-nk-dist on HPC on the website.

Thanks everyone for helping me out from this.
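The attached job script is not reproduced in this thread, so the following is only a rough sketch of what a PBS job script for this case might look like, not the script that was actually used. The #PBS resource lines, the walltime, and the helper names count_ranks and run_job are all assumptions. The 20 ranks mirror the mpirun -n 20 command quoted earlier, and PBS_NODEFILE is assumed to list one line per MPI rank.

```shell
#!/bin/bash
#PBS -N swbli
#PBS -l select=1:ncpus=20:mpiprocs=20
#PBS -l walltime=01:00:00

# Count MPI ranks from the node file PBS provides (assumed one line per
# rank); fall back to 1 when the script is run outside PBS.
count_ranks() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
        grep -c . "$PBS_NODEFILE"
    else
        echo 1
    fi
}

# Change to the directory the job was submitted from, then launch the
# solver. MPIRUN can be overridden to pick a specific launcher.
run_job() {
    cd "${PBS_O_WORKDIR:-$PWD}" || return 1
    ${MPIRUN:-mpirun} -np "$(count_ranks)" e4-nk-dist --job=swbli
}

# A real job script would end by calling: run_job
```

Taking the rank count from the node file rather than hard-coding it keeps the mpirun line consistent with whatever the #PBS resource request granted.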