HiFiLES / HiFiLES-solver

High Fidelity Large Eddy Simulation Solver

Error running HiFiLES on an MPI cluster #114

Open popstar0426 opened 8 years ago

popstar0426 commented 8 years ago

Hi: I set up an MPI cluster following https://help.ubuntu.com/community/MpichCluster (with one difference: I used OpenMPI 1.6.5 instead of MPICH2). The cluster works well; I can get output like this:

    $ mpirun -hostfile myhostfile ./mpi_hello
    Hello from processor 0 of 8
    Hello from processor 1 of 8
    Hello from processor 3 of 8
    Hello from processor 6 of 8
    Hello from processor 4 of 8
    Hello from processor 7 of 8
    Hello from processor 2 of 8
    Hello from processor 5 of 8
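
For reference, a minimal mpi_hello.c that produces output of this form could look like the sketch below. This is only an assumed reconstruction using the standard MPI C API, not necessarily the exact program from the MpichCluster guide.

    /* mpi_hello.c - assumed minimal reconstruction; build with: mpicc mpi_hello.c -o mpi_hello */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        int myrank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        /* matches the "Hello from processor X of Y" lines above */
        printf("Hello from processor %d of %d\n", myrank, nprocs);

        MPI_Finalize();
        return 0;
    }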

The hostfile myhostfile is defined as:

$ tail -fn100 myhostfile 
node00 slots=2
node01 slots=3
node02 slots=3

However, when I try to run HiFiLES the same way, it fails. The output is shown below:

$mpirun -hostfile myhostfile HiFiLES input_cylinder_visc 
--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it could not find an executable:

Executable: HiFiLES
Node: node01

while attempting to start process rank 2.
--------------------------------------------------------------------------
6 total processes failed to start

I can run HiFiLES on each node separately like this:

cluster@node00:~/HiFiLES-solver/testcases/navier-stokes/cylinder$ mpirun -host node00 -n 2 HiFiLES input_cylinder_visc

or ssh to node01 and run HiFiLES like this:

cluster@node01:~/HiFiLES-solver/testcases/navier-stokes/cylinder$  mpirun -host node01 -n 2 HiFiLES input_cylinder_visc

Do you have any idea how to run HiFiLES on all three nodes together?

Best regards! Yue

popstar0426 commented 8 years ago

Hi: I would like to add some more information. If I give the full path to the HiFiLES executable, like this:

    $ mpirun -machinefile myhostfile /home/cluster/HiFiLES-solver/HiFiLES_CPU/bin/HiFiLES input_cylinder_visc 
 __    __   __   _______  __          __       _______     _______.
|  |  |  | |  | |   ____||  |        |  |     |   ____|   /       |
|  |__|  | |  | |  |__   |  |  _____ |  |     |  |__     |   (----`
|   __   | |  | |   __|  |  | |____| |  |     |   __|     \   \
|  |  |  | |  | |  |     |  |        |  `----.|  |____.----)   |
|__|  |__| |__| |__|     |__|        |_______||_______|_______/

Aerospace Computing Laboratory (Stanford University) 

---------------------- Non-dimensionalization ---------------------
uvw_ref: 69.4256
rho_free_stream: 5.38898e-06
rho_c_ic=1
u_c_ic=1
v_c_ic=0
w_c_ic=0
mu_c_ic=0.05
my_rank=0

----------------------- Mesh Preprocessing ------------------------
reading connectivity ... 
my_rank=1
my_rank=2
my_rank=4
my_rank=5
my_rank=7
my_rank=3
my_rank=6
done reading connectivity
Before parmetis
Partitioning a graph of size 714 serially
      Setup: Max:   0.000, Sum:   0.002, Balance:   1.149
      Remap: Max:   0.001, Sum:   0.004, Balance:   1.117
      Total: Max:   0.006, Sum:   0.049, Balance:   1.005
Final   8-way Cut:     85   Balance: 1.020 
After parmetis 
reading vertices
done reading vertices
Setting up mesh connectivity
Done setting up mesh connectivity
reading boundary conditions
done reading boundary conditions

---------------- Flux Reconstruction Preprocessing ----------------
initializing elements
tris
Fatal error 'environment variable HIFILES_HOME is undefined' at ../src/cubature_1d.cpp:74
Fatal error 'environment variable HIFILES_HOME is undefined' at ../src/cubature_1d.cpp:74
Fatal error 'environment variable HIFILES_HOME is undefined' at ../src/cubature_1d.cpp:74
--------------------------------------------------------------------------
mpirun has exited due to process rank 7 with PID 7868 on
node node02 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Fatal error 'environment variable HIFILES_HOME is undefined' at ../src/cubature_1d.cpp:74
Fatal error 'environment variable HIFILES_HOME is undefined' at ../src/cubature_1d.cpp:74

The truth is that I have defined HIFILES_HOME in the .bashrc file and sourced it on node00, and the other nodes can access the same .bashrc through the NFS share. Do you have any idea what is going wrong here?

Best regards! Yue

mlopez14 commented 8 years ago

It is certainly strange. Can you check that on each node, when you type

echo $HIFILES_HOME

you get HiFiLES's main directory?

The ultimate check would be to modify the mpi_hello program you created before and include the lines

const char* HIFILES_DIR = getenv("HIFILES_HOME");   /* requires #include <stdlib.h> */
printf("%s\n", HIFILES_DIR);

in main() and look at the program's output. You should see the same directory printed as many times as the number of processes used to run ./mpi_hello. The first line is how HiFiLES captures the value of the environment variable HIFILES_HOME.

popstar0426 commented 8 years ago

Hi: If I ssh to any node, I can get the correct value of the environment variable HIFILES_HOME:

    cluster@node00:~$ echo $HIFILES_HOME
    /home/cluster/HiFiLES-solver
    cluster@node01:~$ echo $HIFILES_HOME
    /home/cluster/HiFiLES-solver
    cluster@node02:~$ echo $HIFILES_HOME
    /home/cluster/HiFiLES-solver

The mpi_hello.c has been changed as shown below:

    #include <mpi.h>
    #include <stdlib.h>   /* getenv */
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        int myrank, nprocs, namelen;
        char processor_name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Get_processor_name(processor_name, &namelen);
        printf("I am Process %d of %d on %s\n", myrank, nprocs, processor_name);
        MPI_Finalize();

        const char* HIFILES_DIR;
        HIFILES_DIR = getenv("HIFILES_HOME");
        printf("The current path is: %s\n", HIFILES_DIR);
        return 0;
    }

The output is:

    cluster@node00:~/work$ mpirun -hostfile myhostfile ./mpi_hello
    I am Process 1 of 8 on node00
    I am Process 0 of 8 on node00
    I am Process 5 of 8 on node02
    I am Process 7 of 8 on node02
    I am Process 2 of 8 on node01
    I am Process 6 of 8 on node02
    I am Process 3 of 8 on node01
    I am Process 4 of 8 on node01
    The current path is: /home/cluster/HiFiLES-solver
    The current path is: /home/cluster/HiFiLES-solver
    The current path is: (null)
    The current path is: (null)
    The current path is: (null)
    The current path is: (null)
    The current path is: (null)
    The current path is: (null)

It seems that only node00 gets the correct environment value. Do you have any idea about this problem?

Best regards! Yue

mlopez14 commented 8 years ago

A workaround is to go to that line in Global.cpp and hard-code the path to your directory there.

I think the nodes are not configured to share the bashrc file, but I don't know how you would fix that in your specific setup.

venky187 commented 7 years ago

Hi Yue/Lopez,

I am experiencing a similar kind of problem. I tried running HiFiLES on the flat-plate test case. With mpirun -n 1 and mpirun -n 2 the code works fine, but with mpirun -n 3 the solver crashes. I have attached screenshots below for your reference:

[Screenshots attached: hifiles1, hifiles2, hifiles3]

Could you please help me with this issue?

Regards, Venky

SRkumar97 commented 1 year ago

Hello! I am also facing this exact issue: the "environment variable HIFILES_HOME is undefined" error from cubature_1d.cpp when the test case is run on the cluster queue.

The same test case, however, runs without any hassle when launched from a terminal on the head node. This is a really weird problem that is eating up my time: something silly but fishy that is not getting resolved. Please help!

I have even tried adding a line before the if (HIFILES_DIR == NULL) conditional statement in cubature_1d.cpp and rebuilding the code.

Essentially, the change I made was to add the following two lines in cubature_1d.cpp, before the if() statement for HIFILES_DIR:

    const char* HIFILES_DIR = getenv("HIFILES_HOME");
    printf("%s", HIFILES_DIR);

The output shows (null) for every process in the mpirun job, before the error message is printed.

[Screenshot attached: mpirun output]

So the strange part here is that HIFILES_DIR comes out NULL when the job is submitted to the cluster queue, while the same test case runs fine on the head node even after this rebuild. Please help!

SRkumar97 commented 1 year ago

A follow-up: I have now manually hard-coded the path to HiFiLES_solver/ into HIFILES_DIR in cubature_1d.cpp, before the if() check, and rebuilt the code once again.
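
A minimal standalone sketch of that kind of fallback is shown below; this is only an assumed illustration, not the actual code in cubature_1d.cpp, and the hard-coded path is a placeholder to be replaced with the real location of HiFiLES-solver on your system.

    /* Sketch of a getenv() fallback (assumed illustration, not HiFiLES source). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char* HIFILES_DIR = getenv("HIFILES_HOME");
        if (HIFILES_DIR == NULL) {
            /* variable not visible to this process (e.g. inside a batch-queue job):
               fall back to a hard-coded placeholder path */
            HIFILES_DIR = "/path/to/HiFiLES-solver";
        }
        printf("Using HIFILES_DIR = %s\n", HIFILES_DIR);
        return 0;
    }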

Now the code runs in the main queue as well!

I am happy that I can now run in the main queue of the cluster as well as on the head node, but it is unclear why the default cubature_1d.cpp throws the fatal error only when the job is submitted to the queue.

Anyways, thanks a ton!

Ramkumar