Closed: RohanK22 closed this issue 1 month ago.
TODO(Rohan): Try with PBS script.
Let's try with number of nodes = 1, 2, 3, so we are sure all three nodes in rupesh_gpuq are working. This means np = 32, 64, 96.
@RohanK22, please update the last line of `submit.sh` to the following:
# Remove -np 64 or 96 as the resource list will take the value from select and ncpus
/path/to/mpirun -hostfile $PBS_NODEFILE "$PBS_O_WORKDIR/lfb" &> "$PBS_O_WORKDIR/out.txt"
Tested with 32, 64, and 96. (#PBS -l select={1,2,3}:ncpus=32) ✅
@rupesh0508 , can you add Robert to this repository? He has the requisite knowledge to figure this out deterministically.
Apparently, gpu024 behaves anomalously. The code runs for 32 processes (on a 40-core machine) but doesn't run for 64, because it then needs another machine to handle the overflow. If the scheduler assigns gpu024, we encounter an error.
The only deterministic way to solve a large graph is to start the process on three machines and quit the job in a specific order.
Invited Robert (johnmaxrin). Durwasa, we have three nodes in rupesh_gpuq. Can it be confirmed that the issue is with only one node and not the third one?
@durwasa-chakraborty, I updated my PBS script file and this is what it looks like now:
#!/bin/bash
#PBS -o logfile.log
#PBS -e errorfile_slash.err
#PBS -l walltime=00:60:00
#PBS -l select=3:ncpus=32
#PBS -q rupesh_gpuq
# Load required module
module load openmpi411
export PMIX_MCA_gds=hash
# mpicxx -g -std=c++17 -I/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/include -L/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib large_file.cpp /lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib/libboost_mpi.a -o lfb
cat "$PBS_NODEFILE"
/lfs/sware/openmpi411/bin/mpirun -hostfile $PBS_NODEFILE "$PBS_O_WORKDIR/lfb" &> "$PBS_O_WORKDIR/out.txt"
To check if the requested number of processes is being allocated, I added a print statement inside the `large_file.cpp` file:
std::cout << "I am process " << world.rank() << " of " << world.size() << std::endl;
On recompiling `large_file.cpp` and running it with the `submit.sh` PBS script, it looks like only three MPI processes are being allocated, based on the print statement output in `out.txt`:
I am process 0 of 3
I am process 2 of 3
I am process 1 of 3
Total sum: 1783293664
Seems like the -np argument might be required if we want to vary the number of processes.
On input from Robert, we need to specify another argument, `mpiprocs`, in the PBS script to set the number of MPI processes to request. Passing an `mpiprocs` argument to `#PBS -l select` fixes the issue. `select` specifies the number of nodes to use, and `mpiprocs` specifies the number of MPI processes to request from each node. `ncpus` is a separate argument used when running OpenMP programs and is not required for running MPI programs.
# Requesting 3 * 32 = 96 processes
#PBS -l select=3:mpiprocs=32
The updated script is as follows.
#!/bin/bash
#PBS -o logfile.log
#PBS -e errorfile_slash.err
#PBS -l walltime=00:60:00
#PBS -l select=3:mpiprocs=32
#PBS -q rupesh_gpuq
# Load required module
module load openmpi411
export PMIX_MCA_gds=hash
# mpicxx -g -std=c++17 -I/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/include -L/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib large_file.cpp /lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib/libboost_mpi.a -o lfb
cat "$PBS_NODEFILE"
/lfs/sware/openmpi411/bin/mpirun -hostfile $PBS_NODEFILE "$PBS_O_WORKDIR/lfb" &> "$PBS_O_WORKDIR/out.txt"
I tried testing with nodes = 1, 2, and 3, and it seems to be working fine without any errors.
I am process 0 of 96
I am process 1 of 96
I am process 3 of 96
I am process 4 of 96
I am process 7 of 96
I am process 9 of 96
I am process 13 of 96
I am process 17 of 96
I am process 19 of 96
I am process 26 of 96
I am process 28 of 96
I am process 29 of 96
I am process 31 of 96
I am process 35 of 96
I am process 41 of 96
I am process 42 of 96
I am process 45 of 96
I am process 49 of 96
I am process 54 of 96
I am process 55 of 96
I am process 57 of 96
I am process 59 of 96
I am process 67 of 96
I am process 70 of 96
I am process 71 of 96
I am process 72 of 96
I am process 74 of 96
I am process 75 of 96
I am process 86 of 96
I am process 87 of 96
I am process 89 of 96
I am process 2 of 96
I am process 12 of 96
I am process 14 of 96
I am process 15 of 96
I am process 16 of 96
I am process 18 of 96
I am process 21 of 96
I am process 24 of 96
I am process 27 of 96
I am process 34 of 96
I am process 36 of 96
I am process 39 of 96
I am process 44 of 96
I am process 51 of 96
I am process 53 of 96
I am process 65 of 96
I am process 66 of 96
I am process 69 of 96
I am process 73 of 96
I am process 80 of 96
I am process 85 of 96
I am process 92 of 96
I am process 6 of 96
I am process 10 of 96
I am process 20 of 96
I am process 22 of 96
I am process 30 of 96
I am process 33 of 96
I am process 37 of 96
I am process 43 of 96
I am process 48 of 96
I am process 50 of 96
I am process 52 of 96
I am process 58 of 96
I am process 62 of 96
I am process 64 of 96
I am process 77 of 96
I am process 78 of 96
I am process 81 of 96
I am process 82 of 96
I am process 83 of 96
I am process 84 of 96
I am process 90 of 96
I am process 91 of 96
I am process 93 of 96
I am process 94 of 96
I am process 95 of 96
I am process 8 of 96
I am process 25 of 96
I am process 40 of 96
I am process 46 of 96
I am process 47 of 96
I am process 60 of 96
I am process 68 of 96
I am process 76 of 96
I am process 63 of 96
I am process 88 of 96
I am process 23 of 96
I am process 56 of 96
I am process 5 of 96
I am process 11 of 96
I am process 32 of 96
I am process 79 of 96
I am process 38 of 96
I am process 61 of 96
Total sum: 1720295680
On trying to run the betweenness centrality computation on graph `/lfs1/usrscratch/phd/cs16d003/11suiteDSL/udwt_graphs/USAudwt.txt` with -np 96 across 3 nodes, the MPI program is killed after running for over 5 minutes with the following message:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node gpu024 exited on signal 9 (Killed).
--------------------------------------------------------------------------
Node gpu024.ib0.cm.aqua.iitm.ac.in might be causing this issue.
In a previous comment, I mentioned that using `-np 96` seemed to work with the simple test program. However, I just realized that the `Total sum` value computed by the program is incorrect. The correct value should be 1784293664, regardless of the choice of `-np` (as long as 1000000 is divisible by np). So something is still definitely going wrong.
Looks good, Rohan. Thanks for checking. Can we confirm whether the other two nodes are working fine? Then we can send one email to the HPCE team specifically mentioning the nodes that are not working.
I can't manually specify which nodes to use when running an MPI program, since I don't have permission to do something like the following, which requests 32 slots from gpu023 only:
/lfs/sware/openmpi316/bin/mpirun --host gpu023.ib0.cm.aqua.iitm.ac.in:32 "$PBS_O_WORKDIR/lfb" &> "$PBS_O_WORKDIR/out.txt"
I tried each node individually by running the PBS script with `select=1:mpiprocs=32:ncpus=32`. I have 11 graphs to test on, and each graph gets scheduled onto one of the nodes when I submit `select=1:mpiprocs=32:ncpus=32` for every graph. I noticed that, sporadically, graphs that run on gpu024 fail with the message:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node gpu024 exited on signal 9 (Killed).
--------------------------------------------------------------------------
Graphs that ran on the other two nodes seemed to run fine with no errors. The logs for these single node test runs can be found on Aqua at: /lfs/usrhome/oth/rnintern/scratch/rohan/gajendra-iitm/starplat/graphcode/generated_mpi/output/singlenoderuns/
I tried testing two nodes at a time by running the PBS script with `select=2:mpiprocs=32:ncpus=32`. All 11 test graphs got scheduled to run on gpu023 and gpu024, and all the test runs failed with some sort of segmentation fault on gpu024 or gpu023. This is what one of the log files looks like:
[gpu024:29733:0:29762] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc023b50760)
[gpu024:29892:0:29959] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f7c42311760)
[gpu023:3333 :0:3423] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b513f7d3760)
[gpu023:3310 :0:3403] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b31a46e0760)
[gpu023:3297 :0:3387] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b3efe1ef760)
[gpu024:29727:0:29741] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fe1ac0d1760)
[gpu024:29814:0:29875] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f65876d1760)
[gpu023:3265 :0:3349] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b99aedf3760)
[gpu023:3260 :0:3355] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba782ccd760)
[gpu024:29808:0:29866] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc16bd55760)
[gpu023:3320 :0:3405] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae8f8e98760)
[gpu024:29908:0:29982] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f80bb728760)
[gpu024:29754:0:29792] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f33c8c75760)
[gpu024:29758:0:29801] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f8a99145760)
[gpu024:29778:0:29818] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f39ff154760)
[gpu024:29744:0:29785] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fb6ea653760)
[gpu024:29731:0:29752] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fd0d71b3760)
[gpu024:29788:0:29834] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f80632e7760)
[gpu024:29803:0:29853] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fb7f022d760)
[gpu024:29732:0:29756] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f64372f9760)
[gpu024:29800:0:29852] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f11cc492760)
[gpu024:29737:0:29771] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa92fb50760)
[gpu024:29824:0:29893] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f6fe139e760)
[gpu024:29749:0:29787] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f30c7925760)
[gpu024:29734:0:29767] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7faf3991f760)
[gpu023:3347 :0:3425] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af07c7e5760)
[gpu023:3181 :0:3199] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b8af3b67760)
[gpu023:3238 :0:3298] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae6a6a6c760)
[gpu023:3280 :0:3371] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b55cdac1760)
[gpu023:3182 :0:3210] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b53b534a760)
[gpu023:3270 :0:3361] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b873479d760)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[gpu023:3187 :0:3224] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5b3a1eb760)
[gpu023:3193 :0:3234] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b70bdeeb760)
[gpu023:3186 :0:3221] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ab8a454b760)
[gpu023:3228 :0:3271] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b3ff38c4760)
[gpu023:3252 :0:3328] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2afa6b1c2760)
[gpu024:29880:0:29944] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f8e154a7760)
[gpu023:3206 :0:3251] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad21b8a8760)
[gpu023:3211 :0:3257] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae3b4aa7760)
[gpu023:3243 :0:3303] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af35d7ee760)
[gpu024:29728:0:29755] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fea96ab1760)
[gpu024:29795:0:29851] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa9e7a0f760)
[gpu024:29740:0:29775] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f4888537760)
[gpu024:29770:0:29807] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f6d8e111760)
[gpu024:29870:0:29945] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa564d66760)
[gpu024:29765:0:29804] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fda71b5a760)
[gpu024:29865:0:29929] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fe24ee31760)
[gpu024:29835:0:29927] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ff45dac5760)
[gpu024:29726:0:29739] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa2a351a760)
[gpu024:29819:0:29885] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fe40712c760)
[gpu024:29782:0:29820] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f703facd760)
[gpu024:29725:0:29738] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f5f9606a760)
--------------------------------------------------------------------------
mpirun noticed that process rank 63 with PID 29908 on node gpu024 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Please see the detailed logs at: /lfs/usrhome/oth/rnintern/scratch/rohan/gajendra-iitm/starplat/graphcode/generated_mpi/output/twonoderuns/
@durwasa-chakraborty had a better method to deterministically test the faulty node, but I feel the crude testing done here does hint that gpu024 might have some problems. I have not tested gpu023 and gpu025 together (-np 64) yet, since the PBS script does not let me do it; it handles scheduling by itself. If gpu023 works fine with gpu025, that would further strengthen the argument that gpu024 is causing the problem.
Note: This is what my PBS script looks like -
#!/bin/bash
#PBS -o logfile.log
#PBS -e errorfile_slash.err
#PBS -l walltime=00:60:00
#PBS -l select=2:mpiprocs=32:ncpus=32
#PBS -q rupesh_gpuq
# Load required module
module load openmpi316
# Set environment variable for PMIX_MCA_gds
export PMIX_MCA_gds=hash
echo "PROB: $PROB"
echo "GRAPH: $GRAPH"
echo "OUTFILE: $OUTFILE"
# Ensure the required environment variables are set
if [ -z "$PROB" ] || [ -z "$GRAPH" ] || [ -z "$OUTFILE" ]; then
echo "One or more required variables are not set. Exiting."
exit 1
fi
cat "$PBS_NODEFILE"
# Run the MPI job
/lfs/sware/openmpi316/bin/mpirun -hostfile $PBS_NODEFILE "$PROB" "$GRAPH" &> "$OUTFILE"
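For reference, a script parameterized through environment variables like this would be submitted with `qsub -v` along these lines (the graph path and output filename below are placeholders, not the actual ones used in the runs):

```shell
# Hypothetical submission: pass PROB, GRAPH, and OUTFILE into submit.sh via -v.
# The script's [ -z ... ] guard exits early if any of these is missing.
qsub -v PROB="$PWD/lfb",GRAPH=/path/to/graph.txt,OUTFILE="$PWD/out.txt" submit.sh
```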
Thanks for the detailed analysis, Rohan. Durwasa, request you to check this once. Meanwhile, I will write to the HPCE team.
@rupesh0508 ; @RohanK22
I am running the Ackermann function with values that converge after a long time, such as m=4 and n=3 (refer to the Ackermann function on Wikipedia).
int ackermann(int m, int n) {
    if (m == 0) {
        return n + 1;
    } else if (m > 0 && n == 0) {
        return ackermann(m - 1, 1);
    } else if (m > 0 && n > 0) {
        return ackermann(m - 1, ackermann(m, n - 1));
    }
    return -1; // Should not reach here
}
Once all three nodes are running the Ackermann function, I select two of them and terminate the process. I use the Ackermann function because typical long-running constructs like `while(1)` or `sleep(long_time)` are often terminated by AquaCluster (marked as 'E'), and we need a deterministic way to simulate long-running evaluations.
Once the tasks are deleted from two of the nodes, any subsequent MPI process runs with two nodes, and multiple MPI processes are assigned to the two nodes we previously terminated. This behavior is deterministic.
So far, I have attempted to write a short function that calculates the sum of a very large array using MPI reduce. The table below suggests there is no discrepancy in the results:

| GPUs | Sum | GPU Engaged |
|---|---|---|
| gpu023/0\*32 + gpu025/0\*32 | Total sum: 1784293664 | gpu024/0\*32 |
| gpu024/0\*32 + gpu025/0\*32 | Total sum: 1784293664 | gpu023/0\*32 |
| gpu023/0\*32 + gpu024/0\*32 | Total sum: 1784293664 | gpu025/0\*32 |
However, to investigate the issue and reproduce the bug consistently, I would try running on a larger dataset and update this thread if I find something anomalous.
Great! This means the three nodes are (now?) working. So we will continue with running code. Thanks to both of you for these trials and confirmation.
@rupesh0508 I am still determining whether the three nodes, particularly node 024, function correctly. The flaky behavior is not consistently reproducible. To investigate this issue thoroughly, I intend to conduct further tests using larger and more demanding graphs.
In the meantime, @RohanK22, you can apply the technique I described earlier (keep the node to exclude busy with the Ackermann job, then submit the MPI run on the remaining nodes) to run large graphs on nodes 23 and 25 and determine if the same segmentation fault occurs.
@durwasa-chakraborty Sure, I'll try running the large graphs on nodes 23 and 25 and see if it runs without errors.
At the moment, however, all graph run jobs I submit stop running and go into the Exiting state (`E`). I'm not sure why this is happening. @rupesh0508, any ideas on why this might be happening?
(base) [rnintern@aqua output]$ qstat | grep "rupesh"
1346194.hn1 bc_dslV2_sinawe rnintern 00:00:00 E rupesh_gpuq
1346198.hn1 triangle_counti rnintern 0 Q rupesh_gpuq
This is unclear to me, Rohan. I suggest trying with a smaller graph and simpler algo (TC, PR, SSSP). If it persists, please let me know. We will need to sort this out.
I tried running it with TC on the graph `/lfs1/usrscratch/phd/cs16d003/11suiteDSL/udwt_graphs/USAudwt.txt`. It still seems to exit the moment the job is submitted.
(base) [rnintern@aqua output]$ qstat | grep "rupesh"
1346687.hn1 triangle_counti rnintern 00:00:00 E rupesh_gpuq
I'll show the problem to you in-person tomorrow.
That's strange. This appears to be a different issue. Let's discuss.
With the latest development, Durwasa, could you please see if the issue can be closed?
@RohanK22 @rupesh0508
From the qstat trace:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
xxxxxx.hn1 yyyyyyy zzzzzz_g submit_tc xxxxxx 3 96 -- 01:00 R 00:00
gpu023/0*32 + gpu024/0*32 + gpu025/0*32
I implemented a sample triangle counting code using one of Stanford University's large graphs: DBLP graph.
The program utilizes Boost and MPI libraries, and I executed it across 96 processes on 3 nodes. The result:
Running on 96 processes.
Total number of triangles: 2224385
This matches the table data provided on the website. Additional tasks such as the short eval "Hello World" and the long-running Ackermann function are also successfully running across all nodes.
Marking this issue as closed and resolved.
Trying to run a simple test program requesting 64 processes fails on Aqua. The following Boost.MPI program computes the sum of all elements of an array.
Compile:
mpicxx -g -std=c++17 -I/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/include -L/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib large_file.cpp /lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib/libboost_mpi.a -o lfb
Requesting -np 64 fails, but -np 32 works.