gajendra-iitm / starplat


mpirun -np 64 does not work on Aqua #10

Closed RohanK22 closed 1 month ago

RohanK22 commented 1 month ago

Trying to run a simple test program requesting 64 processes fails on Aqua.

The following Boost.MPI program computes the sum of all the elements of an array.

#include <boost/mpi.hpp>
#include <vector>
#include <numeric>
#include <iostream>

namespace mpi = boost::mpi;

int main(int argc, char* argv[]) {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    // Size of the large array
    const std::size_t array_size = 1000000;

    // Only the root process (rank 0) initializes the array
    std::vector<int> large_array;
    if (world.rank() == 0) {
        large_array.resize(array_size);
        // Fill the array with the values 1, 2, ..., array_size
        std::iota(large_array.begin(), large_array.end(), 1);
    }

    // Determine the size of each subarray
    std::size_t subarray_size = array_size / world.size();
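    // Note: integer division truncates here, so if array_size is not a multiple of
    // world.size(), the trailing elements are never scattered (and never summed).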
    std::vector<int> subarray(subarray_size);

    // Scatter the large array to all processes
    mpi::scatter(world, large_array, subarray.data(), subarray_size, 0);

    // Each process computes the sum of its subarray
    int local_sum = std::accumulate(subarray.begin(), subarray.end(), 0);

    int total_sum = 0;
    mpi::reduce(world, local_sum, total_sum, std::plus<int>(), 0);

    // The root process prints the total sum
    if (world.rank() == 0) {
        std::cout << "Total sum: " << total_sum << std::endl;
    }

    return 0;
}

Compile:

mpicxx -g -std=c++17 -I/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/include -L/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib large_file.cpp /lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib/libboost_mpi.a -o lfb

Requesting -np 64 fails

(base) [rnintern@aqua sandbox]$ mpirun -np 64 lfb
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 64
slots that were requested by the application:

  lfb

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

But -np 32 works.

(base) [rnintern@aqua sandbox]$ mpirun -np 32 lfb
Total sum: 1784293664
RohanK22 commented 1 month ago

TODO(Rohan): Try with PBS script.

rupesh0508 commented 1 month ago

Let's try with the number of nodes = 1, 2, and 3, so we are sure all three nodes in rupesh_gpuq are working. This means np = 32, 64, 96.

durwasa-chakraborty commented 1 month ago

Update:

@RohanK22 , please update the last line of the submit.sh to the following:

# Remove -np 64 or 96 as the resource list will take the value from select and ncpus
/path/to/mpirun -hostfile $PBS_NODEFILE "$PBS_O_WORKDIR/lfb" &> "$PBS_O_WORKDIR/out.txt"

Tested with 32, 64, and 96. (#PBS -l select={1,2,3}:ncpus=32) ✅

@rupesh0508 , can you add Robert to this repository? He has the requisite knowledge to figure this out deterministically.

Apparently, gpu024 behaves anomalously. The code runs with 32 processes (on a 40-core machine) but not with 64, because the job then needs a second machine to handle the overflow; if the scheduler assigns gpu024, we encounter an error.

The only deterministic way to run a large graph is to start the process on three machines and quit the jobs in a specific order.

rupesh0508 commented 1 month ago

Invited Robert (johnmaxrin). Durwasa, we have three nodes in rupesh_gpuq. Can we confirm that the issue is with only one node and not with the others?

RohanK22 commented 1 month ago

@durwasa-chakraborty, I updated my PBS script file and this is what it looks like now:

#!/bin/bash
#PBS -o logfile.log
#PBS -e errorfile_slash.err
#PBS -l walltime=00:60:00
#PBS -l select=3:ncpus=32
#PBS -q rupesh_gpuq

# Load required module
module load openmpi411

export PMIX_MCA_gds=hash

# mpicxx -g -std=c++17 -I/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/include -L/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib large_file.cpp /lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib/libboost_mpi.a -o lfb

cat "$PBS_NODEFILE"

/lfs/sware/openmpi411/bin/mpirun -hostfile $PBS_NODEFILE "$PBS_O_WORKDIR/lfb" &> "$PBS_O_WORKDIR/out.txt"

To check whether the requested number of processes is being allocated, I added a print statement to large_file.cpp:

 std::cout << "I am process " << world.rank() << " of " << world.size() << std::endl;

On recompiling large_file.cpp and running it with the submit.sh PBS script, it looks like only three MPI processes are being allocated, based on the print-statement output in out.txt:

I am process 0 of 3
I am process 2 of 3
I am process 1 of 3
Total sum: 1783293664

Seems like the -np argument might be required if we want to vary the number of processes.

RohanK22 commented 1 month ago

Based on input from Robert, we need to specify an additional argument, mpiprocs, in the PBS script to set the number of MPI processes to request.

Passing an mpiprocs argument to #PBS -l select fixes the issue. select specifies the number of nodes (chunks) to use, and mpiprocs specifies the number of MPI processes to request from each of them. ncpus is a separate argument used when running OpenMP programs and is not required for running MPI programs.

# Requesting 3 * 32 = 96 processes
#PBS -l select=3:mpiprocs=32

The updated script is as follows.

#!/bin/bash
#PBS -o logfile.log
#PBS -e errorfile_slash.err
#PBS -l walltime=00:60:00
#PBS -l select=3:mpiprocs=32
#PBS -q rupesh_gpuq
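# select=3:mpiprocs=32 above -> 3 chunks x 32 MPI ranks per chunk = 96 MPI processes in total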

# Load required module
module load openmpi411

export PMIX_MCA_gds=hash

# mpicxx -g -std=c++17 -I/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/include -L/lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib large_file.cpp /lfs/usrhome/oth/rnintern/scratch/MPI_Comparison/boost/install_dir/lib/libboost_mpi.a -o lfb

cat "$PBS_NODEFILE"

/lfs/sware/openmpi411/bin/mpirun -hostfile $PBS_NODEFILE "$PBS_O_WORKDIR/lfb" &> "$PBS_O_WORKDIR/out.txt"

I tried testing with nodes = 1, 2, and 3, and it seems to be working fine without any errors.

I am process 0 of 96
I am process 1 of 96
I am process 3 of 96
I am process 4 of 96
I am process 7 of 96
I am process 9 of 96
I am process 13 of 96
I am process 17 of 96
I am process 19 of 96
I am process 26 of 96
I am process 28 of 96
I am process 29 of 96
I am process 31 of 96
I am process 35 of 96
I am process 41 of 96
I am process 42 of 96
I am process 45 of 96
I am process 49 of 96
I am process 54 of 96
I am process 55 of 96
I am process 57 of 96
I am process 59 of 96
I am process 67 of 96
I am process 70 of 96
I am process 71 of 96
I am process 72 of 96
I am process 74 of 96
I am process 75 of 96
I am process 86 of 96
I am process 87 of 96
I am process 89 of 96
I am process 2 of 96
I am process 12 of 96
I am process 14 of 96
I am process 15 of 96
I am process 16 of 96
I am process 18 of 96
I am process 21 of 96
I am process 24 of 96
I am process 27 of 96
I am process 34 of 96
I am process 36 of 96
I am process 39 of 96
I am process 44 of 96
I am process 51 of 96
I am process 53 of 96
I am process 65 of 96
I am process 66 of 96
I am process 69 of 96
I am process 73 of 96
I am process 80 of 96
I am process 85 of 96
I am process 92 of 96
I am process 6 of 96
I am process 10 of 96
I am process 20 of 96
I am process 22 of 96
I am process 30 of 96
I am process 33 of 96
I am process 37 of 96
I am process 43 of 96
I am process 48 of 96
I am process 50 of 96
I am process 52 of 96
I am process 58 of 96
I am process 62 of 96
I am process 64 of 96
I am process 77 of 96
I am process 78 of 96
I am process 81 of 96
I am process 82 of 96
I am process 83 of 96
I am process 84 of 96
I am process 90 of 96
I am process 91 of 96
I am process 93 of 96
I am process 94 of 96
I am process 95 of 96
I am process 8 of 96
I am process 25 of 96
I am process 40 of 96
I am process 46 of 96
I am process 47 of 96
I am process 60 of 96
I am process 68 of 96
I am process 76 of 96
I am process 63 of 96
I am process 88 of 96
I am process 23 of 96
I am process 56 of 96
I am process 5 of 96
I am process 11 of 96
I am process 32 of 96
I am process 79 of 96
I am process 38 of 96
I am process 61 of 96
Total sum: 1720295680
RohanK22 commented 1 month ago

On trying to run betweenness centrality computation on graph /lfs1/usrscratch/phd/cs16d003/11suiteDSL/udwt_graphs/USAudwt.txt with -np 96 across 3 nodes, the MPI program is killed after running for over 5 minutes with the following message:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node gpu024 exited on signal 9 (Killed).
--------------------------------------------------------------------------

Node gpu024.ib0.cm.aqua.iitm.ac.in might be causing this issue.

In a previous comment, I mentioned that using -np 96 seemed to work with the simple test program. However, I just realized that the total sum computed by the program is incorrect. The correct value should be 1784293664, regardless of the choice of -np (as long as 1000000 is divisible by np). So something is still definitely going wrong.
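
For what it's worth, 96 does not divide 1000000: with integer division, 1000000 / 96 = 10416, so only 96 * 10416 = 999936 elements are scattered and the last 64 are dropped. A small standalone check (plain C++, no MPI, added here purely as an illustration; it assumes the int accumulation in the test program wraps modulo 2^32, which matches the observed outputs) reproduces the printed totals for each process count:

#include <cstdint>
#include <iostream>

// Standalone sanity check (no MPI). The scatter above uses
// subarray_size = array_size / world.size(); the integer division drops any
// remainder, so only (array_size / np) * np elements are ever summed.
int main() {
    const std::uint64_t array_size = 1000000;
    for (std::uint64_t np : {3, 32, 64, 96}) {
        std::uint64_t used = (array_size / np) * np;                // elements actually scattered
        std::uint64_t exact = used * (used + 1) / 2;                // exact sum of 1..used
        std::uint32_t wrapped = static_cast<std::uint32_t>(exact);  // 32-bit wrap-around
        std::cout << "np=" << np << "  scattered=" << used
                  << "  expected printed sum=" << wrapped << "\n";
    }
    return 0;
}

This prints 1783293664 for np=3, 1784293664 for np=32 and np=64, and 1720295680 for np=96, matching the totals reported above, so the differing sums appear to come from the test program's own truncation rather than from a faulty node.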

rupesh0508 commented 1 month ago

Looks good, Rohan. Thanks for checking. Can we confirm that the other two nodes are working fine? Then we can send an email to the HPCE team specifically mentioning the nodes that are not working.

RohanK22 commented 1 month ago

I can't manually specify which nodes to use when running an MPI program, since I don't have the permissions to do something like this (requesting 32 slots from gpu023 only):

/lfs/sware/openmpi316/bin/mpirun --host gpu023.ib0.cm.aqua.iitm.ac.in:32 "$PBS_O_WORKDIR/lfb" &> "$PBS_O_WORKDIR/out.txt"

I tried each node individually by running the PBS script with select=1:mpiprocs=32:ncpus=32. I have 11 graphs to test on, and each of these graphs gets scheduled to run on one of the nodes when I do select=1:mpiprocs=32:ncpus=32 for every graph. I noticed that, sporadically, graphs that run on gpu024 fail with the message:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node gpu024 exited on signal 9 (Killed).
--------------------------------------------------------------------------

Graphs that ran on the other two nodes seemed to run fine with no errors. The logs for these single node test runs can be found on Aqua at: /lfs/usrhome/oth/rnintern/scratch/rohan/gajendra-iitm/starplat/graphcode/generated_mpi/output/singlenoderuns/

I tried testing two nodes at a time by running the PBS script with select=2:mpiprocs=32:ncpus=32. All 11 test graphs got scheduled to run on gpu023 and gpu024, and all of the test runs failed with some sort of segmentation fault on gpu024 or gpu023. This is what one of the log files looks like:

[gpu024:29733:0:29762] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc023b50760)
[gpu024:29892:0:29959] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f7c42311760)
[gpu023:3333 :0:3423] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b513f7d3760)
[gpu023:3310 :0:3403] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b31a46e0760)
[gpu023:3297 :0:3387] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b3efe1ef760)
[gpu024:29727:0:29741] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fe1ac0d1760)
[gpu024:29814:0:29875] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f65876d1760)
[gpu023:3265 :0:3349] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b99aedf3760)
[gpu023:3260 :0:3355] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ba782ccd760)
[gpu024:29808:0:29866] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc16bd55760)
[gpu023:3320 :0:3405] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae8f8e98760)
[gpu024:29908:0:29982] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f80bb728760)
[gpu024:29754:0:29792] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f33c8c75760)
[gpu024:29758:0:29801] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f8a99145760)
[gpu024:29778:0:29818] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f39ff154760)
[gpu024:29744:0:29785] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fb6ea653760)
[gpu024:29731:0:29752] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fd0d71b3760)
[gpu024:29788:0:29834] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f80632e7760)
[gpu024:29803:0:29853] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fb7f022d760)
[gpu024:29732:0:29756] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f64372f9760)
[gpu024:29800:0:29852] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f11cc492760)
[gpu024:29737:0:29771] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa92fb50760)
[gpu024:29824:0:29893] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f6fe139e760)
[gpu024:29749:0:29787] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f30c7925760)
[gpu024:29734:0:29767] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7faf3991f760)
[gpu023:3347 :0:3425] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af07c7e5760)
[gpu023:3181 :0:3199] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b8af3b67760)
[gpu023:3238 :0:3298] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae6a6a6c760)
[gpu023:3280 :0:3371] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b55cdac1760)
[gpu023:3182 :0:3210] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b53b534a760)
[gpu023:3270 :0:3361] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b873479d760)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[gpu023:3187 :0:3224] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5b3a1eb760)
[gpu023:3193 :0:3234] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b70bdeeb760)
[gpu023:3186 :0:3221] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ab8a454b760)
[gpu023:3228 :0:3271] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b3ff38c4760)
[gpu023:3252 :0:3328] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2afa6b1c2760)
[gpu024:29880:0:29944] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f8e154a7760)
[gpu023:3206 :0:3251] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ad21b8a8760)
[gpu023:3211 :0:3257] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2ae3b4aa7760)
[gpu023:3243 :0:3303] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2af35d7ee760)
[gpu024:29728:0:29755] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fea96ab1760)
[gpu024:29795:0:29851] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa9e7a0f760)
[gpu024:29740:0:29775] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f4888537760)
[gpu024:29770:0:29807] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f6d8e111760)
[gpu024:29870:0:29945] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa564d66760)
[gpu024:29765:0:29804] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fda71b5a760)
[gpu024:29865:0:29929] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fe24ee31760)
[gpu024:29835:0:29927] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ff45dac5760)
[gpu024:29726:0:29739] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa2a351a760)
[gpu024:29819:0:29885] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fe40712c760)
[gpu024:29782:0:29820] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f703facd760)
[gpu024:29725:0:29738] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f5f9606a760)
--------------------------------------------------------------------------
mpirun noticed that process rank 63 with PID 29908 on node gpu024 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Please see the detailed logs at: /lfs/usrhome/oth/rnintern/scratch/rohan/gajendra-iitm/starplat/graphcode/generated_mpi/output/twonoderuns/

@durwasa-chakraborty had a better method to deterministically test the faulty node, but I feel the crude testing done here does hint that gpu024 might have some problems. I have not tested gpu023 and gpu025 together (-np 64) yet, since the PBS script does not let me choose nodes; it handles scheduling by itself. If gpu023 works fine with gpu025, that would further strengthen the argument that gpu024 is causing the problem.

Note: this is what my PBS script looks like:

#!/bin/bash
#PBS -o logfile.log
#PBS -e errorfile_slash.err
#PBS -l walltime=00:60:00
#PBS -l select=2:mpiprocs=32:ncpus=32
#PBS -q rupesh_gpuq

# Load required module
module load openmpi316

# Set environment variable for PMIX_MCA_gds
export PMIX_MCA_gds=hash

echo "PROB: $PROB"
echo "GRAPH: $GRAPH"
echo "OUTFILE: $OUTFILE"

# Ensure the required environment variables are set
if [ -z "$PROB" ] || [ -z "$GRAPH" ] || [ -z "$OUTFILE" ]; then
  echo "One or more required variables are not set. Exiting."
  exit 1
fi

cat "$PBS_NODEFILE"

# Run the MPI job
/lfs/sware/openmpi316/bin/mpirun -hostfile $PBS_NODEFILE "$PROB" "$GRAPH" &> "$OUTFILE"
rupesh0508 commented 1 month ago

Thanks for the detailed analysis, Rohan. Durwasa, request you to check this once. Meanwhile, I will write to the HPCE team.

durwasa-chakraborty commented 1 month ago

@rupesh0508 ; @RohanK22

I am running the Ackermann function with values that converge after a long time, such as m=4 and n=3 (refer to the Ackermann function on Wikipedia).

int ackermann(int m, int n) {
    if (m == 0) {
        return n + 1;
    } else if (m > 0 && n == 0) {
        return ackermann(m - 1, 1);
    } else if (m > 0 && n > 0) {
        return ackermann(m - 1, ackermann(m, n - 1));
    }
    return -1; // Should not reach here
}

Once all three nodes are running the Ackermann function, I select two of them and terminate the process. I use the Ackermann function because typical functions like while(1) or sleep(long_time) are often terminated by AquaCluster (marked as 'E'), and we need a deterministic way to simulate long-running evaluations.

Once the tasks are deleted from two of the nodes, any subsequently submitted MPI job runs on two nodes, and its MPI processes are assigned to the two nodes whose tasks we just terminated. This behavior is deterministic.

So far, I have written a short program that computes the sum of a very large array using an MPI reduce. The table below suggests there is no discrepancy in the results:

GPUs                         Sum                     GPU engaged
gpu023/0*32 + gpu025/0*32    Total sum: 1784293664   gpu024/0*32
gpu024/0*32 + gpu025/0*32    Total sum: 1784293664   gpu023/0*32
gpu023/0*32 + gpu024/0*32    Total sum: 1784293664   gpu025/0*32

However, to investigate the issue and reproduce the bug consistently, I will try running on a larger dataset and will update this thread if I find anything anomalous.

rupesh0508 commented 1 month ago

Great! This means the three nodes are (now?) working. So we will continue with running code. Thanks to both of you for these trials and confirmation.

durwasa-chakraborty commented 1 month ago

@rupesh0508 I am still trying to determine whether the three nodes, particularly node 024, are functioning properly. The flaky behavior is not consistently reproducible yet, and I need a more reliable way to trigger it. To investigate this issue thoroughly, I intend to run further tests using larger and more demanding graphs.

In the meantime, @RohanK22, you can apply the technique I described earlier (keep one node busy with the long-running Ackermann job, then submit the two-node MPI job) to run large graphs on nodes 23 and 25 and determine whether the same segmentation fault occurs.

RohanK22 commented 1 month ago

@durwasa-chakraborty Sure, I'll try running the large graphs on nodes 23 and 25 and see if it runs without errors.

At the moment, however, all the graph jobs I submit stop running and go into the Exiting state E. I'm not sure why this is happening. @rupesh0508, any ideas on why this might be happening?

(base) [rnintern@aqua output]$ qstat | grep "rupesh"
1346194.hn1       bc_dslV2_sinawe  rnintern          00:00:00 E rupesh_gpuq     
1346198.hn1       triangle_counti  rnintern                 0 Q rupesh_gpuq 
rupesh0508 commented 1 month ago

This is unclear to me, Rohan. I suggest trying with a smaller graph and simpler algo (TC, PR, SSSP). If it persists, please let me know. We will need to sort this out.

RohanK22 commented 1 month ago

I tried running TC on the graph /lfs1/usrscratch/phd/cs16d003/11suiteDSL/udwt_graphs/USAudwt.txt. The job still seems to exit the moment it is submitted.

(base) [rnintern@aqua output]$ qstat | grep "rupesh"
1346687.hn1       triangle_counti  rnintern          00:00:00 E rupesh_gpuq     

I'll show the problem to you in-person tomorrow.

rupesh0508 commented 1 month ago

That's strange. This appears to be a different issue. Let's discuss.

rupesh0508 commented 1 month ago

With the latest development, Durwasa, could you please see if the issue can be closed?

durwasa-chakraborty commented 1 month ago

@RohanK22 @rupesh0508

From the qstat output:

                                                                                                                 Req'd   Req'd    Elap
Job ID           Username  Queue     Jobname    SessID  NDS  TSK  Memory  Time   S Time
---------------  --------  --------  ---------- ------  ---  ---  ------  -----  - -----
xxxxxx.hn1       yyyyyyy   zzzzzz_g  submit_tc  xxxxxx   3    96    --     01:00  R 00:00
   gpu023/0*32 + gpu024/0*32 + gpu025/0*32

I implemented a sample triangle-counting code using one of the large graphs from Stanford's SNAP collection: the DBLP graph.

The program utilizes Boost and MPI libraries, and I executed it across 96 processes on 3 nodes. The result:

Running on 96 processes.
Total number of triangles: 2224385

This matches the triangle count listed on the website. Additional tasks, such as the short "Hello World" check and the long-running Ackermann function, also run successfully across all nodes.
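
For reference, below is a minimal sketch of one way such a Boost.MPI triangle counter could be written; it is an illustration, not the code used above. It assumes a SNAP-style edge list (lines starting with '#' are comments, every other line is an undirected edge "u v"), that the whole graph fits in memory on every rank, and that vertex ids are non-negative integers. Each triangle is counted exactly once, by the rank that owns its smallest vertex (vertex id modulo the number of ranks).

#include <boost/mpi.hpp>
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

namespace mpi = boost::mpi;

int main(int argc, char* argv[]) {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    if (argc < 2) {
        if (world.rank() == 0) std::cerr << "usage: tc <edge-list-file>" << std::endl;
        return 1;
    }

    // Every rank reads the whole edge list (acceptable for a graph of DBLP's size).
    std::unordered_map<long, std::vector<long>> adj;
    std::ifstream in(argv[1]);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;   // skip SNAP comment lines
        std::istringstream ss(line);
        long u, v;
        if (!(ss >> u >> v) || u == v) continue;
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    for (auto& kv : adj) {                              // sort and de-duplicate neighbours
        std::sort(kv.second.begin(), kv.second.end());
        kv.second.erase(std::unique(kv.second.begin(), kv.second.end()), kv.second.end());
    }

    // Each rank counts the triangles u < v < w whose smallest vertex u it owns.
    std::uint64_t local = 0;
    for (const auto& kv : adj) {
        long u = kv.first;
        if (u % world.size() != world.rank()) continue;
        const std::vector<long>& nu = kv.second;
        for (std::size_t i = 0; i < nu.size(); ++i) {
            if (nu[i] <= u) continue;                   // keep only neighbours v > u
            const std::vector<long>& nv = adj.at(nu[i]);
            for (std::size_t j = i + 1; j < nu.size(); ++j) {
                // nu is sorted, so nu[j] > nu[i]; check whether edge (v, w) exists
                if (std::binary_search(nv.begin(), nv.end(), nu[j])) ++local;
            }
        }
    }

    std::uint64_t total = 0;
    mpi::reduce(world, local, total, std::plus<std::uint64_t>(), 0);
    if (world.rank() == 0) {
        std::cout << "Running on " << world.size() << " processes." << std::endl;
        std::cout << "Total number of triangles: " << total << std::endl;
    }
    return 0;
}

Reading the full edge list on every rank keeps the sketch simple; the work is split only by which ranks own which smallest vertices.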

Marking this issue as closed and resolved.