Closed (sbryngelson closed 1 month ago)
Presumably it's failing somewhere inside one of the benchmark runs, but the current workflow doesn't provide a way to debug this. @henryleberre
@sbryngelson The workflow uploads the directory that holds all the logs. We should be able to take a look.
related to https://github.com/MFlowCode/MFC/issues/396 ?
note:
mfc: OK > :) Running simulation:
+ mpirun -np 2 --bind-to none /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/MFC/MFC/master/build/install/be2046126a/bin/simulation
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: atl1-1-01-004-31-0
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4123
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: atl1-1-01-004-31-0
Local device: mlx5_0
--------------------------------------------------------------------------
Simulating a case-optimized 158x79x79 case on 2 rank(s) with OpenACC offloading.
[ 0%] Time step 1 of 1001 @ t_step = 0
[ 1%] Time step 2 of 1001 @ t_step = 1
icfl Inf
ICFL is greater than 1.0. Exiting ...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[atl1-1-01-004-31-0.pace.gatech.edu:46183] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[atl1-1-01-004-31-0.pace.gatech.edu:46183] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-1-01-004-31-0.pace.gatech.edu:46183] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
mfc: ERROR > :( /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/MFC/MFC/master/build/install/be2046126a/bin/simulation failed with exit code 1.
Error: Submitting batch file for Interactive failed. It can be found here: /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/MFC/MFC/master/benchmarks/hypo_hll/MFC.sh. Please check the file for errors.
./mfc.sh: line 49: 45958 Terminated python3 "$(pwd)/toolchain/main.py" "$@"
mfc: ERROR > mfc.py finished with a 143 exit code.
mfc: (venv) Exiting the Python virtual environment.
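To make a log like the one above quicker to triage, here is a minimal sketch that lists the lines matching known failure markers. It assumes the uploaded logs are plain text; the marker strings are copied from this log, and `FAILURE_PATTERNS` and `find_failures` are hypothetical names, not part of MFC.

```python
import re

# Marker strings copied from the log above; illustrative, not exhaustive.
FAILURE_PATTERNS = [
    re.compile(r"ICFL is greater than 1\.0"),
    re.compile(r"MPI_ABORT was invoked"),
    re.compile(r"mfc: ERROR >"),
    re.compile(r"failed with exit code \d+"),
]


def find_failures(log_text):
    """Return (line_number, line) pairs for lines matching any failure pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# Small excerpt of the log above, used only to demonstrate the function.
sample = """\
[ 1%] Time step 2 of 1001 @ t_step = 1
icfl Inf
ICFL is greater than 1.0. Exiting ...
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
"""

for number, line in find_failures(sample):
    print(f"{number}: {line}")
```

On this log the first hit would be the ICFL check, which fires at the second time step, before the MPI abort and the toolchain errors that follow from it.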
related to https://github.com/MFlowCode/MFC/commit/de3e7a1968b748939c7a94c59bf5e9e8b53cd2a7 ?
From @henryleberre: "@sbryngelson The workflow uploads the directory that holds all the logs. We should be able to take a look."
Whoops, didn't notice this somehow.
We should be able to bisect to find the commit that caused this, then.
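For the bisect, one option is a small classifier that turns a simulation log into the exit codes `git bisect run` understands (0 for good, 125 for skip, any other code from 1 to 127 for bad). This is a sketch under assumptions: `classify_log` is a hypothetical helper, the failure string is taken from the log in this thread, and how the benchmark is built and run at each commit is left out.

```python
# Exit codes understood by `git bisect run`:
#   0 = good, 125 = skip (commit cannot be tested), other codes in 1-127 = bad.
GOOD, BAD, SKIP = 0, 1, 125


def classify_log(log_text):
    """Map the text of a simulation log to a `git bisect run` exit code."""
    if "ICFL is greater than 1.0" in log_text:
        # The failure seen in this thread: the ICFL check aborts the run.
        return BAD
    if "Time step" not in log_text:
        # The simulation never started stepping (for example, a build failure),
        # so this commit says nothing about the ICFL regression.
        return SKIP
    return GOOD


print(classify_log("icfl Inf\nICFL is greater than 1.0. Exiting ..."))
```

A wrapper script would run the failing benchmark case at each commit, pass its log to `classify_log`, and exit with the returned code so that `git bisect run` can narrow down the offending commit automatically.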
All PRs are returning N/A on benchmarking workflows. Here is an example bench-cpu.yaml file: