MFlowCode / MFC

Exascale simulation of multiphase/physics fluid dynamics
https://mflowcode.github.io
MIT License

Benchmarking in CI is broken #419

Closed: sbryngelson closed this issue 1 month ago

sbryngelson commented 1 month ago

All PRs are returning N/A in the benchmarking workflows.

Here is an example bench-cpu.yaml file (see the note after the listing):

cases:
  5eq_rk3_weno3_hllc:
    description:
      args:
      - -c
      - phoenix
      - -n
      - '4'
      path: /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/benchmarks/5eq_rk3_weno3_hllc/case.py
      slug: 5eq_rk3_weno3_hllc
    output_summary:
      invocation:
      - run
      - /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/benchmarks/5eq_rk3_weno3_hllc/case.py
      - '1'
      - --case-optimization
      - --targets
      - pre_process
      - simulation
      - post_process
      - --output-summary
      - /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/build/benchmarks/a596/5eq_rk3_weno3_hllc.yaml
      - -c
      - phoenix
      - -n
      - '4'
      lock:
        debug: false
        gpu: false
        mpi: true
      pre_process: 1
      syscheck: 1
  hypo_hll:
    description:
      args:
      - -c
      - phoenix
      - -n
      - '4'
      path: /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/benchmarks/hypo_hll/case.py
      slug: hypo_hll
    output_summary:
      invocation:
      - run
      - /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/benchmarks/hypo_hll/case.py
      - '1'
      - --case-optimization
      - --targets
      - pre_process
      - simulation
      - post_process
      - --output-summary
      - /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/build/benchmarks/a596/hypo_hll.yaml
      - -c
      - phoenix
      - -n
      - '4'
      lock:
        debug: false
        gpu: false
        mpi: true
      pre_process: 1
      syscheck: 0
  ibm:
    description:
      args:
      - -c
      - phoenix
      - -n
      - '4'
      path: /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/benchmarks/ibm/case.py
      slug: ibm
    output_summary:
      invocation:
      - run
      - /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/benchmarks/ibm/case.py
      - '1'
      - --case-optimization
      - --targets
      - pre_process
      - simulation
      - post_process
      - --output-summary
      - /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/build/benchmarks/a596/ibm.yaml
      - -c
      - phoenix
      - -n
      - '4'
      lock:
        debug: false
        gpu: false
        mpi: true
      pre_process: 1
      syscheck: 1
  viscous_weno5_sgb_mono:
    description:
      args:
      - -c
      - phoenix
      - -n
      - '4'
      path: /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/benchmarks/viscous_weno5_sgb_mono/case.py
      slug: viscous_weno5_sgb_mono
    output_summary:
      invocation:
      - run
      - /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/benchmarks/viscous_weno5_sgb_mono/case.py
      - '1'
      - --case-optimization
      - --targets
      - pre_process
      - simulation
      - post_process
      - --output-summary
      - /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/build/benchmarks/a596/viscous_weno5_sgb_mono.yaml
      - -c
      - phoenix
      - -n
      - '4'
      lock:
        debug: false
        gpu: false
        mpi: true
      pre_process: 2
      syscheck: 0
metadata:
  invocation:
  - bench
  - --mem
  - '1'
  - -j
  - '24'
  - -o
  - bench-cpu.yaml
  - --
  - -c
  - phoenix
  - -n
  - '4'
  lock:
    debug: false
    gpu: false
    mpi: true
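
One thing to notice in the summary above: every case's output_summary records a pre_process timing and a syscheck flag, but none of them contains a simulation entry. If the benchmark comparison divides a master timing by a PR timing per target, a missing simulation number on either side leaves nothing to compare, so the workflow can only report N/A. A minimal sketch of that failure mode, assuming the file format shown above (the script is illustrative, not MFC's actual bench tooling):

import yaml  # PyYAML

with open("bench-cpu.yaml") as f:
    bench = yaml.safe_load(f)

for slug, case in bench["cases"].items():
    summary = case["output_summary"]
    # Each case records pre_process (and a syscheck flag) but no
    # simulation timing, so a per-target speedup has nothing to
    # divide and the comparison falls back to N/A.
    print(slug, "simulation" in summary)  # -> False for every case

That points at the simulation target dying before it can report a time, which matches the log further down.
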
sbryngelson commented 1 month ago

Presumably it's failing somewhere inside one of the benchmark runs, but the current workflow doesn't give us any way to debug it @henryleberre

henryleberre commented 1 month ago

@sbryngelson The workflow uploads the directory that holds all the logs. We should be able to take a look.
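
For anyone sifting through that artifact, a quick way to surface the first failure per log is to search for the toolchain's "mfc: ERROR" prefix (visible in the transcript further down). A sketch, where both the logs/ directory and the *.out suffix are assumptions about the artifact's layout:

from pathlib import Path

# Print the first "mfc: ERROR" line from each log file in the
# unpacked CI artifact (paths and extensions are guesses).
for log in sorted(Path("logs").rglob("*.out")):
    for line in log.read_text(errors="replace").splitlines():
        if "mfc: ERROR" in line:
            print(f"{log}: {line.strip()}")
            break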

sbryngelson commented 1 month ago

Related to https://github.com/MFlowCode/MFC/issues/396?

sbryngelson commented 1 month ago

note:

mfc: OK > :) Running simulation:

+ mpirun -np 2 --bind-to none /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/MFC/MFC/master/build/install/be2046126a/bin/simulation
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            atl1-1-01-004-31-0
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   atl1-1-01-004-31-0
  Local device: mlx5_0
--------------------------------------------------------------------------
 Simulating a case-optimized 158x79x79 case on 2 rank(s) with OpenACC offloading.
 [  0%]  Time step        1 of 1001 @ t_step = 0
 [  1%]  Time step        2 of 1001 @ t_step = 1
 icfl                       Inf
 ICFL is greater than 1.0. Exiting ...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[atl1-1-01-004-31-0.pace.gatech.edu:46183] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[atl1-1-01-004-31-0.pace.gatech.edu:46183] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[atl1-1-01-004-31-0.pace.gatech.edu:46183] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init

mfc: ERROR > :( /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/MFC/MFC/master/build/install/be2046126a/bin/simulation failed with exit code 1.

Error: Submitting batch file for Interactive failed. It can be found here: /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/MFC/MFC/master/benchmarks/hypo_hll/MFC.sh. Please check the file for errors.

./mfc.sh: line 49: 45958 Terminated              python3 "$(pwd)/toolchain/main.py" "$@"

mfc: ERROR > mfc.py finished with a 143 exit code.
mfc: (venv) Exiting the Python virtual environment.
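
Two things stand out here. The icfl line reports the inviscid CFL number, which MFC checks against 1.0 for explicit time-stepper stability; an Inf at time step 2 points to a blown-up state (Inf/NaN velocity or sound speed) rather than an overly aggressive dt. Roughly the shape of the check that is firing, written as a Python sketch since the actual guard lives in MFC's Fortran simulation code (variable names here are hypothetical):

import math

def check_icfl(u_max: float, c_max: float, dt: float, dx: float) -> None:
    # Inviscid CFL: cell widths traveled by the fastest wave per step.
    icfl = dt * (abs(u_max) + c_max) / dx
    if not math.isfinite(icfl) or icfl > 1.0:
        # A non-finite icfl (Inf or NaN) is caught here too, so a
        # diverged state and a too-large dt abort with the same message.
        raise RuntimeError(f"ICFL = {icfl} is greater than 1.0. Exiting ...")

The second thing: after the abort, the batch submission for hypo_hll fails and the toolchain itself is Terminated (exit code 143, i.e. SIGTERM), which would explain why the workflow ends with no summary to report.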

Related to https://github.com/MFlowCode/MFC/commit/de3e7a1968b748939c7a94c59bf5e9e8b53cd2a7?

from @henryleberre

sbryngelson commented 1 month ago

> @sbryngelson The workflow uploads the directory that holds all the logs. We should be able to take a look.

Whoops, didn't notice this somehow.

sbryngelson commented 1 month ago

We should be able to bisect the commit that caused this, then.
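
A minimal way to automate that: write a check script that builds and runs one failing case, then hand it to git bisect run. Everything below is a sketch under assumptions: that the hypo_hll case reproduces the ICFL failure outside CI, and that the flags mirrored from the bench-cpu.yaml invocation above apply on the machine doing the bisect (drop -c phoenix off the cluster).

#!/usr/bin/env python3
# bisect_check.py (hypothetical): exit 0 when the case passes, 1 when it
# fails, so `git bisect run python3 bisect_check.py` can walk from the
# last green commit to the one that broke benchmarking.
import subprocess
import sys

CASE = "benchmarks/hypo_hll/case.py"  # the case that aborts in the log above

result = subprocess.run(
    ["./mfc.sh", "run", CASE, "--case-optimization",
     "--targets", "pre_process", "simulation", "-n", "4"],
)
# Clamp to 1: git bisect run reserves exit codes above 127, and a
# SIGTERM'd run (143, as in the log) would otherwise abort the bisect.
sys.exit(0 if result.returncode == 0 else 1)

With that in place: git bisect start, git bisect bad HEAD, git bisect good <last-green-sha>, then git bisect run python3 bisect_check.py.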