MFlowCode / MFC

Exascale simulation of multiphase/physics fluid dynamics
https://mflowcode.github.io
MIT License

Fix benchmark divide by zero with proper error message #394

Closed sbryngelson closed 1 month ago

sbryngelson commented 3 months ago

We sometimes get a divide-by-zero error in the benchmark diff, and I'm not sure why it happens (e.g., https://github.com/MFlowCode/MFC/actions/runs/8610942830/job/23597260867?pr=285):

.=++*:          -+*+=.        | sbryngelson3@login-phoenix-slurm-2.pace.gatech.edu [Linux]
     :+   -*-        ==   =* .      | ----------------------------------------------------------
   :*+      ==      ++    .+-       | --jobs 1
  :*##-.....:*+   .#%+++=--+=:::.   | --mpi
  -=-++-======#=--**+++==+*++=::-:. | --gpu
 .:++=----------====+*= ==..:%..... | --no-debug
  .:-=++++===--==+=-+=   +.  :=     | --targets pre_process, simulation, and post_process
  +#=::::::::=%=. -+:    =+   *:    | ----------------------------------------------------------
 .*=-=*=..    :=+*+:      -...--    | $ ./mfc.sh (build, run, test, clean, count, packer) --help

 Comparing Bencharks: master/bench-gpu.yaml is x times slower than pr/bench-gpu.yaml.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/ │
│ MFC/MFC/toolchain/main.py:65 in <module>                                     │
│                                                                              │
│   62 │   │                                                                   │
│   63 │   │   __print_greeting()                                              │
│   64 │   │   __checks()                                                      │
│ ❱ 65 │   │   __run()                                                         │
│   66 │                                                                       │
│   67 │   except MFCException as exc:                                         │
│   68 │   │   cons.reset()                                                    │
│                                                                              │
│ /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/ │
│ MFC/MFC/toolchain/main.py:50 in __run                                        │
│                                                                              │
│   47                                                                         │
│   48                                                                         │
│   49 def __run():                                                            │
│ ❱ 50 │   {"test":   test.test,     "run":        run.run,          "build":  │
│   51 │    "clean":  build.clean,   "bench":      bench.bench,      "count":  │
│   52 │    "packer": packer.packer, "count_diff": count.count_diff, "bench_di │
│   53 │   }[ARG("command")]()                                                 │
│                                                                              │
│ /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-4/_work/ │
│ MFC/MFC/toolchain/mfc/bench.py:119 in diff                                   │
│                                                                              │
│   116 │   │   │   if target.name not in lhs_summary or target.name not in rh │
│   117 │   │   │   │   continue                                               │
│   118 │   │   │                                                              │
│ ❱ 119 │   │   │   speedups[i] = f"{lhs_summary[target.name] / rhs_summary[ta │
│   120 │   │                                                                  │
│   121 │   │   table.add_row(f"[magenta]{slug}[/magenta]", *speedups)         │
│   122                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
ZeroDivisionError: division by zero

ERROR: An unexpected exception occurred: division by zero

./mfc.sh: line 49: 239801 Terminated              python3 "$(pwd)/toolchain/main.py" "$@"

Part of the fix is to raise a proper Python exception if either lhs_summary[target.name] or rhs_summary[target.name] is zero.
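A minimal sketch of what such a guard could look like, not the actual fix. `BenchDiffError` and `speedup` are hypothetical names introduced here for illustration (the toolchain has its own `MFCException`, which would presumably be used instead), and the output format is an assumption because the format string in the traceback is truncated.

```python
class BenchDiffError(Exception):
    """Hypothetical stand-in for the toolchain's own exception type."""


def speedup(lhs_summary: dict, rhs_summary: dict, name: str) -> str:
    """Return lhs/rhs for one target as a string, or raise a descriptive error."""
    lhs, rhs = lhs_summary[name], rhs_summary[name]
    # A zero on either side means the case produced no usable value,
    # so report which target is affected instead of dividing.
    if lhs == 0 or rhs == 0:
        raise BenchDiffError(
            f"Benchmark value for target '{name}' is zero "
            f"(lhs={lhs}, rhs={rhs}); the case may not have run."
        )
    # Assumed output format; the real one is cut off in the traceback.
    return f"{lhs / rhs:.2f}x"
```

With a check like this, the diff would stop with a message naming the offending target rather than a bare ZeroDivisionError.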

anandrdbz commented 3 months ago

Is there a reason why these are integer values and not floats?

sbryngelson commented 3 months ago

@anandrdbz out of convenience, I suppose. Please see this issue for a possible fix https://github.com/MFlowCode/MFC/issues/393

sbryngelson commented 3 months ago

@anandrdbz I watched the last PR fail a few times here: https://github.com/MFlowCode/MFC/actions/runs/8638692906/job/23683596528?pr=285

I'm not really sure why there's a divide-by-zero problem or what is happening. It seems like one of the tests failed (either PR or master), but it isn't reporting that. @henryleberre any idea what's going on? We could look into the logs for this as well.

Update: In that PR, I think something is causing all of the cases to output 0 (likely the code @anandrdbz put in the .mako file). I suspect this is the problem whenever we see a divide-by-zero error: a case either didn't run or there's a bug in printing its length.
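One way to confirm this suspicion would be to scan a benchmark file for zero or missing values before diffing. The sketch below assumes the loaded YAML is a mapping from case slug to a per-target summary dictionary; that layout and the name `find_zero_entries` are assumptions, not taken from the toolchain.

```python
def find_zero_entries(summaries: dict) -> list:
    """Return (slug, target) pairs whose recorded value is zero or missing."""
    bad = []
    for slug, summary in summaries.items():
        for target, value in summary.items():
            # Treats 0, 0.0, and None alike: none of them can be compared.
            if not value:
                bad.append((slug, target))
    return bad
```

Running this on both the master and PR files would show whether every case reports 0 (pointing at the output code) or only a few do (pointing at individual cases that failed to run).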

sbryngelson commented 1 month ago

Fixed by #423