Closed wilfonba closed 6 months ago
It seems that we should add an argument option to the case file that indicates whether it is running on a CPU or a GPU. If it's a CPU, it should use fewer time steps than the GPU case (so that the benchmarks run for roughly similar amounts of time).
Please add a description of each case and what features it benchmarks at the top of the case file (or somewhere else easy to find). Right now if one benchmark result is much slower than another, it would not be clear what is causing the slowdown.
Information about what's enabled in each case is already given in comments in bench.yaml. I'll add a GPU vs CPU input though.
can you rename the cases to reflect (roughly) what they're checking?
output.txt files
Test5/ directory
I'm unsure what Test5 directory you're referring to. I can't find it anywhere, and searching the repo doesn't show anything.
edit: Removed comment about shortening names. Found a decent way.
is this ready to merge @wilfonba ?
I need to finish wiring up 'DICT["gpu"]' in the case file. I was rushing to finish it before going out and kept running into issues because of that. I'll finish it and send a message either tonight or tomorrow.
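A hedged sketch of the idea, assuming a `DICT`-style argument dictionary in the case file (the helper name and the step counts below are mine, for illustration only, not the actual MFC case-file API):

```python
# Hypothetical sketch: branch the time-step count on a GPU flag so that
# CPU and GPU benchmark runs take roughly similar wall time.
def pick_time_steps(case_args, gpu_steps=3000, cpu_steps=300):
    """Return a step count: fewer steps for the slower CPU runs."""
    return gpu_steps if case_args.get("gpu", False) else cpu_steps

# Example usage inside a case file: set t_step_stop from the flag.
DICT = {"gpu": True}
t_step_stop = pick_time_steps(DICT)  # gpu_steps when running on GPUs
```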
I'm having issues where the git clone for hdf5 fails more than 3 times. I'll try again tomorrow and hope whatever hairball is causing this is resolved.
@sbryngelson This should be ready to merge as soon as all of the tests run. I haven't verified how long the GPU benchmarking takes, but the CI will indicate if I need to make any changes.
I think the benchmarking CI doesn't work until merged, correct? Though I guess it will run the benchmarks on the PR, so maybe it does...
Yeah. It won't actually compare anything, but I think it will still run the cases. I can also just run it manually on phoenix.
@wilfonba GitHub was probably rate-limiting your clone requests.
I'm not sure. Checking the GPU logs, all the jobs (and their actual case outputs) completed successfully; only the toolchain (seemingly) failed. It likely failed because the master benchmarks are different cases than the PR ones. If that is indeed the case, we should throw a proper exception here.
Relevant code snippet https://github.com/wilfonba/MFC-Wilfong/blob/ef9cf27924c486cec299d24b97c2b16e5dc0f819/toolchain/mfc/bench.py#L47-L69
Edit: No, problem is in the snippet probably (though it appears to already check? maybe something going wrong there?) https://github.com/wilfonba/MFC-Wilfong/blob/ef9cf27924c486cec299d24b97c2b16e5dc0f819/toolchain/mfc/bench.py#L79-L85
Other than this, I'm glad to merge the PR. The cases are quite good and take ~50 minutes according to the SLURM log in the artifact. If there is a simple fix for this @henryleberre I would like to include it in the PR but if not then we can merge.
@sbryngelson It did fail for that reason, but not in the manner you expected. The code already handles the situation where the cases run in the two compared benchmarks differ, by comparing only cases whose names are common to both benchmark result files (taking the set intersection of case names). I tested this to make sure it worked.
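The intersection logic described above can be sketched roughly like this (the `"cases"` field name is an assumption for illustration, not necessarily the actual bench.yaml schema):

```python
def common_cases(lhs, rhs):
    """Compare only benchmark cases whose names appear in both result
    files, i.e. the set intersection of case names."""
    lhs_names, rhs_names = set(lhs["cases"]), set(rhs["cases"])
    if lhs_names != rhs_names:
        # Mirrors the "Cases in lhs and rhs are not equal" warning.
        print("Warning: Cases in lhs and rhs are not equal.")
    return lhs_names & rhs_names
```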
The error here arose from an oversight on my end. We also check whether the metadata field differs between the compared benchmarks. It includes both the CLI invocation and the modes MFC was run in (e.g. --gpu, --mpi, --debug, etc.). When they differ, we produce a warning. However, the code I wrote also aborted immediately. I fixed this and gave the warning messages some more polish.
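A minimal sketch of the warn-but-continue behavior (the function name and dict-shaped metadata are my assumptions, not the actual bench.py code):

```python
def metadata_match(lhs_meta, rhs_meta):
    """Warn when metadata (CLI invocation + modes) differ, but return
    instead of aborting so the case comparison still runs."""
    if lhs_meta == rhs_meta:
        return True
    print("Warning: Metadata in lhs and rhs are not equal.")
    print("This could mean that the benchmarks are not comparable "
          "(e.g. one was run on CPUs and the other on GPUs).")
    return False
```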
The new output looks like (ran locally):
~/dev/MFC $ ./mfc.sh bench_diff ~/Downloads/logs-gpu/master/bench-gpu.yaml ~/Downloads/logs-gpu/pr/bench-gpu.yaml
mfc: OK > (venv) Entered the Python 3.12.2 virtual environment (>= 3.8).
.=++*: -+*+=. | henryleberre@Henrys-Laptop.local [Darwin]
:+ -*- == =* . | -----------------------------------------
:*+ == ++ .+- | --jobs 1
:*##-.....:*+ .#%+++=--+=:::. | --mpi
-=-++-======#=--**+++==+*++=::-:. | --no-gpu
.:++=----------====+*= ==..:%..... | --no-debug
.:-=++++===--==+=-+= +. := | --targets pre_process, simulation, and post_process
+#=::::::::=%=. -+: =+ *: | ----------------------------------------------------------
.*=-=*=.. :=+*+: -...-- | $ ./mfc.sh (build, run, test, clean, count, packer) --help
Comparing Benchmarks: ../../Downloads/logs-gpu/master/bench-gpu.yaml is x times slower than ../../Downloads/logs-gpu/pr/bench-gpu.
Warning: Metadata in lhs and rhs are not equal.
This could mean that the benchmarks are not comparable (e.g. one was run on CPUs and the other on GPUs).
lhs:
* Invocation: bench -j 24 -o bench-gpu.yaml -- -c phoenix --gpu -g 0 1 -n 2
* Modes: debug=False gpu=False mpi=True
rhs:
* Invocation: bench --mem 1 -j 24 -o bench-gpu.yaml -- -c phoenix --gpu -g 0 1 -n 2
* Modes: debug=False gpu=False mpi=True
Warning: Cases in lhs and rhs are not equal.
* rhs cases: ibm, viscous_weno5_sgb_mono, 5eq_rk3_weno3_hllc, hypo_hll.
* lhs cases: 3D_shockdroplet.
Using intersection: set() with 0 elements.
Case Pre Process Simulation Post Process
────────────────────────────────────────────────
mfc: (venv) Exiting the Python virtual environment.
Description
This PR replaces the existing shock droplet benchmark with 5 new benchmarking cases that run for more time-steps and combined cover a large portion of the code's features.
Fixes #363
How Has This Been Tested?
This PR should run the benchmarks to test the changes.
Checklist
I ran ./mfc.sh format before committing my code.