LLNL / benchpark

An open collaborative repository for reproducible specifications of HPC benchmarks and cross site benchmarking environments
https://software.llnl.gov/benchpark/
Apache License 2.0
25 stars 22 forks source link

Update benchpark for ramble package manager abstraction #301

Closed douglasjacobsen closed 1 month ago

douglasjacobsen commented 1 month ago

This merge performs the necessary update to benchpark's configuration and application files to support the package manager abstraction in Ramble.

douglasjacobsen commented 1 month ago

There are other ways you can go about configuring the variants block too by the way. It's possible that you'd prefer these to come from system definitions rather than each experiment, but I'll let you all make that call.

pearce8 commented 1 month ago

Thanks, @douglasjacobsen! We need to update to use new Spack (I think the bug fix we were waiting on is in), and new Ramble, before we can merge this PR.

As for the variants block, considering some of the examples we are expecting that wouldn't necessarily use Spack, I think having this in an experiment (not system) makes more sense. @becker33 @alecbcs do you think it should be defined by default in base experiment, and an individual experiment could overwrite it?

pearce8 commented 1 month ago

I updated Spack and Ramble versions, and enabled CI. ramble workspace archive for some reason is looking for spack.yaml (see the error in Checks) - I don't think this is hardcoded in our CI, so at the moment I am not sure what is triggering this error.

scheibelp commented 1 month ago

@douglasjacobsen regarding the failed CI check, the first Ramble command we are (were) running is ramble workspace setup. This command appears to now do the following:

  1. Concretizes the list of provided packages
  2. ? (I was assuming the packages would be installed)
  3. Resolve the installation paths of the concretized packages

Step 3 fails because step 2 doesn't appear to run. Stepping through the phases, I see spack/package_manager._software_install appears to silently skip installation:

        if self.environment_required(): <-- this fails and guards the installation attempt
            ...

within that call, the part that fails:

 79         def environment_required(self):
 80             app_inst = self.app_inst
 81             if hasattr(app_inst, "software_specs"):
 82                 for pkg, info in app_inst.software_specs.items(): <-- this is empty
 83                     if fnmatch.fnmatch(self.name, info["package_manager"]):
 84                         return True
 85     
 86             return False

The saxpy application in Benchpark is possibly not idiomatic WRT Ramble in that it declares no software_spec for saxpy: we were leaning on the ramble.yaml to do that. Is that no longer an option?

douglasjacobsen commented 1 month ago

@scheibelp This should be fixed with the newest version of Ramble. Thanks for pointing this out!

scheibelp commented 1 month ago

Awesome, it looks like this got to the next phase, the failure should be investigated by us I think.

scheibelp commented 1 month ago

The expanded execute_experiment.tpls were wrong because of some sloppy code in the allocation modifier (that I wrote):

None
rm -f "<benchpark-workspace>/saxpy/openmp/nosite-x86_64/workspace/experiments/saxpy/problem/saxpy_512_1_2/saxpy_512_1_2.out"
touch "<benchpark-workspace>/saxpy/openmp/nosite-x86_64/workspace/experiments/saxpy/problem/saxpy_512_1_2/saxpy_512_1_2.out"
rm -f "<benchpark-workspace>/saxpy/openmp/nosite-x86_64/workspace/experiments/saxpy/problem/saxpy_512_1_2/saxpy_512_1_2.out"
touch "<benchpark-workspace>/saxpy/openmp/nosite-x86_64/workspace/experiments/saxpy/problem/saxpy_512_1_2/saxpy_512_1_2.out"

d53da9b gets rid of "None", but I haven't yet figured out why (or fixed) the repeated elements. That seems to be the only part that's repeated, so the resulting expanded script should run, but it would be good to know why it happens twice

douglasjacobsen commented 1 month ago

@scheibelp There are two reasons why the log files are repeated. The first is a ramble thing (https://github.com/GoogleCloudPlatform/ramble/pull/568) the second is because your saxpy application definition has an executable which redirects to '{log_file}' while the success criteria has a log file of '{experiment_run_dir}/{experiment_name}.out'

Both of these technically expand to the same file, but are different strings, so each is added to the set of files that needs to be purged. And from a template expansion point of view we don't know they are the same until we expand them.

The PR I linked for Ramble will remove the repeated log files, but if you want to make a change on the benchpark side, have the saxpy success criteria just use '{log_file}'.

scheibelp commented 1 month ago

@douglasjacobsen I may be misunderstanding you: if I use Ramble commit 6e1a3254b6c6eaefb80c8beea3082175731f0ff8, and the following diff:

diff --git a/repo/saxpy/application.py b/repo/saxpy/application.py
index be6de5d..7cb0208 100644
--- a/repo/saxpy/application.py
+++ b/repo/saxpy/application.py
@@ -25,4 +25,4 @@ class Saxpy(ExecutableApplication):
     figure_of_merit('Kernel {num} size', fom_regex=r'Kernel done \((?P<num>[0-9]+)\): (?P<size>[0-9]+)', group_name='size', units='')
     figure_of_merit("success", fom_regex=r'(?P<done>Kernel done)', group_name='done', units='')

-    success_criteria('pass', mode='string', match=r'Kernel done', file='{experiment_run_dir}/{experiment_name}.out')
+    success_criteria('pass', mode='string', match=r'Kernel done', file='{log_file}')

I get the same result. I got the impression from your comment that I either needed to use https://github.com/GoogleCloudPlatform/ramble/pull/568, XOR I needed to make the above change

If I use Ramble#568 (i.e. a838e7f), then the duplication is avoided (with or without the change to saxpy); for now, I have updated the Ramble version used by Benchpark to point to that.

douglasjacobsen commented 1 month ago

@scheibelp Yeah, I would have expected that to work. But maybe something is slightly different (like maybe one of the log files is fully expanded before it gets there).

Either way, I'm glad the newer version fixes it for you.