compspec / compspec-go

Prototype compatibility plugin in Go for testing compspec descriptive metadata

LAMMPS Experiment needs #13

Closed: vsoch closed this issue 6 months ago

vsoch commented 7 months ago

The above were addressed with #14

I'll close this issue after I've tested the approaches below. For running an experiment, we want to be able to:

  1. Request a match given some parameters, either just the architecture, or the architecture + extra metadata. To start this will likely be manual, e.g., I'll run a command to get the specifics of my host, and then provide those directly to compspec match. I don't want to over-engineer that, but (at some point) that should be something handled by the scheduler.
  2. Ask to randomly shuffle all matches: the more specific things we ask for, the smaller this set will be. This should emulate randomly selecting from either a poorly matched set (basic mode, just the architecture) or a well-matched set (asking for more). Steps 2 and 4 are sketched in Go after this list.
  3. Use the result of that to either create a job (to run with Flux and a Singularity container) or generate a MiniCluster CRD to run the job and pipe output to the log.
  4. Run the job, and report success or failure along with the wall time.

I'm undecided about step 3. If we create one cluster, then we might be dealing with Singularity and somehow matching MPI on the host (though ideally I'd rather not have MPI on the host at all). I don't actually remember the nuances of Singularity with MPI and Flux. It could just work with the Singularity container providing both MPI and Flux, and (of course) we don't have some nice vendor-provided MPI on the host to bind to that would give better network performance (the network will typically suck on Google Cloud, and that's out of our control).

If we take the second approach, then we need to populate a minicluster.yaml dynamically for the container in question. The only subtle difference is the entrypoint command: some GPU builds use lmp_gpu, and I'm sure there is a Spack build somewhere in there too! It wouldn't be so terrible to have a few templates (there's a sketch below), but it seems like more work than just running a container. That said, it would be more cloud native / batch friendly than needing to create a single MiniCluster.

I can see benefits for both approaches for future work: extending to an actual cluster favors the Singularity approach, while extending to a Kubernetes scheduler, Usernetes, or a batch approach favors the second.

I also keep hitting this situation where I want to "launch jobs in Kubernetes" using the Flux Operator. Although I designed flux-cloud for our first KubeCon runs, I already dislike it: the design isn't good enough for a more robust tool that knows how to launch and manage jobs. I think I'm eventually going to want to build something better, so I will think about that. To start, I'm going to do the updates to compspec-go in the list above, then likely test out a bunch of random stuff. yolo!

vsoch commented 6 months ago

These first experiments are done! Woot! :partying_face: