Canopie22 artifacts for reproducibility of our work

milroy commented 1 year ago

YAML manifests, Dockerfiles, Python scripts, JSON files, and output data needed to reproduce our CANOPIE 2022 paper, "One Step Closer to Converged Computing: Achieving Scalability with Cloud-Native HPC".

milroy commented 1 year ago

Thanks @cmisale! I'm going to update a few things and then we can merge it soon.

vsoch commented 1 year ago

Only 45K lines of yaml! :laughing: :sob:

Are there instructions for how to use the containers outside of the MPI Operator? Maybe a TODO for a separate repository with very basic examples of running Flux? (I understand that this repository is scoped to reproducing your paper work!)

milroy commented 1 year ago

> Are there instructions for how to use the containers outside of the MPI Operator? Maybe a TODO for a separate repository with very basic examples of running Flux? (I understand that this repository is scoped to reproducing your paper work!)

After a bit more thought, those build and run instructions (which I still need to write) belong in the MPI Operator repo. I'll get a PR ready for that.

milroy commented 1 year ago

@cmisale: how should we go about replacing kubeflux with fluence in the manifests, etc.? I assume a recursive replace will break KubeFlux, right?

vsoch commented 1 year ago

Just be careful with the .git directory when doing it on the command line. I have broken many a git directory with this kind of substitution!

If you have VS Code, it has a nice find and replace that shows you the matches before you apply the change.
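
The caution about `.git` can also be handled in a script. The following is only a sketch, not something used in this PR: `replace_in_tree` and its parameters are hypothetical names, and it assumes a plain case-sensitive string substitution over UTF-8 text files. It defaults to a dry run so the affected files can be reviewed first.

```python
import os


def replace_in_tree(root, old, new, skip_dirs=(".git",), dry_run=True):
    """Replace `old` with `new` in text files under `root`.

    Hypothetical helper: directories named in `skip_dirs` are never entered.
    Returns the sorted list of files that contain `old`.
    """
    matched = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # newline="" keeps the file's original line endings.
                with open(path, encoding="utf-8", newline="") as fh:
                    text = fh.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            if old not in text:
                continue
            matched.append(path)
            if not dry_run:
                with open(path, "w", encoding="utf-8", newline="") as fh:
                    fh.write(text.replace(old, new))
    return sorted(matched)
```

Calling `replace_in_tree(".", "kubeflux", "fluence")` only lists the files that would change; passing `dry_run=False` rewrites them. Because the match is case-sensitive, a spelling such as "KubeFlux" would need its own pass.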

milroy commented 1 year ago

> Just be careful with the .git directory when doing it on the command line. I have broken many a git directory with this kind of substitution!

Excellent point; noted.

cmisale commented 1 year ago

> @cmisale: how should we go about replacing kubeflux with fluence in the manifests, etc.? I assume a recursive replace will break KubeFlux, right?

@milroy yeah, it would break a few things. I would need to (1) change the container names and (2) change the hard-coded name.

Should I do that now?

milroy commented 1 year ago

> Should I do that now?

I was about to say yes, but I just thought of something. All the data that I put in this PR references "KubeFlux". Should I leave those references in the data? If so, plotting (and job management, eventually) will need to change, too. For example, in the Python plotting scripts, we have:

        sns.boxplot(
            x="ranks",
            y="real",
            hue="scheduler",
            data=df,
            whis=[5, 95],
            palette={"default-scheduler": "#4878d0", "kubeflux": "#ee854a"},
        )

default-scheduler and kubeflux are keys read from the raw output data.

milroy commented 1 year ago

I think a simple check in the plotting scripts for either the kubeflux or fluence scheduler name will solve this. I'll try this later today and update you, @cmisale.
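
One way such a check could look is sketched below. This is not the fix that went into the plotting scripts: `SCHEDULER_COLORS` and `palette_for` are hypothetical names, and reusing the kubeflux color for a `"fluence"` key is an assumption. The two existing keys and colors are taken from the `sns.boxplot` call quoted earlier.

```python
# Colors for the two keys in the quoted boxplot call, plus an assumed
# "fluence" entry that reuses the kubeflux color.
SCHEDULER_COLORS = {
    "default-scheduler": "#4878d0",
    "kubeflux": "#ee854a",
    "fluence": "#ee854a",
}


def palette_for(schedulers):
    """Build a palette limited to the scheduler names present in the data.

    Hypothetical helper: raises ValueError on a name it does not know,
    so a renamed scheduler is noticed instead of silently unplotted.
    """
    present = set(schedulers)
    unknown = sorted(present - set(SCHEDULER_COLORS))
    if unknown:
        raise ValueError(f"unexpected scheduler name(s): {unknown}")
    return {name: SCHEDULER_COLORS[name] for name in sorted(present)}
```

With this in place the plotting call could pass `palette=palette_for(df["scheduler"])`, which would work whether the raw data says kubeflux or fluence.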

milroy commented 1 year ago

Ok, I've cleaned up the plotting scripts and ensured they work. @cmisale go ahead and make the two changes you mentioned above.

cmisale commented 1 year ago

@milroy sounds good! I should be able to finish it today.

cmisale commented 1 year ago

Sidecar: git@github.com:flux-framework/flux-k8s.git updated, along with the tag
Fluence: git@github.com:openshift-psap/scheduler-plugins.git updated
New container image release for Fluence sidecar: quay.io/cmisale1/fluence-sidecar:latest
New container image release for Fluence plugin scheduler: quay.io/cmisale1/fluence:upstream
Tested on a Kubernetes cluster with dummy workloads, installed through helm

Pretty name showing up

claudias-air:charts cmisale$ k get po -n scheduler-plugins 
NAME                                           READY   STATUS    RESTARTS   AGE
fluence-bc9758657-4m57b                        2/2     Running   0          65m
scheduler-plugins-controller-f5cdf9674-8ptsr   1/1     Running   0          65m

milroy commented 1 year ago

Thank you @cmisale!

I added a fixup commit with the changes to the repo. I'll test the combined changes on EKS ASAP and let you know what happens.

milroy commented 1 year ago

It works:

Events:
  Type    Reason     Age   From     Message
  ----    ------     ----  ----     -------
  Normal  Scheduled  14s   fluence  Successfully assigned default/lammps-4661ea379f0b-worker-0 to ip-192-168-41-26.us-east-2.compute.internal
  Normal  Pulling    14s   kubelet  Pulling image "milroy1/kf-testing:lammps-focal-openmpi-4.1.2-amd-efa"
  Normal  Pulled     13s   kubelet  Successfully pulled image "milroy1/kf-testing:lammps-focal-openmpi-4.1.2-amd-efa" in 275.853816ms
  Normal  Created    13s   kubelet  Created container worker
  Normal  Started    13s   kubelet  Started container worker

run_experiments.py works as expected, too.

milroy commented 1 year ago

I'm merging this PR. Thank you very much @cmisale!

cmisale commented 1 year ago

Great!!

flux-framework / flux-k8s

Canopie22 artifacts for reproducibility of our work #31