milroy closed 1 year ago
Thanks @cmisale! I'm going to update a few things and then we can merge it soon.
Only 45K lines of yaml! :laughing: :sob:
Are there instructions for using the containers outside of the MPI Operator? Maybe a TODO for a separate repository with very basic examples of running Flux? (I understand that this repository is scoped to reproducing your paper's work!)
After a bit more thought, those build and run instructions (which I still need to write) belong in the MPI Operator repo. I'll get a PR ready for that.
@cmisale: how should we go about replacing kubeflux with fluence in the manifests, etc.? I assume a recursive replace will break KubeFlux, right?
Just be careful with the .git directory when doing it on the command line - I have broken many a git directory with this kind of substitution!
If you have VSCode, it has a nice find and replace that shows you each match before changing anything.
Excellent point; noted.
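For reference, the caution above about the .git directory can be sketched in Python. This is only an illustration, not what was actually run for this PR: the function name `replace_in_tree` and its parameters are hypothetical. It prunes `.git` from the walk and defaults to a dry run, so the matches can be reviewed first (similar in spirit to the VSCode preview).

```python
import os
from pathlib import Path


def replace_in_tree(root, old, new, skip_dirs=(".git",), dry_run=True):
    """Replace `old` with `new` in text files under `root`.

    Hypothetical helper: directories named in `skip_dirs` are never entered.
    Returns the sorted list of files that contain `old`.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into .git
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # newline="" keeps the file's original line endings
                with open(path, encoding="utf-8", newline="") as fh:
                    text = fh.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            if old not in text:
                continue
            matches.append(path)
            if not dry_run:
                with open(path, "w", encoding="utf-8", newline="") as fh:
                    fh.write(text.replace(old, new))
    return sorted(matches)
```

Calling `replace_in_tree(".", "kubeflux", "fluence")` only lists the files that would change; passing `dry_run=False` applies the substitution. The match is case-sensitive, so occurrences of "KubeFlux" are left alone.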
@milroy yeah, it would break a few things. I would need to (1) change the container names and (2) change the hard-coded name.
Should I do that now?
I was about to say yes, but just considered something. All the data that I put in this PR references "KubeFlux". Should I leave the references in the data? If so, plotting (and job management, eventually) will need to change, too. For example, in the Python plotting scripts, we have:
    sns.boxplot(
        x="ranks",
        y="real",
        hue="scheduler",
        data=df,
        whis=[5, 95],
        palette={"default-scheduler": "#4878d0", "kubeflux": "#ee854a"},
    )
default-scheduler and kubeflux are keys read from the raw output data.
I think a simple check in the plotting scripts for either the kubeflux or fluence scheduler name will solve this. I'll try this later today and update you, @cmisale.
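One way such a check might look, as a rough sketch rather than the code that ended up in the plotting scripts: the helper `build_palette` and the constant names are hypothetical, and the colors are the ones from the `sns.boxplot` call quoted earlier. It accepts either scheduler name and assigns both the same color, so old data labeled "kubeflux" and new data labeled "fluence" plot identically.

```python
# Colors taken from the existing palette; the names below are assumptions.
DEFAULT_COLOR = "#4878d0"
FLUX_COLOR = "#ee854a"
FLUX_NAMES = ("kubeflux", "fluence")  # old and new scheduler names


def build_palette(scheduler_names):
    """Map each scheduler name found in the data to a plot color."""
    palette = {}
    for name in scheduler_names:
        if name == "default-scheduler":
            palette[name] = DEFAULT_COLOR
        elif name in FLUX_NAMES:
            palette[name] = FLUX_COLOR
        else:
            # Fail loudly instead of silently dropping an unknown scheduler
            raise ValueError(f"unexpected scheduler name: {name!r}")
    return palette
```

In the plotting call, the hard-coded dictionary could then become `palette=build_palette(df["scheduler"].unique())`.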
Ok, I've cleaned up the plotting scripts and ensured they work. @cmisale go ahead and make the two changes you mentioned above.
@milroy sounds good! I should be able to complete it by today
Pretty name showing up
claudias-air:charts cmisale$ k get po -n scheduler-plugins
NAME                                           READY   STATUS    RESTARTS   AGE
fluence-bc9758657-4m57b                        2/2     Running   0          65m
scheduler-plugins-controller-f5cdf9674-8ptsr   1/1     Running   0          65m
Thank you @cmisale!
I added a fixup commit with the changes to the repo. I'll test the combined changes on EKS ASAP and let you know what happens.
It works:
Events:
Type    Reason     Age   From     Message
----    ------     ----  ----     -------
Normal  Scheduled  14s   fluence  Successfully assigned default/lammps-4661ea379f0b-worker-0 to ip-192-168-41-26.us-east-2.compute.internal
Normal  Pulling    14s   kubelet  Pulling image "milroy1/kf-testing:lammps-focal-openmpi-4.1.2-amd-efa"
Normal  Pulled     13s   kubelet  Successfully pulled image "milroy1/kf-testing:lammps-focal-openmpi-4.1.2-amd-efa" in 275.853816ms
Normal  Created    13s   kubelet  Created container worker
Normal  Started    13s   kubelet  Started container worker
run_experiments.py works as expected, too.
I'm merging this PR. Thank you very much @cmisale!
Great!!
YAMLs, dockerfiles, Python scripts, JSONs and output data necessary to reproduce our CANOPIE 2022 paper: One Step Closer to Converged Computing: Achieving Scalability with Cloud-Native HPC.