flux-framework / flux-k8s

Project to manage Flux tasks needed to standardize Kubernetes HPC scheduling interfaces
Apache License 2.0

testing: gke then eks #67

Closed vsoch closed 5 months ago

vsoch commented 5 months ago

I am making small changes as I test on GKE and EKS. My first tests on GKE had me creating and deleting jobs, and I think the state of fluence (fluxion) got out of sync with the jobs, meaning that fluxion thought jobs were running that were not, and then was unable to allocate new ones. To adjust for that we can add back the cancel response, but that only works as long as fluence has not lost its memory of the job id. We likely need an approach that can either:

- save the jobids to the state data (so they could be reloaded),
- inspect jobs explicitly and purge them, or
- (better) look up a job not by its jobid, but by the group id (the command in the jobspec).

With the last option, even if we lose all of our state we can still find the old (stale) job and delete it (a rough sketch follows below).

With a fresh state and a larger cluster I am able to run jobs on GKE, but they are enormously slow - lammps size 2 2 2 is taking over 20 minutes. This is not the fault of fluence - GKE networking sucks. To keep debugging I likely need to move over to AWS with EFA, though that of course introduces more things to figure out, like EFA setup.
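A rough sketch of the group-id idea (all names here are hypothetical - `groupJobs`, the state file path, and the cancel callback are placeholders, not the fluence/fluxion API): keep a small group-name to jobid mapping that is flushed to disk, so after a restart we can still cancel by group name instead of needing the in-memory jobid.

```go
// Hypothetical sketch: persist the group-name -> fluxion jobid mapping so a
// restarted scheduler can still cancel stale jobs by group name.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// groupJobs maps a pod-group name (derived from the jobspec command) to the
// fluxion jobid that was allocated for it.
type groupJobs map[string]int64

const statePath = "/tmp/fluence-jobids.json" // placeholder state file

// load reads the mapping back after a restart; a missing file means no state.
func load() (groupJobs, error) {
	jobs := groupJobs{}
	data, err := os.ReadFile(statePath)
	if os.IsNotExist(err) {
		return jobs, nil
	} else if err != nil {
		return nil, err
	}
	return jobs, json.Unmarshal(data, &jobs)
}

// save flushes the mapping so a scheduler restart does not lose jobids.
func (g groupJobs) save() error {
	data, err := json.MarshalIndent(g, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(statePath, data, 0o644)
}

// cancelByGroup finds a job by group name instead of jobid, so stale
// allocations can still be cleaned up after in-memory state is lost.
// The cancel callback stands in for the real fluxion cancel call.
func (g groupJobs) cancelByGroup(group string, cancel func(int64) error) error {
	jobid, ok := g[group]
	if !ok {
		return fmt.Errorf("no jobid recorded for group %q", group)
	}
	if err := cancel(jobid); err != nil {
		return err
	}
	delete(g, group)
	return g.save()
}

func main() {
	jobs, err := load()
	if err != nil {
		log.Fatal(err)
	}
	jobs["lammps-2-2-2"] = 42 // pretend fluxion returned jobid 42
	if err := jobs.save(); err != nil {
		log.Fatal(err)
	}
	// Later (possibly after a restart): cancel by group name, not jobid.
	err = jobs.cancelByGroup("lammps-2-2-2", func(id int64) error {
		fmt.Println("would cancel fluxion jobid", id)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

In the real plugin the state would more likely live in a ConfigMap or a mounted volume, and the cancel callback would be fluxion's cancel RPC, but the lookup-by-group shape would be the same.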

vsoch commented 5 months ago

Still having some trouble - I think now it is because the networking in GKE is abysmal, and there is still an issue of state in our operator. But I was able to get a few runs in and at least get a rough comparison.

I think next we likely want to get this running on EKS (so the network isn't an issue) and think harder about the state (jobid mapping, primarily) problem.

[plot: lammps-total-times comparison]
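For the purge/reconcile side, here is one possible shape (again hypothetical - the group label name is assumed from the coscheduling convention and may differ from what fluence actually uses, and the cancel callback stands in for fluxion's cancel): periodically list pods, collect the pod-group names that still exist, and cancel any recorded fluxion job whose group is gone.

```go
// Hypothetical reconcile sketch: cross-check recorded fluxion jobids against
// pod groups still present in the cluster, and cancel anything stale.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const groupLabel = "scheduling.x-k8s.io/pod-group" // assumed label name

// liveGroups returns the set of pod-group names that still have pods.
func liveGroups(ctx context.Context, clientset kubernetes.Interface) (map[string]bool, error) {
	pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	groups := map[string]bool{}
	for _, pod := range pods.Items {
		if g, ok := pod.Labels[groupLabel]; ok {
			groups[g] = true
		}
	}
	return groups, nil
}

// reconcile cancels every recorded fluxion job whose pod group has vanished.
func reconcile(ctx context.Context, clientset kubernetes.Interface,
	recorded map[string]int64, cancel func(int64) error) error {
	live, err := liveGroups(ctx, clientset)
	if err != nil {
		return err
	}
	for group, jobid := range recorded {
		if !live[group] {
			fmt.Printf("group %s is gone, cancelling stale fluxion jobid %d\n", group, jobid)
			if err := cancel(jobid); err != nil {
				return err
			}
			delete(recorded, group)
		}
	}
	return nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	recorded := map[string]int64{"lammps-2-2-2": 42} // e.g. loaded from the state file
	err = reconcile(context.Background(), clientset, recorded, func(id int64) error {
		fmt.Println("would cancel fluxion jobid", id)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```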

I also think there is an issue with fluence not properly seeing the resources being used by other things that were not installed with it - which is most of the stuff on the node. This shows up at smaller node sizes, hence why I increased the size for this test. We likely need the ability to get a fuller picture of what is running on the cluster (and to keep updating it, which might also help with our current "getting stale" problem).
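One way to get that fuller picture (a sketch under assumptions - `availableOnNode` and the node name are made up, and a real accounting would also need init containers, pod overhead, and DaemonSets handled properly): list every pod bound to a node regardless of which scheduler placed it, and subtract its requests from the node's allocatable resources.

```go
// Hypothetical sketch: compute what is actually left on a node by subtracting
// the requests of all bound pods, not just the ones fluence scheduled.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// availableOnNode reports allocatable minus the requests of all pods bound to
// the node, regardless of which scheduler placed them.
func availableOnNode(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	cpu := node.Status.Allocatable.Cpu().DeepCopy()
	mem := node.Status.Allocatable.Memory().DeepCopy()

	pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
			continue // finished pods no longer hold resources
		}
		for _, c := range pod.Spec.Containers {
			if req, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
				cpu.Sub(req)
			}
			if req, ok := c.Resources.Requests[corev1.ResourceMemory]; ok {
				mem.Sub(req)
			}
		}
	}
	fmt.Printf("node %s: ~%s CPU and ~%s memory left after all pods' requests\n",
		nodeName, cpu.String(), mem.String())
	return nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	// "example-node" is a placeholder; in practice this would run per node.
	if err := availableOnNode(context.Background(), clientset, "example-node"); err != nil {
		log.Fatal(err)
	}
}
```

Running something like this periodically (or watching pods) would also give us the "update it" part, which might help with the getting-stale problem above.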