I'm trying to understand the granularity we get with metav1.Time, because (based on what I'm seeing) when I submit a huge batch of jobs with multiprocessing (likely all within the same second) we get interleaving. I can't think of another reason we'd get blocking, consistently for both default and fluence, when the cluster size is close to the job size (or the ratio is about 1/2, so one large job could take up half the resources). For example, I noticed this PR https://github.com/tilt-dev/tilt/pull/4313, which mentions that some APIs were using metav1.Time, which (according to the PR) is only stored with second-level granularity. Their fix was to use metav1.MicroTime. Specifically:
Currently, metav1.Time is only stored with second-level granularity, which is probably not sufficient for this API.
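To sanity-check that claim, here's a minimal sketch (my own, not from the tilt PR) that marshals the same time.Time through both wrappers. metav1.Time serializes as RFC3339 with seconds only, while metav1.MicroTime keeps microseconds, so two groups created microseconds apart become indistinguishable under the former:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	now := time.Now()

	// Both wrap the same instant, but serialize with different precision.
	t, _ := json.Marshal(metav1.NewTime(now))
	mt, _ := json.Marshal(metav1.NewMicroTime(now))

	// Two pod groups created in the same second marshal to identical
	// metav1.Time values, so a sort on creation time can't order them.
	fmt.Println("metav1.Time:     ", string(t))  // e.g. "2024-01-15T10:30:05Z"
	fmt.Println("metav1.MicroTime:", string(mt)) // e.g. "2024-01-15T10:30:05.123456Z"
}
```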
And indeed the PodGroup uses metav1.Time, as we can see defined here, which wraps here again. If we want to handle this "spamming the scheduler" case (and not screw up the sort), I think we also need to use https://github.com/kubernetes/apimachinery/blob/02a41040d88da08de6765573ae2b1a51f424e1ca/pkg/apis/meta/v1/micro_time.go#L31. This also means the PodGroup abstraction has that bug; (I think) it just wasn't an issue before when launching only 3-5 jobs. What I should probably do is create a new branch off my current development one, restore some of the cache logic I was working on with an internal PodGroup, and test a very simple (stupid) approach: create a MicroTime the first time I see a group go through sort. If that resolves the interleaving, we can be more confident it's related to time. I ran out of extra credits today, but I should be able to test this locally with kind (I was seeing interleaving there too, which is why I abandoned the experimental design in the first place!). A rough sketch of what I mean follows below.
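For reference, here's a rough sketch of that "stamp the group on first sort" idea. The names (groupTimes, FirstSeen, ByFirstSeen) are hypothetical placeholders, not the actual fluence code; the point is just to record a metav1.MicroTime the first time a group passes through sort and compare on that:

```go
package main

import (
	"fmt"
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// groupTimes caches the first-seen MicroTime per pod group, guarded by a
// mutex since the scheduler can call sort from multiple goroutines.
type groupTimes struct {
	mu   sync.Mutex
	seen map[string]metav1.MicroTime
}

func newGroupTimes() *groupTimes {
	return &groupTimes{seen: map[string]metav1.MicroTime{}}
}

// FirstSeen returns the cached timestamp for a group, creating one with
// microsecond granularity the first time the group shows up in sort.
func (g *groupTimes) FirstSeen(group string) metav1.MicroTime {
	g.mu.Lock()
	defer g.mu.Unlock()
	if t, ok := g.seen[group]; ok {
		return t
	}
	t := metav1.NowMicro()
	g.seen[group] = t
	return t
}

// ByFirstSeen orders two groups by first-seen MicroTime. With
// second-granularity metav1.Time, a burst of groups submitted within the
// same second would all compare equal here, and the sort could interleave.
func (g *groupTimes) ByFirstSeen(a, b string) bool {
	ta, tb := g.FirstSeen(a), g.FirstSeen(b)
	if ta.Equal(&tb) {
		return a < b // stable tie-break on group name
	}
	return ta.Before(&tb)
}

func main() {
	gt := newGroupTimes()
	// job-001 is stamped first, so it sorts ahead of job-002.
	gt.FirstSeen("job-001")
	gt.FirstSeen("job-002")
	fmt.Println(gt.ByFirstSeen("job-001", "job-002")) // true
}
```

If the ordering holds up with this in place, that would be decent evidence the interleaving comes from second-granularity timestamps colliding rather than from the sort logic itself.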