flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

fluxion assigns jobs requesting unlimited duration an expiration time > instance expiration #1103

Closed garlick closed 6 months ago

garlick commented 7 months ago

Problem: when a job is submitted with unlimited duration, fluxion assigns it an expiration time of (job start time + instance duration), which does not account for the time elapsed between instance start and job start. As a result, the job's expiration time falls after the instance itself expires. In the transcript below, both jobs are given the full 1800s even though the second one starts a few minutes into a 30m allocation.

$ flux alloc -t 30m -N 1
ƒ(s=1,d=1) $ flux kvs get resource.R | jq '.execution | .expiration - .starttime'
1800
ƒ(s=1,d=1) $ flux submit sleep inf
ƒZ5R5knb
ƒ(s=1,d=1) $ flux job info ƒZ5R5knb R | jq '.execution | .expiration - .starttime'
1800
# Wait a few minutes
ƒ(s=1,d=1) $ flux submit sleep inf
ƒ2oCAKVu9
ƒ(s=1,d=1) $ flux job info ƒ2oCAKVu9 R  | jq '.execution | .expiration - .starttime'
1800
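
The arithmetic behind the transcript above can be sketched roughly as follows. This is only an illustration of the reported behavior versus the expected one, not fluxion's actual code: `window_t`, `assign_full_duration`, and `assign_remaining` are hypothetical names, and times are assumed to be plain seconds since the epoch, as in the `starttime` and `expiration` fields of R.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the execution window recorded in R
// (seconds since the epoch).
struct window_t {
    int64_t starttime;
    int64_t expiration;
};

// Reported behavior: an unlimited-duration job is given the full
// instance duration, no matter how late in the instance it starts.
inline window_t assign_full_duration (const window_t &instance,
                                      int64_t job_start)
{
    int64_t duration = instance.expiration - instance.starttime;
    return {job_start, job_start + duration};
}

// Expected behavior: the job is given only the time remaining in the
// instance, so its expiration never exceeds the instance expiration.
inline window_t assign_remaining (const window_t &instance,
                                  int64_t job_start)
{
    int64_t remaining =
        std::max<int64_t> (instance.expiration - job_start, 0);
    return {job_start, job_start + remaining};
}
```

For an instance running from t=1000 to t=2800 (1800s) and a job starting at t=1300, the first function yields an expiration of 3100, which is 300s past the end of the instance, while the second yields 2800.
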
grondo commented 7 months ago

I think the problem is likely here:

https://github.com/flux-framework/flux-sched/blob/310cc2de216d8cee16e98bae8c816f203302a6d9/resource/traversers/dfu_impl.hpp#L73-L74

https://github.com/flux-framework/flux-sched/blob/310cc2de216d8cee16e98bae8c816f203302a6d9/resource/traversers/dfu_impl.hpp#L89-L92

If a jobspec does not have a duration set, then the duration is set to the whole graph duration instead of the remaining time. A possible fix, though I don't know what I'm doing:

diff --git a/resource/traversers/dfu_impl.hpp b/resource/traversers/dfu_impl.hpp
index 23048c01..7ef64948 100644
--- a/resource/traversers/dfu_impl.hpp
+++ b/resource/traversers/dfu_impl.hpp
@@ -70,8 +70,9 @@ struct jobmeta_t {
         now = t;
         jobid = id;
         alloc_type = alloc;
+        const auto sys_now = std::chrono::system_clock::now();
         int64_t g_duration = std::chrono::duration_cast<std::chrono::seconds>
-            (graph_duration.graph_end - graph_duration.graph_start).count ();
+            (graph_duration.graph_end - sys_now).count ();

         if (g_duration <= 0) {
             errno = EINVAL;