flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

qmanager: add get-stats and clear-stats callbacks #1265

Closed trws closed 3 months ago

trws commented 3 months ago

problem: it's hard to get information about the current state of queues without debugging fluxion

solution: add a stats callback that prints information about each queue including:

This does introduce a dependency on boost::json, which technically requires a newer version of boost. I think we already require a newer version for other reasons, and in any case version 1.78 is already installed on TOSS4 from EPEL. We may need to tweak the rhel8 container; we'll see. Here is some example output from a test with 10 jobs blocked on a drained node and two jobs that have completed:

{
  "default": {
    "policy": "easy",
    "queue_depth": 32,
    "max_queue_depth": 1000000,
    "queue_parameters": {},
    "policy_parameters": {},
    "action counts": {
      "pending": 12,
      "running": 2,
      "reserved": 0,
      "rejected": 0,
      "complete": 2,
      "cancelled": 0,
      "reprioritized": 0
    },
    "pending_queues": {
      "pending": [],
      "pending_provisional": [],
      "blocked": [
        "fYDBt63",
        "fYEfsNP",
        "fYEfsNQ",
        "fYG9rej",
        "fYG9rek",
        "fYG9rem",
        "fYHdqw5",
        "fYHdqw6",
        "fYHdqw7",
        "fYK7qDR"
      ]
    },
    "scheduled_queues": {
      "running": [],
      "rejected": [],
      "canceled": []
    }
  }
}

There is plenty more information we could surface here, including the timestamps of jobs moving between states, the state each job considers itself to be in, etc. One thing I think we definitely want is something indicating partial release status, but we're not currently storing it, so I'd rather add that once it has been worked out for the final release.

trws commented 3 months ago

Ok, this took far, far too long to rework to use jansson. I think I ended up spending 5 hours, possibly more, figuring out what was wrong with my format strings and working out the one-off jansson interfaces to make this work. This at least gets the thing working, and starts to give us something to manage jansson values and build up serializers. It also fixes the memory leaks in resource from unfreed jansson objects.
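For context on why the format strings described above are easy to get wrong: `json_pack_ex` is variadic, so the compiler cannot check that each argument matches its format character. In jansson, `I` reads a `json_int_t`, `s%` reads a `const char *` followed by a `size_t` length, `o` takes over the caller's reference, and `O` adds a new one. The sketch below shows one possible type-checked alternative using only the standard library. It is not jansson and not what this PR implements; every name in it (`escape`, `raw_json`, `to_json_text`, `pack_object`, `example_stats`) is hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <type_traits>

// All names below are hypothetical; none of this is jansson or Fluxion API.

// Escape only the characters this example needs (quote, backslash, newline,
// tab). A real serializer would also have to handle other control characters.
inline std::string escape (std::string_view in)
{
    std::string out;
    for (char c : in) {
        switch (c) {
            case '"': out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\t': out += "\\t"; break;
            default: out += c;
        }
    }
    return out;
}

// Already-serialized JSON text, so nested objects can be embedded as values.
struct raw_json {
    std::string text;
};

// One overload per supported value type. Overload resolution picks the
// conversion, so there is no format character to mismatch. bool is excluded
// on purpose: unsupported types fail to compile instead of misbehaving.
template<class T>
std::enable_if_t<std::is_integral_v<T> && !std::is_same_v<T, bool>, std::string>
to_json_text (T v)
{
    return std::to_string (v);
}
inline std::string to_json_text (std::string_view v)
{
    return "\"" + escape (v) + "\"";
}
inline std::string to_json_text (const raw_json &v)
{
    return v.text;
}

// Base case: no members left to append.
inline void append_members (std::string &, bool)
{
}

// Append one "key":value member, then recurse on the remaining pairs.
template<class V, class... Rest>
void append_members (std::string &out,
                     bool first,
                     std::string_view key,
                     const V &value,
                     const Rest &...rest)
{
    if (!first)
        out += ",";
    out += "\"" + escape (key) + "\":" + to_json_text (value);
    append_members (out, false, rest...);
}

// Build a JSON object from alternating key/value arguments.
template<class... Args>
raw_json pack_object (const Args &...args)
{
    static_assert (sizeof...(Args) % 2 == 0, "expected key/value pairs");
    std::string out = "{";
    append_members (out, true, args...);
    return raw_json{out + "}"};
}

// Usage: a reduced version of the stats object from the example output.
inline std::string example_stats ()
{
    std::int64_t queue_depth = 32;
    raw_json counts = pack_object ("pending", 12, "running", 2);
    return pack_object ("policy", "easy",
                        "queue_depth", queue_depth,
                        "action_counts", counts).text;
}
```

The trade-off is that this builds text directly instead of a `json_t` tree, so it could not be handed to code that expects jansson values without an extra parse step.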

Hopefully this is good enough for the release, and I'll spend a bit more time to make something that has the uniform interface of object and array-like things in C++, because this feels like a loss:

     {
-        using namespace boost::json;
-        auto &obj = jv.emplace_object ();
-        obj = {{"policy", this->policy ()},
-               {"queue_depth", m_queue_depth},
-               {"max_queue_depth", m_max_queue_depth},
-               {"queue_parameters", value_from (m_qparams)},
-               {"policy_parameters", value_from (m_pparams)},
-               {"action counts",
-                {{"pending", m_pq_cnt},
-                 {"running", m_rq_cnt},
-                 {"reserved", m_oq_cnt},
-                 {"rejected", m_dq_cnt},
-                 {"complete", m_cq_cnt},
-                 {"cancelled", m_cancel_cnt},
-                 {"reprioritized", m_reprio_cnt}}}};
+        json::value qparams;
+        to_json (qparams, m_qparams);
+        json::value pparams;
+        to_json (pparams, m_pparams);
         char buf[128] = {};
-        auto add_queue = [&] (::boost::json::array &a, auto &map) {
+        auto add_queue = [&] (json_t *a, auto &map) {
             for (auto &[k, jobid] : map) {
                 if (flux_job_id_encode (jobid, "f58plain", buf, sizeof buf) < 0)
-                    a.push_back (jobid);
+                    json_array_append_new (a, json_integer (jobid));
                 else
-                    a.push_back (buf);
+                    json_array_append_new (a, json_string (buf));
             }
         };
-        auto &pending_queues = obj["pending_queues"].emplace_object ();
-        add_queue (pending_queues["pending"].emplace_array (), m_pending);
-        add_queue (pending_queues["pending_provisional"].emplace_array (), m_pending_provisional);
-        add_queue (pending_queues["blocked"].emplace_array (), m_blocked);
-        auto &scheduled_queues = obj["scheduled_queues"].emplace_object ();
-        add_queue (scheduled_queues["running"].emplace_array (), m_running);
-        add_queue (scheduled_queues["rejected"].emplace_array (), m_rejected);
-        add_queue (scheduled_queues["canceled"].emplace_array (), m_canceled);
+        json::value pending;
+        pending.emplace_object ();
+        json::value pending_arr;
+        pending_arr.emplace_array ();
+        json_object_set (pending.get (), "pending", pending_arr.get ());
+        add_queue (pending_arr.get (), m_pending);
+        pending_arr.emplace_array ();
+        json_object_set (pending.get (), "pending_provisional", pending_arr.get ());
+        add_queue (pending_arr.get (), m_pending_provisional);
+        pending_arr.emplace_array ();
+        json_object_set (pending.get (), "blocked", pending_arr.get ());
+        add_queue (pending_arr.get (), m_blocked);
+
+        json::value scheduled;
+        scheduled.emplace_object ();
+        json::value scheduled_arr;
+        scheduled_arr.emplace_array ();
+        json_object_set (scheduled.get (), "running", scheduled_arr.get ());
+        add_queue (scheduled_arr.get (), m_running);
+        scheduled_arr.emplace_array ();
+        json_object_set (scheduled.get (), "rejected", scheduled_arr.get ());
+        add_queue (scheduled_arr.get (), m_rejected);
+        scheduled_arr.emplace_array ();
+        json_object_set (scheduled.get (), "canceled", scheduled_arr.get ());
+        add_queue (scheduled_arr.get (), m_canceled);
+
+        json_error_t err = {0};
+        jv = json::value (json::no_incref{},
+                          json_pack_ex (&err,
+                                        0,
+                                        // begin object
+                                        "{"
+                                        // policy
+                                        "s:s%"
+                                        // queue_depth
+                                        "s:I"
+                                        // max_queue_depth
+                                        "s:I"
+                                        // queue parameters
+                                        "s:O"
+                                        // policy parameters
+                                        "s:O"
+                                        // action counts
+                                        "s:o"
+                                        // pending queues
+                                        "s:O"
+                                        // scheduled queues
+                                        "s:O"
+                                        // end object
+                                        "}",
+                                        // VALUE START
+                                        // policy, str+length style
+                                        "policy",
+                                        this->policy ().data (),
+                                        this->policy ().length (),
+                                        // queue_depth
+                                        "queue_depth",
+                                        (json_int_t)m_queue_depth,
+                                        // max_queue_depth
+                                        "max_queue_depth",
+                                        (json_int_t)m_max_queue_depth,
+                                        // queue parameters
+                                        "queue_parameters",
+                                        qparams.get (),
+                                        // policy parameters
+                                        "policy_parameters",
+                                        pparams.get (),
+                                        // action counts
+                                        "action_counts",
+                                        json_pack ("{s:I s:I s:I s:I s:I s:I s:I}",
+                                                   "pending",
+                                                   m_pq_cnt,
+                                                   "running",
+                                                   m_rq_cnt,
+                                                   "reserved",
+                                                   m_oq_cnt,
+                                                   "rejected",
+                                                   m_dq_cnt,
+                                                   "complete",
+                                                   m_cq_cnt,
+                                                   "cancelled",
+                                                   m_cancel_cnt,
+                                                   "reprioritized",
+                                                   m_reprio_cnt),
+                                        // pending queues
+                                        "pending_queues",
+                                        pending.get (),
+                                        // scheduled queues
+                                        "scheduled_queues",
+                                        scheduled.get ()));
+        if (!jv.get ()) {
+            throw std::runtime_error (err.text);
+        }
     }
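The reference handling on the `+` side of the diff is easy to miss. `json::value (json::no_incref{}, ...)` appears to adopt the new reference returned by `json_pack_ex` without bumping its count. In the format string, `s:o` hands the temporary `action_counts` object to the parent, while `s:O` adds a reference to objects whose `json::value` wrappers presumably release their own reference on destruction. The `json::value` class itself is not shown in this thread, so the following is only a guess at the general pattern. A hypothetical `node_t` stands in for `json_t` so the sketch compiles without jansson, and all other names (`node_new`, `node_incref`, `node_decref`, `live_nodes`, `value`) are likewise made up for illustration.

```cpp
#include <utility>

// Hypothetical stand-in for a C reference-counted handle such as json_t.
struct node_t {
    int refcount = 1;  // a new node starts with one reference, owned by its creator
};

static int live_nodes = 0;  // instrumentation for this example only

inline node_t *node_new ()
{
    ++live_nodes;
    return new node_t{};
}
inline node_t *node_incref (node_t *n)
{
    if (n)
        ++n->refcount;
    return n;
}
inline void node_decref (node_t *n)
{
    if (n && --n->refcount == 0) {
        --live_nodes;
        delete n;
    }
}

struct no_incref {};  // tag type: adopt the caller's reference as-is

// RAII owner of exactly one reference to a node_t.
class value {
   public:
    value () = default;
    // Share: the caller keeps its own reference, so take an additional one.
    explicit value (node_t *n) : m_n (node_incref (n)) {}
    // Adopt: the caller gives up its reference, so do not increment.
    value (no_incref, node_t *n) : m_n (n) {}
    value (const value &o) : m_n (node_incref (o.m_n)) {}
    value (value &&o) noexcept : m_n (std::exchange (o.m_n, nullptr)) {}
    value &operator= (value o) noexcept
    {
        std::swap (m_n, o.m_n);
        return *this;
    }
    ~value () { node_decref (m_n); }
    node_t *get () const { return m_n; }

   private:
    node_t *m_n = nullptr;
};

// Adopting a fresh node and copying the wrapper frees it exactly once.
inline int adopt_and_copy ()
{
    {
        value owned{no_incref{}, node_new ()};  // refcount stays 1
        value shared{owned.get ()};             // refcount 2
        value copy = shared;                    // refcount 3
    }  // three destructors run, refcount reaches 0, node is deleted
    return live_nodes;
}

// Sharing a raw pointer leaves the caller responsible for its own reference.
inline int refcount_after_share ()
{
    node_t *raw = node_new ();  // refcount 1, owned by this function
    int seen = 0;
    {
        value v{raw};  // shares: refcount 2
        seen = raw->refcount;
    }                   // wrapper releases its reference: back to 1
    node_decref (raw);  // omitting this release would leak the node
    return seen;
}
```

Wrapping a freshly created handle with the sharing constructor instead of the adopting one is the kind of mistake that leaves a count permanently one too high, which is one way unfreed objects can arise.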

trws commented 3 months ago

I think we should test the other JSON libraries you listed to see if they have a substantial impact on Fluxion performance. I suspect partial cancel can be accelerated with a faster JSON library. I also wonder whether JSON manipulation will become a bottleneck with large JGFs.

I'll open another issue or discussion and we can look over some benchmarks/options and tradeoffs. Thanks for reviewing this, and sorry, all, for being a bit grumpy about it. I had everything basically working before realizing the CI is missing jammy and having to rework it all. Hopefully the stats will be useful.
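As a starting point for that comparison, a minimal timing harness might look like the following. It is a sketch using only the standard library: `median_ns` and `serialize_ids` are hypothetical names, and `serialize_ids` is a stand-in workload rather than a real JGF or jansson code path. A candidate library's serializer would be swapped in for it.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: call `fn` `reps` times and return the median wall
// time of a single call in nanoseconds (0 if reps is 0).
template<class Fn>
long long median_ns (Fn &&fn, std::size_t reps)
{
    if (reps == 0)
        return 0;
    std::vector<long long> samples;
    samples.reserve (reps);
    for (std::size_t i = 0; i < reps; ++i) {
        auto start = std::chrono::steady_clock::now ();
        fn ();
        auto stop = std::chrono::steady_clock::now ();
        samples.push_back (
            std::chrono::duration_cast<std::chrono::nanoseconds> (stop - start).count ());
    }
    // nth_element places the median at the middle position without a full sort.
    auto mid = samples.begin () + samples.size () / 2;
    std::nth_element (samples.begin (), mid, samples.end ());
    return *mid;
}

// Stand-in workload: serialize a list of job IDs as a JSON array of integers.
inline std::string serialize_ids (const std::vector<std::uint64_t> &ids)
{
    std::string out = "[";
    for (std::size_t i = 0; i < ids.size (); ++i) {
        if (i > 0)
            out += ",";
        out += std::to_string (ids[i]);
    }
    return out + "]";
}

// Usage: time the stand-in serializer on `count` synthetic job IDs.
inline long long time_serialize (std::size_t count, std::size_t reps)
{
    std::vector<std::uint64_t> ids (count);
    for (std::size_t i = 0; i < count; ++i)
        ids[i] = i;
    volatile std::size_t sink = 0;  // keeps the result observable
    return median_ns ([&] { sink = serialize_ids (ids).size (); }, reps);
}
```

The median is used instead of the mean so that a single slow run, for example from a page fault or scheduler preemption, does not dominate the result.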

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 86.02941% with 19 lines in your changes missing coverage. Please review.

Project coverage is 75.5%. Comparing base (8b2cb13) to head (fdf1038). Report is 124 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| qmanager/modules/qmanager_callbacks.cpp | 66.6% | 9 Missing :warning: |
| resource/modules/resource_match.cpp | 58.3% | 5 Missing :warning: |
| qmanager/policies/base/queue_policy_base.hpp | 94.5% | 3 Missing :warning: |
| qmanager/modules/qmanager.cpp | 50.0% | 2 Missing :warning: |

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master   #1265    +/-   ##
========================================
+ Coverage    75.4%   75.5%   +0.1%
========================================
  Files         107     111      +4
  Lines       15219   15331    +112
========================================
+ Hits        11487   11587    +100
- Misses       3732    3744     +12
```

| [Files with missing lines](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | |
|---|---|---|
| [qmanager/policies/queue\_policy\_bf\_base\_impl.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fqueue_policy_bf_base_impl.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvcXVldWVfcG9saWN5X2JmX2Jhc2VfaW1wbC5ocHA=) | `80.9% <ø> (-0.2%)` | :arrow_down: |
| [qmanager/policies/queue\_policy\_conservative.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fqueue_policy_conservative.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvcXVldWVfcG9saWN5X2NvbnNlcnZhdGl2ZS5ocHA=) | `100.0% <100.0%> (ø)` | |
| [...anager/policies/queue\_policy\_conservative\_impl.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fqueue_policy_conservative_impl.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvcXVldWVfcG9saWN5X2NvbnNlcnZhdGl2ZV9pbXBsLmhwcA==) | `75.0% <ø> (-1.0%)` | :arrow_down: |
| [qmanager/policies/queue\_policy\_easy.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fqueue_policy_easy.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvcXVldWVfcG9saWN5X2Vhc3kuaHBw) | `100.0% <100.0%> (ø)` | |
| [qmanager/policies/queue\_policy\_fcfs.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fqueue_policy_fcfs.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvcXVldWVfcG9saWN5X2ZjZnMuaHBw) | `100.0% <100.0%> (ø)` | |
| [qmanager/policies/queue\_policy\_fcfs\_impl.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fqueue_policy_fcfs_impl.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvcXVldWVfcG9saWN5X2ZjZnNfaW1wbC5ocHA=) | `72.9% <ø> (-0.4%)` | :arrow_down: |
| [qmanager/policies/queue\_policy\_hybrid.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fqueue_policy_hybrid.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvcXVldWVfcG9saWN5X2h5YnJpZC5ocHA=) | `100.0% <100.0%> (ø)` | |
| [src/common/c++wrappers/jansson.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=src%2Fcommon%2Fc%2B%2Bwrappers%2Fjansson.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2NvbW1vbi9jKyt3cmFwcGVycy9qYW5zc29uLmhwcA==) | `100.0% <100.0%> (ø)` | |
| [qmanager/modules/qmanager.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fmodules%2Fqmanager.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvbW9kdWxlcy9xbWFuYWdlci5jcHA=) | `73.2% <50.0%> (-0.3%)` | :arrow_down: |
| [qmanager/policies/base/queue\_policy\_base.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree&filepath=qmanager%2Fpolicies%2Fbase%2Fqueue_policy_base.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cW1hbmFnZXIvcG9saWNpZXMvYmFzZS9xdWV1ZV9wb2xpY3lfYmFzZS5ocHA=) | `79.6% <94.5%> (+1.7%)` | :arrow_up: |
| ... and [2 more](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1265?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | |