flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
87 stars 41 forks source link

Segfault in edge access #509

Closed cmisale closed 4 years ago

cmisale commented 5 years ago

The segmentation fault is coming up when running match on any grug file with a "big" amount of memory (tried with 64 and 128 GB). Line https://github.com/flux-framework/flux-sched/blob/59c2417112f4c6406efdfaad72309626fc790f23/resource/traversers/dfu_impl.cpp#L496-L498

is issuing the segfault because of incorrect (*egroup_i).edges[0].edge value. I've been printing out those edge values right before lines reported, trying with tiny.grug and medium-LOD in https://github.com/flux-framework/flux-sched/issues/507#issuecomment-516941486. tiny.grug output for edges is the following

[DFU_impl] edge (2,5)
[DFU_impl] edge (2,4)
[DFU_impl] edge (3,6)

which terminates correctly. When running with medium-LOD, or also using other grug files like sierra in the examples folder, the edges look like this

[DFU_impl] edge (30,95)
[DFU_impl] edge (30,759)
[DFU_impl] edge (30,94)
[DFU_impl] edge (30,758)
[DFU_impl] edge (30,93)
[DFU_impl] edge (1099830461280,1099830461280)
[DFU_impl] edge (193273528321,0)
[DFU_impl] edge (137438953473,0)
Segmentation fault (core dumped)

I'm reporting some traces from both valgrind and gdb, they're basically the same. The segfault is raised when trying to print the edge.

Output from valgrind --tool=memcheck:

==317==  Access not within mapped region at address 0xB0
==317==    at 0x10040408: operator<< (ostream:171)
==317==    by 0x10040408: operator<< <char, std::char_traits<char>, boost::directed_tag, long unsigned int> (edge.hpp:117)
==317==    by 0x10040408: Flux::resource_model::detail::dfu_impl_t::dom_slot(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:496)
==317==    by 0x10040963: Flux::resource_model::detail::dfu_impl_t::dom_dfv(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:546)
==317==    by 0x1003FB7F: Flux::resource_model::detail::dfu_impl_t::explore(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::string const&, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::detail::visit_t, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:382)
==317==    by 0x1004005B: Flux::resource_model::detail::dfu_impl_t::dom_exp(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:441)
==317==    by 0x10040777: Flux::resource_model::detail::dfu_impl_t::dom_dfv(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:548)
==317==    by 0x1003FB7F: Flux::resource_model::detail::dfu_impl_t::explore(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::string const&, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::detail::visit_t, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:382)
==317==    by 0x1004005B: Flux::resource_model::detail::dfu_impl_t::dom_exp(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:441)
==317==    by 0x10040777: Flux::resource_model::detail::dfu_impl_t::dom_dfv(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:548)
==317==    by 0x1003FB7F: Flux::resource_model::detail::dfu_impl_t::explore(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::string const&, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::detail::visit_t, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:382)
==317==    by 0x1004005B: Flux::resource_model::detail::dfu_impl_t::dom_exp(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:441)
==317==    by 0x10040777: Flux::resource_model::detail::dfu_impl_t::dom_dfv(Flux::resource_model::detail::jobmeta_t const&, unsigned long, std::vector<Flux::Jobspec::Resource, std::allocator<Flux::Jobspec::Resource> > const&, bool, bool*, Flux::resource_model::scoring_api_t&) (dfu_impl.cpp:548)
==317==    by 0x10040A4F: Flux::resource_model::detail::dfu_impl_t::select(Flux::Jobspec::Jobspec&, unsigned long, Flux::resource_model::detail::jobmeta_t&, bool, unsigned int*) (dfu_impl.cpp:1089)

Output from gdb:

#1  Flux::resource_model::detail::dfu_impl_t::dom_slot (this=0x100e4af0, meta=..., u=<optimized out>, slot_shape=std::vector of length 2, capacity 2 = {...}, pristine=<optimized out>, excl=<optimized out>, dfu=...)
    at traversers/dfu_impl.cpp:496
#2  0x0000000010040964 in Flux::resource_model::detail::dfu_impl_t::dom_dfv (this=<optimized out>, meta=..., u=<optimized out>, resources=std::vector of length 1, capacity 1 = {...}, pristine=<optimized out>, 
    excl=0x3fffffffd910, to_parent=...) at traversers/dfu_impl.cpp:546
#3  0x000000001003fb80 in Flux::resource_model::detail::dfu_impl_t::explore (this=0x100e4af0, meta=..., u=<optimized out>, subsystem="containment", resources=std::vector of length 1, capacity 1 = {...}, 
    pristine=<optimized out>, excl=0x3fffffffdabd, direction=<optimized out>, dfu=...) at traversers/dfu_impl.cpp:382
#4  0x000000001004005c in Flux::resource_model::detail::dfu_impl_t::dom_exp (this=0x100e4af0, meta=..., u=2, resources=std::vector of length 1, capacity 1 = {...}, pristine=<optimized out>, excl=0x3fffffffdabd, dfu=...)
    at traversers/dfu_impl.cpp:441
#5  0x0000000010040778 in Flux::resource_model::detail::dfu_impl_t::dom_dfv (this=<optimized out>, meta=..., u=<optimized out>, resources=std::vector of length 1, capacity 1 = {...}, pristine=<optimized out>, 
    excl=0x3fffffffde50, to_parent=...) at traversers/dfu_impl.cpp:548
#6  0x000000001003fb80 in Flux::resource_model::detail::dfu_impl_t::explore (this=0x100e4af0, meta=..., u=<optimized out>, subsystem="containment", resources=std::vector of length 1, capacity 1 = {...}, 
    pristine=<optimized out>, excl=0x3fffffffdffd, direction=<optimized out>, dfu=...) at traversers/dfu_impl.cpp:382
#7  0x000000001004005c in Flux::resource_model::detail::dfu_impl_t::dom_exp (this=0x100e4af0, meta=..., u=1, resources=std::vector of length 1, capacity 1 = {...}, pristine=<optimized out>, excl=0x3fffffffdffd, dfu=...)
    at traversers/dfu_impl.cpp:441
#8  0x0000000010040778 in Flux::resource_model::detail::dfu_impl_t::dom_dfv (this=<optimized out>, meta=..., u=<optimized out>, resources=std::vector of length 1, capacity 1 = {...}, pristine=<optimized out>, 
    excl=0x3fffffffe390, to_parent=...) at traversers/dfu_impl.cpp:548
#9  0x000000001003fb80 in Flux::resource_model::detail::dfu_impl_t::explore (this=0x100e4af0, meta=..., u=<optimized out>, subsystem="containment", resources=std::vector of length 1, capacity 1 = {...}, 
    pristine=<optimized out>, excl=0x3fffffffe53d, direction=<optimized out>, dfu=...) at traversers/dfu_impl.cpp:382
#10 0x000000001004005c in Flux::resource_model::detail::dfu_impl_t::dom_exp (this=0x100e4af0, meta=..., u=0, resources=std::vector of length 1, capacity 1 = {...}, pristine=<optimized out>, excl=0x3fffffffe53d, dfu=...)
    at traversers/dfu_impl.cpp:441
#11 0x0000000010040778 in Flux::resource_model::detail::dfu_impl_t::dom_dfv (this=<optimized out>, meta=..., u=<optimized out>, resources=std::vector of length 1, capacity 1 = {...}, pristine=<optimized out>, 
    excl=0x3fffffffe680, to_parent=...) at traversers/dfu_impl.cpp:548
#12 0x0000000010040a50 in Flux::resource_model::detail::dfu_impl_t::select (this=0x100e4af0, j=..., root=0, meta=..., excl=<optimized out>, needs=0x3fffffffe840) at traversers/dfu_impl.cpp:1089
#13 0x0000000010039af4 in Flux::resource_model::dfu_traverser_t::schedule (this=0x100e4af0, jobspec=..., meta=..., x=<optimized out>, op=<optimized out>, root=0, needs=0x3fffffffe840, 
    dfv=std::unordered_map with 1 elements = {...}) at traversers/dfu.cpp:64
#14 0x000000001003a4b0 in Flux::resource_model::dfu_traverser_t::run (this=0x100e4af0, jobspec=..., writers=0x103bbaa0, op=<optimized out>, jobid=1, at=0x3fffffffe938) at traversers/dfu.cpp:244
#15 0x00003fffb79f2954 in Flux::resource_model::cmd_match (ctx=0x100e3890, args=std::vector of length 3, capacity 4 = {...}) at /flux-sched/resource/utilities/command.cpp:171
#16 0x000000001001fb9c in ResQWrapper::match (this=0x100e0780, subcmd="allocate", jobspec_fn="/home/job.yaml") at resqwrapper.cpp:324
#17 0x0000000010018cf0 in main () at main.cpp:10

Expanding frames:

(gdb) f 0
#0  operator<< <char, std::char_traits<char>, boost::directed_tag, long unsigned int> (e=..., e=..., os=...) at /usr/include/boost/graph/detail/edge.hpp:117
117     return os << "(" << e.m_source << "," << e.m_target << ")";
(gdb) l
112   template <class Char, class Traits, class D, class V>
113   std::basic_ostream<Char, Traits>& 
114   operator<<(std::basic_ostream<Char, Traits>& os,
115              const boost::detail::edge_desc_impl<D,V>& e)
116   {
117     return os << "(" << e.m_source << "," << e.m_target << ")";
118   }
119 #endif
120 
121 }
(gdb) f 1
#1  Flux::resource_model::detail::dfu_impl_t::dom_slot (this=0x100e4af0, meta=..., u=<optimized out>, slot_shape=std::vector of length 2, capacity 2 = {...}, pristine=<optimized out>, excl=<optimized out>, dfu=...)
    at traversers/dfu_impl.cpp:496
496                 cout << "[DFU_impl] edge " << (*egroup_i).edges[0].edge << std::endl;
(gdb) l
491             unsigned int j = 0;
492             unsigned int qc = dfu_slot.qualified_count (dom, slot_elem.type);
493             unsigned int count = m_match->calc_count (slot_elem, qc);
494             while (j < count) {
495                 auto egroup_i = dfu_slot.iter_cur (dom, slot_elem.type);
496                 cout << "[DFU_impl] edge " << (*egroup_i).edges[0].edge << std::endl;
497                 fflush(stdout);
498                 eval_edg_t ev_edg ((*egroup_i).edges[0].count,
499                                    (*egroup_i).edges[0].count, 1,
500                                    (*egroup_i).edges[0].edge);

I am running the following jobspec in both cases:

version: 1
resources:
- type: node
  count: 1
  with:
  - type: socket
    count: 1
    with:
    - type: slot
      count: 1
      label: default
      with:
      - type: core
        count: 1
      - type: memory
        count: 10
attributes:
  system:
    duration: 3600
tasks:
- command: nginx:1.12
  slot: default
  count:
    per_slot: 1

@dongahn @SteVwonder

dongahn commented 5 years ago

@cmisale: I will take a look at this on my returning flight. It may have something to do with the granularity of memory pool modeling (or something else of course).

dongahn commented 5 years ago

BTW, thank you so much for reporting these problems. Real use cases like this is the only way we can harden our software, @cmisale!

cmisale commented 5 years ago

Thanks! Please let me know if you need some other data.


Claudia Misale, PhD Research Staff Member Cognitive and Cloud Solutions Data Centric Systems IBM T. J. Watson Research Center

E-mail: c.misale@ibm.com Phone: +1 (914) 945-1693

From: "Dong H. Ahn" notifications@github.com To: flux-framework/flux-sched flux-sched@noreply.github.com Cc: Claudia Misale c.misale@ibm.com, Mention mention@noreply.github.com Date: 08/15/2019 12:42 PM Subject: [EXTERNAL] Re: [flux-framework/flux-sched] Segfault in edge access (#509)

@cmisale: I will take a look at this on my returning flight. It may have something to do with the granularity of memory pool modeling (or something else of course).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

dongahn commented 5 years ago

@cmisale:

I think I understand the problem. This is an outstraight bug in handling slot type in a jobspec.

We treat slot in a different way than any other resource type because it is the only non-physical resource type. How we handle is this is : 1) perform a DFV subtree walk for a slot type, 2) divide the resource set discovered from this DFV walk equally according to the slot shape, 3) check if the given number of slots can be satisfied. There turns out to be a bug in 2).

For example, if your jobspec is socket[1]->slot[1]->core[2] and your machine is socket[1]->core[8], then at the end of the DFV walk on the subtree rooted at a socket vertex, we will give you core: 8. Then, we divide 8 by 2 which means that we can have 4 slots. So clearly this satisfies slot[2].

Now, there is an issue in the current code. The code doesn't factor into account the granularity in which a resource pool vertex is modeled/constructed for this.

Say, each memory pool granularity is 64GB and your machine has 4 of those. In terms of quantity, your machines has socket[1]->memory[256]. But in terms of scheduleable units you only have 4 (i.e., 4 x 64GB). Now if you jobspec is socket[1]->memory[32], we will currently say you have 8 slots when in fact you only can fit 4 slots because of this granularity constraint! This leads to a buffer overflow.

I have a patch to fix this below, which I haven't tested throughly but you are welcome to try.

BTW, I noticed that the exampleSierra GRUG models that each socket has only two 64GB memory pool vertices while it has 22 core vertices. So in this case, your above jobspec will be constrained by memory rather than core.

With this granularity, the jobspec will matched with socket[1]->core[1] and ->memory[64], which will let you only schedule 4 jobs per node. I don't know if this is what you wanted, but I thought I should point out.

diff --git a/resource/evaluators/edge_eval_api.cpp b/resource/evaluators/edge_eval_api.cpp
index daacc68..4732fb3 100644
--- a/resource/evaluators/edge_eval_api.cpp
+++ b/resource/evaluators/edge_eval_api.cpp
@@ -149,6 +149,11 @@ unsigned int evals_t::qualified_count () const
     return m_qual_count;
 }

+unsigned int evals_t::qualified_granules () const
+{
+    return m_eval_egroups.size ();
+}
+
 unsigned int evals_t::total_count () const
 {
     return m_total_count;
diff --git a/resource/evaluators/edge_eval_api.hpp b/resource/evaluators/edge_eval_api.hpp
index f4b3914..b0a28ec 100644
--- a/resource/evaluators/edge_eval_api.hpp
+++ b/resource/evaluators/edge_eval_api.hpp
@@ -73,6 +73,7 @@ public:
     // This can throw out_of_range exception
     const eval_egroup_t &at (unsigned int i) const;
     unsigned int qualified_count () const;
+    unsigned int qualified_granules () const;
     unsigned int total_count () const;
     int64_t cutline () const;
     int64_t set_cutline (int64_t cutline);
diff --git a/resource/evaluators/scoring_api.cpp b/resource/evaluators/scoring_api.cpp
index e04ce62..019480e 100644
--- a/resource/evaluators/scoring_api.cpp
+++ b/resource/evaluators/scoring_api.cpp
@@ -180,6 +180,14 @@ unsigned int scoring_api_t::qualified_count (const subsystem_t &s,
     return res_evals->qualified_count ();
 }

+unsigned int scoring_api_t::qualified_granules (const subsystem_t &s,
+                                                const std::string &r)
+{
+    handle_new_keys (s, r);
+    auto res_evals = (*m_ssys_map[s])[r];
+    return res_evals->qualified_granules ();
+}
+
 unsigned int scoring_api_t::total_count (const subsystem_t &s,
                                          const std::string &r)
 {
diff --git a/resource/evaluators/scoring_api.hpp b/resource/evaluators/scoring_api.hpp
index 2689c8d..34113db 100644
--- a/resource/evaluators/scoring_api.hpp
+++ b/resource/evaluators/scoring_api.hpp
@@ -58,6 +58,7 @@ public:
     const eval_egroup_t &at (const subsystem_t &s, const std::string &r,
                              unsigned int i);
     unsigned int qualified_count (const subsystem_t &s, const std::string &r);
+    unsigned int qualified_granules (const subsystem_t &s, const std::string &r);
     unsigned int total_count (const subsystem_t &s, const std::string &r);
     unsigned int best_k (const subsystem_t &s, const std::string &r);
     unsigned int best_i (const subsystem_t &s, const std::string &r);
diff --git a/resource/traversers/dfu_impl.cpp b/resource/traversers/dfu_impl.cpp
index 8b8776a..65d3578 100644
--- a/resource/traversers/dfu_impl.cpp
+++ b/resource/traversers/dfu_impl.cpp
@@ -451,17 +451,29 @@ int dfu_impl_t::cnt_slot (const vector<Resource> &slot_shape,
                           scoring_api_t &dfu_slot)
 {
     unsigned int qc = 0;
+    unsigned int qg = 0;
     unsigned int fit = 0;
     unsigned int count = 0;
     unsigned int qual_num_slots = UINT_MAX;
     const subsystem_t &dom = m_match->dom_subsystem ();

     // qualifed slot count is determined by the most constrained resource type
+    // both in terms of the amounts available as well as the number of edges into
+    // that resource because that represent the match granularity.
+    // Say, you have 128 units of memory available across two memory resource
+    // vertices each with 64 units of memory and you request 1 unit of memory.
+    // In this case, you don't have 128 slots available because the match
+    // granularity is 64 units. Instead, you have only 2 slots available each
+    // with 64 units, and your request will get 1 whole resource vertex.
     qual_num_slots = UINT_MAX;
     for (auto &slot_elem : slot_shape) {
         qc = dfu_slot.qualified_count (dom, slot_elem.type);
+        qg = dfu_slot.qualified_granules (dom, slot_elem.type);
         count = m_match->calc_count (slot_elem, qc);
+        // constraint check against qualified amounts
         fit = (count == 0)? count : (qc / count);
+        // constraint check against qualified granules
+        fit = (fit > qg)? qg : fit;
         qual_num_slots = (qual_num_slots > fit)? fit : qual_num_slots;
         dfu_slot.rewind_iter_cur (dom, slot_elem.type);
     }
cmisale commented 5 years ago

@dongahn you nailed it! I have incorporated the changes and it is working good. I didn't get any segmentation fault so far. I am doing now a deeper testing to confirm that. Thanks

dongahn commented 5 years ago

@cmisale: Great! Yeah let me know if you see any other problem, this is really useful for me given the mature of this software.

cmisale commented 5 years ago

No more errors, we can close it

dongahn commented 5 years ago

The patch hasn't made the master but I will make sure to create a PR soonish.

milroy commented 4 years ago

@dongahn: @cmisale noted that the changes described here haven't been integrated into master yet, so I'm reopening this issue so we don't forget.

milroy commented 4 years ago

I'm re-closing this issue since the fix was integrated in merged PR #548.