DARMA-tasking / vt

DARMA/vt => Virtual Transport

Object load modeling in place of implicit short-time persistence #582

Closed PhilMiller closed 4 years ago

PhilMiller commented 4 years ago

Presently, the load balancing infrastructure implicitly uses a shallow persistence model of object loads, in which the immediately preceding step directly predicts all subsequent steps up to the next call to the load balancer.

There are two issues with this:


PhilMiller commented 4 years ago

(will edit for further detail as I finish up other things I have in the air)

lifflander commented 4 years ago

@ppebay Can you have a look at this issue?

PhilMiller commented 4 years ago

I was going to fill in a lot more detail here, and then tag people, but that's fine.


PhilMiller commented 4 years ago

Architecturally, I believe the right design is one in which strategies take a future load model as an argument instead of an LB statistics database. They would be able to query that model for expected object behaviors (work, communication, memory footprint, etc.) for specific phases in the future.

The obvious first model to implement is a naive one-phase persistence model that just returns whatever is in the database regarding the immediately preceding phase, matching current behavior.

PhilMiller commented 4 years ago

In this view of the world, besides the load model, there are a few things still implicitly baked into the balancing strategy. In optimization terms, they would be the objective function and any constraints.

The naive objective function we're using now is a minimization of a rough estimate of execution time, obtained by summing the loads on each process and taking the max across processes. What we actually want is a summation, over each (sub)phase between when the strategy runs and when it will run next, of the max across processes for that (sub)phase.

PhilMiller commented 4 years ago

The most obvious constraint that one might want to impose as such would be an upper-bound on memory footprint.

Most other 'constraints' I can think of are more easily expressed in the load model or its input, such as certain objects being 'pinned' to particular processes.

PhilMiller commented 4 years ago

There's a sticky point that gets glossed over in the current implicit model, which I don't know how to address: objects whose elapsed wall-time for their work is different from their CPU time, because of communication or other dependencies on other, possibly-remote, objects.

PhilMiller commented 4 years ago

A few observations:

From the top down, I think an overall instantiation of a well-factored framework looks something like the following:

struct LoadOptimizer {
  // These are the inputs to the strategy
  WorkModel work;
  MemoryFootprintModel memory;
  CommunicationModel comm;

  Strategy strategy;
};

class LoadModel {
public:
  virtual ~LoadModel() = default;
  virtual void addData(unsigned int phase, unsigned int subphase,
                       std::shared_ptr<const LoadSnapshot> loads) = 0;
};

class WorkModel : public LoadModel {
public:
  virtual double get(ObjectID object, PhaseOffset when) = 0;
};

class MemoryFootprintModel : public LoadModel {
public:
  virtual Footprint get(ObjectID object, PhaseOffset when) = 0;
};

class CommunicationModel : public LoadModel {
public:
  virtual Communication get(ObjectID object, PhaseOffset when) = 0;
  virtual Communication get(ObjectID object, ObjectID partner, PhaseOffset when) = 0;
  virtual Communication get(ObjectID object, ProcessID partner, PhaseOffset when) = 0;
};

struct LoadSnapshot {
  double getWork(ObjectID object);
  Footprint getMemory(ObjectID object);

  // Graph edges
  Communication getCommunication(ObjectID object, ObjectID partner);
  Communication getCommunication(ObjectID object, ProcessID partner);

  // Cumulative
  Communication getCommunication(ObjectID object);

  // Background/fixed loads on a whole process
  double getWork(ProcessID proc);
  Footprint getMemory(ProcessID proc);
  Communication getCommunication(ProcessID proc);
};

struct Footprint {
  size_t persistent;
  // Transient memory footprint is measured as the excess over `persistent`,
  // so that constraint functions can be computed as
  // sum_obj(persistent) + max_obj(transient)
  size_t transient;
};

struct Communication {
  size_t sent_bytes, recv_bytes;
  size_t sent_messages, recv_messages;
};

struct PhaseOffset {
  unsigned int phases;
  static constexpr unsigned int NEXT_PHASE = 0;

  unsigned int subphase;
  static constexpr unsigned int WHOLE_PHASE = 0;
};

PhilMiller commented 4 years ago

Note that we could do all sorts of runtime tracking of how well various models are doing at predicting what they purport to, and over what horizon.

In the long run, we may want to factor things down even further, to have models per type of object, or per collection instance. With the relatively simple applications at hand, with mostly just a single object type, that may be unnecessary. OTOH, if ObjectID can be made to clearly identify the enclosing collection and the type thereof, then we can define model classes that delegate to subsidiary models based on those.

PhilMiller commented 4 years ago

Different strategies are going to have different data assembly requirements. At the very least, we can see cases for consolidated data, where the model receives the union of every process's inputs, and distributed data, where the model receives just the local process's input. There may be a case for replicated data, where a model instance on every process receives the consolidated data.

Models and Strategies should each have some trait indicator stating their requirements, and the framework should feed them accordingly.

PhilMiller commented 4 years ago

One thing neglected in the design I posted above is where the current object mapping gets passed in. I think it should be an added argument both to LoadModel::addData and to the Strategy when called, since we don't want to require models to store it, or give them the weird responsibility of reproducing it.

It ends up essentially describing the keys into the LoadSnapshot, whether they're the per-process subset of objects or the global set of objects.

PhilMiller commented 4 years ago

Per discussion with Jonathan, I should note that by 'phase' I mean the unit of periodic operation, and by 'subphase' I mean units of work that are comparable from one phase to the next.

PhilMiller commented 4 years ago

We identified iterative algorithms as a class of cases where sub-phases may need a deeper hierarchy or sub-substructure. These include solvers and collision detection, which take some micro-step a variable number of times, all as part of a larger operation.

lifflander commented 4 years ago

struct ElementStats {
  using PhaseType       = uint64_t;
  using ArgType         = vt::arguments::ArgConfig;

  ElementStats() = default;
  ElementStats(ElementStats const&) = default;
  ElementStats(ElementStats&&) = default;

  void startTime();
  void stopTime();
  void addTime(TimeType const& time);
  void recvObjData(
    ElementIDType to_perm, ElementIDType to_temp,
    ElementIDType from_perm, ElementIDType from_temp, double bytes, bool bcast
  );
  void recvFromNode(
    ElementIDType to_perm, ElementIDType to_temp, NodeType from,
    double bytes, bool bcast
  );
  void recvToNode(
    NodeType to, ElementIDType from_perm, ElementIDType from_temp,
    double bytes, bool bcast
  );
  void setModelWeight(TimeType const& time);
  void updatePhase(PhaseType const& inc = 1);
  PhaseType getPhase() const;
  TimeType getLoad(PhaseType const& phase) const;
  CommMapType const& getComm(PhaseType const& phase);

  template <typename Serializer>
  void serialize(Serializer& s);

public:
  template <typename ColT>
  static void syncNextPhase(PhaseMsg<ColT>* msg, ColT* col);

  template <typename ColT>
  friend struct collection::Migratable;

protected:
  bool cur_time_started_ = false;
  TimeType cur_time_ = 0.0;
  PhaseType cur_phase_ = fst_lb_phase;
  std::vector<TimeType> phase_timings_ = {};
  std::vector<CommMapType> comm_ = {};
};

PhilMiller commented 4 years ago

From further discussion, we're thinking about going with a representation of execution structure as follows:

PhilMiller commented 4 years ago

For the first-pass implementation, we'll ignore components and iterations entirely, flattening to just phases and subphases.

PhilMiller commented 4 years ago

Pushed some initial code with indirection to the NaivePersistence model on a 582-model-loads branch. It's applied for greedy and hierarchical right now, since those were easiest to modify - just one spot, it seemed.

I'll next substitute a model that also adds an imputed overhead per message received, to account for the costs that we're observing in practice, so that we can benchmark with the LB strategy getting a clearer practical view of the loads.

PhilMiller commented 4 years ago

That branch, along with a rebase onto beta.5 on 582-model-loads-beta5-rebase, now has a rough cut of the immediately relevant functionality - the naive model looks for messages to objects it knows about, and adds a millisecond to the reported workload for each message they received.

PhilMiller commented 4 years ago

That should be enough to run an application benchmark and see if this mitigates the problem of light objects that receive lots of messages being mis-assigned.

lifflander commented 4 years ago

Should we create a pull request for this load modeling work so it doesn't get lost?