POETSII / Orchestrator

The Orchestrator is the configuration and run-time management system for POETS platforms.

Placement error from "rand" placer #304

Open m8pple opened 2 years ago

m8pple commented 2 years ago

When trying to place a graph using the "rand" method, the following error shows up:

POETS> 14:20:53.71: 309(I) Attempting to place graph instance 'blurble' using the 'rand' method...
POETS> 14:20:53.71: 304(W) Unable to place graph instance 'blurble' - we tried, but an integrity check failed. You should shout at MLV (or whoever wrote the algorithm you're trying to use). In the short term, consider resetting the placer, and trying a different algorithm. Details: [ERROR] Use of algorithm 'rand' on application graph instance 'blurble' from file '/home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_14x14x14_8192.xml' resulted in some normal devices not being placed correctly. These devices are:
 - rr

This appears to happen for graphs over a certain size. Smaller graphs place successfully; larger graphs fail with the same error. This graph is approximately the smallest size at which placement fails.

Context

mvousden commented 2 years ago

I can't reproduce this (hohoho). That said, I have a few thoughts.

The random algorithm works in the following way (with implementationy bits in parentheses):

  1. For each device type, create a set of cores that could be used to house a device of that type. Recall that each core pair cannot house devices of different types. (This map is defined by Placer::define_valid_cores_map)
  2. For each device in the device graph, iterated in the order they are declared in the graph instance:
     a. Ignore that device if it is not a normal device (i.e. continue).
     b. If the set for devices of this type is empty, leave and let the integrity checker clean up (literally return -1).
     c. Choose a core at random from the set.
     d. For each thread in the selected core, if the selected thread has space, or has no other constraint that forbids the selected device from being placed upon it, go to "f". If no threads are legal, go to "e".
     e. Remove the selected core from the set, and go to "b".
     f. Place the selected device on the selected thread.
     g. Remove the selected core from each other set in the device type map we defined in "1".
  3. Redistribute each device in each core such that each thread is evenly loaded (calling Placer::redistribute_devices_in_gi)

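Based on this, I think the ff device is being left until the end, at which point there are no cores available for it: it is of a different device type, and by then every core holds a device of another type, so step "g" has already removed each core from the set for ff's type.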
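
The steps above, and the suspected failure, can be sketched in Python. This is a simplified illustration, not the Orchestrator's C++ implementation: it treats single cores rather than core pairs as the unit that holds one device type, skips the normal-device filter and the redistribution in step 3, and uses a per-thread capacity as the only placement constraint. The function name `place_random`, its parameters, and the device type names `bulk` and `special` are all hypothetical.

```python
import random


def place_random(devices, cores, threads_per_core, max_per_thread, rng):
    """Simplified model of the 'rand' steps described above (hypothetical API).

    devices: list of (name, device_type) in declaration order.
    Returns (placement, unplaced_names).
    """
    # Step 1: every device type starts with every core as a candidate.
    valid = {dtype: set(cores) for _, dtype in devices}
    load = {(core, t): 0 for core in cores for t in range(threads_per_core)}
    placement = {}
    for name, dtype in devices:
        placed = False
        while not placed:
            if not valid[dtype]:
                # Step 2b: no candidate cores left, so give up and report
                # whatever has not been placed.
                return placement, [n for n, _ in devices if n not in placement]
            core = rng.choice(sorted(valid[dtype]))  # step 2c
            for t in range(threads_per_core):  # step 2d
                if load[(core, t)] < max_per_thread:
                    load[(core, t)] += 1  # step 2f
                    placement[name] = (core, t)
                    # Step 2g: this core now belongs to dtype only.
                    for other, candidates in valid.items():
                        if other != dtype:
                            candidates.discard(core)
                    placed = True
                    break
            else:
                valid[dtype].discard(core)  # step 2e: core is full
    return placement, []


if __name__ == "__main__":
    cores = ["c0", "c1", "c2", "c3"]
    bulk = [(f"d{i}", "bulk") for i in range(1000)]
    # Lone device of a second type declared last, then declared first.
    print(place_random(bulk + [("ff", "special")], cores, 2, 1000, random.Random(0))[1])
    print(place_random([("ff", "special")] + bulk, cores, 2, 1000, random.Random(0))[1])
```

In this model, a lone device of a second type declared after many devices of the first type is left unplaced even though the threads have plenty of spare capacity, because the earlier devices have already claimed every core for their own type. Declaring it first lets it claim a core before that happens.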

Solutions:

Workarounds:

mvousden commented 2 years ago

Also, the error message in this case is pretty unhelpful - need to refactor that.

heliosfa commented 2 years ago

Allow the algorithm to "reserve" certain cores for certain device types (a bit like what the spread method does). This would make the algorithm "less random" however, so is undesirable.

One way I can think to keep the randomness is to introspect the number of device types that are actually used in the graph (not just declared) to work out how many sets of cores are needed and what the max/min number of cores should be, e.g. 1 set per type with minimum 1 and maximum something sensible. Allocate cores randomly to the sets. These sets then replace the set from step 1.
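
A rough sketch of that idea, in Python for brevity rather than the Orchestrator's C++. It assumes the simplest reading of the suggestion: count only the device types that actually appear in the instance, give each one a randomly chosen minimum number of cores, then hand out the remaining cores at random. The function name `partition_cores` and its parameters are hypothetical, and the "maximum something sensible" cap is left out.

```python
import random


def partition_cores(devices, cores, rng, min_cores=1):
    """Randomly split cores into one disjoint set per *used* device type.

    devices: list of (name, device_type); cores: list of core identifiers.
    Hypothetical helper: not part of the Orchestrator.
    """
    # Introspect the types that are actually instantiated, not just declared.
    used_types = sorted({dtype for _, dtype in devices})
    if len(cores) < min_cores * len(used_types):
        raise ValueError("not enough cores to give every used type its minimum")

    shuffled = list(cores)
    rng.shuffle(shuffled)
    remaining = iter(shuffled)

    # Guarantee each used type its minimum share first.
    sets = {dtype: set() for dtype in used_types}
    for dtype in used_types:
        for _ in range(min_cores):
            sets[dtype].add(next(remaining))

    # Hand every leftover core to a randomly chosen type.
    for core in remaining:
        sets[rng.choice(used_types)].add(core)
    return sets
```

Because every used type is guaranteed at least one core up front, a late-declared device of a rare type can no longer find its set empty purely because of declaration order. A real version would probably want to weight the split by how many devices of each type exist.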

m8pple commented 2 years ago

Put ff at the top of your DeviceInstances element in your application XML.

This application-specific workaround allows the "rand" method to place the graph correctly.
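
If editing a large generated XML file by hand is awkward, the reordering can be scripted. The sketch below uses Python's standard `xml.etree.ElementTree`; it assumes that each device instance is a direct child of the `DeviceInstances` element and carries an `id` attribute, which you should check against your own file. The helper name `move_device_first` is hypothetical.

```python
import xml.etree.ElementTree as ET


def move_device_first(device_instances, device_id):
    """Move the child whose 'id' attribute equals device_id to the front.

    device_instances: the already-located DeviceInstances element.
    Returns True if a matching child was found and moved.
    """
    for child in list(device_instances):
        if child.get("id") == device_id:
            device_instances.remove(child)
            device_instances.insert(0, child)
            return True
    return False


if __name__ == "__main__":
    # Tiny stand-in document; real application files are namespaced and larger.
    root = ET.fromstring(
        '<DeviceInstances>'
        '<DevI id="a" type="x"/><DevI id="b" type="x"/><DevI id="ff" type="y"/>'
        '</DeviceInstances>'
    )
    move_device_first(root, "ff")
    print([child.get("id") for child in root])
```

The function takes the `DeviceInstances` element itself, so locating it (including any XML namespace handling) is left to the caller.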