Closed mvousden closed 4 years ago
I have been doing a little more playing with this in an attempt to break it/find flaws.
I wanted to validate your times from this:
> A 128x128 heated plate, placed on 32, 16, 8, and 4 devices per thread. The former three complete in 104, 44, and 19 seconds respectively.
and compare to the current state of play. Unfortunately, from the documentation and a quick search of the source, I can't see how to constrain the number of devices per thread at runtime without recompiling. From the docs, I was guessing that it would be the `placement /constraint` command, but the docs state that this is not defined? Care to enlighten me?
I'll comment on the docs in more detail, but they are lacking in examples and a command "quick reference".
> The latter causes the Orchestrator to complain that there is not enough space to place the task (as it should). I've also tested 40x40 and 3x3 heat plates in various configurations.
Interesting, a 128x128 at 4 per thread should fit on the full system - it does with the "old" placement and the byron uif. A 6-board box has 6,144 threads - we would use 4097 of these for the 16,384 devices of a 128x128 plate.
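As a rough sanity check of the arithmetic (a sketch using the counts quoted above; the real placer may reserve threads or count slightly differently, so treat this as approximate):

```python
import math

# Counts quoted above: a 128x128 plate has 16,384 devices, and a
# 6-board box has 6,144 threads.
devices = 128 * 128
threads_available = 6 * 1024

# Check each devices-per-thread cap from the original timing test.
for cap in (32, 16, 8, 4):
    threads_needed = math.ceil(devices / cap)
    print(f"{cap} devices/thread: need {threads_needed} threads, "
          f"fits: {threads_needed <= threads_available}")
```

By this count, even the tightest cap of 4 devices per thread needs only 4,096 of the 6,144 threads, so the plate should fit.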
> All plates were tested using bucket filling.
How do I specify this as an algorithm? I am assuming that it defaults to it, but there is no list of supported algorithms.
Another thing, just to see what would happen, I threw my call files at it.
```
task /path = "/home/gmb/Orchestrator/application_staging/xml"
topology /load = "/local/orchestrator-common/single-box.uif"
topology /constrain="DevicesPerThread",4
task /load = "plate_128x128.xml"
link /link = "plate_128x128"
task /build = "plate_128x128"
task /deploy = "plate_128x128"
task /init = "plate_128x128"
task /run = "plate_128x128"
```
I get this error:
```
POETS> 19:24:32.70: 136(I) Whatever you asked for, it's not working yet
```
from the `topology /constrain="DevicesPerThread",4` line. While I don't expect this to work, it would be nicer if the error told me which command from the call file it did not like.
It also seems like the `link /link = "plate_128x128"` line works, as it results in
```
POETS> 19:28:00.69: 209(I) Attempting to place task 'plate_128x128' using the 'link' method...
```
despite the docs not mentioning it?
Feedback from our conversation:
- Support `placement /constraint = DevicesPerThread(N)`.
- Support `placement /constraint = ThreadsPerCore(N)`, understanding that it will tank simulated annealing performance.
- Add a list of implemented algorithms to the documentation.
- Add a list of implemented constraints to the documentation.
- Add a small set of examples to the top of S4 in the documentation, illustrating how placement can work.
- 128x128 at 4 devices/thread should fit; check that it actually does, in contrast with the OP.
- Add `link /link` to the documentation.
- Define the default value of maximum-devices-per-thread in the documentation (or at least say where the constant is defined).
@heliosfa: So I've made the agreed changes and tested them locally, but I haven't tested them on Tinsel hardware. If you want to have a pop at them feel free, but I'm tired (and a little ill). If you don't want to have a go that is fine also - I will do them tomorrow morning in that case.
For reference, my batch script has been (with various applications):
```
topology /load = "/home/mark/repos/orchestrator/Tests/StaticResources/Dialect3/Valid/1_box.uif"
task /path = "/home/mark/repos/orchestrator/application_staging/xml/"
task /load = "micromagnetics_1d_100.xml"
place /constrain = "MaxDevicesPerThread", 99999
place /constrain = "MaxDevicesPerThread", 5
place /constrain = "MaxThreadsPerCore", 99999
place /constrain = "MaxThreadsPerCore", 3
place /gc = "micromagnetics"
place /dump = "micromagnetics"
task /build = "micromagnetics"
```
I also inspect the dumps (in `bin`) to be sure the behaviour matches my expectations.
I've updated the documentation as well, which will explain what `place /gc` does, amongst other things.
Edit: If you want to play with the simulated annealing and gradient-less climber algorithms, beware that they will take a while unless you modify `ITERATION_MAX` in `SimulatedAnnealing.h` (this will eventually be part of the placement configuration file, never fear). `1000` is a good number, so try `12345`.
So, doing some testing on Ayres, I get this:
```
POETS> 23:38:21.75: 204(W) Unable to place task 'plate_128x128' - we tried, but an integrity check failed. You should shout at MLV (or whoever wrote the algorithm you're trying to use). In the short term, consider resetting the placer, and trying a different algorithm. Details: [ERROR] Hard constraints were violated when using algorithm 'rand' on the task from file '/home/gmb/Orchestrator/application_staging/xml/plate_128x128.xml'. The violated constraints are:
 - Maximum devices per thread must be less than or equal to 8.
```
When I run
```
task /path = "/home/gmb/Orchestrator/application_staging/xml"
topology /load = "/local/orchestrator-common/single-box.uif"
task /load = "plate_128x128.xml"
place /constrain = "MaxDevicesPerThread", 8
place /rand = "plate_128x128"
task /build = "plate_128x128"
task /deploy = "plate_128x128"
task /init = "plate_128x128"
task /run = "plate_128x128"
```
If I do `link /link = "plate_128x128"` instead of `place /rand = "plate_128x128"`, it places.
Also, something still seems to be a little amiss: if I constrain `MaxDevicesPerThread` to 4, it complains with:
```
POETS> 23:54:57.70: 209(I) Attempting to place task 'plate_128x128' using the 'link' method...
POETS> 23:54:57.70: 206(W) Unable to place task 'plate_128x128' - not enough space in the hardware model.
```
Constraining to 6 works; constraining to 5 does not. This happens with the 1-box UIF in the repo as well.
In any case, a devices-per-thread constraint of 4 should work for this.
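A back-of-envelope check supports this (a sketch assuming uniform spreading and no reserved threads, so the true bound may differ slightly):

```python
import math

# With 16,384 devices (128x128 plate) and 6,144 threads in a one-box
# system, the smallest devices-per-thread cap that can possibly fit is:
min_cap = math.ceil((128 * 128) / (6 * 1024))
print(min_cap)  # 3, so caps of 4, 5, and 6 should all leave enough room
```

Under that estimate, the failures at caps of 4 and 5 look like a placer bug rather than a genuine capacity limit.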
I think I've fixed the issues you've raised. For the 128x128 plate at 4 devices per thread, I get 11 seconds for bucket-filled placement.
The build times are very slow though (~200 seconds for random placement), and dominate everything. Not much we can do about it for now.
Unfortunately, something else seems to have gone wrong somewhere; when I execute my batch script, I get:
```
10:52:47.02: 102(I) Task graph default file path is || ||
POETS> 10:52:47.02: 103(I) New path is ||/home/gmb/Orchestrator/application_staging/xml/||
POETS> 10:52:47.02: 140(I) Topology loaded from file ||/local/orchestrator-common/single-box.uif||.
POETS> 10:52:48.70: 222(I) Constraining the maximum number of placed devices per thread in the hardware model to 4.
POETS> 10:52:48.70: 209(I) Attempting to place task 'plate_128x128' using the 'link' method...
POETS> 10:52:48.70: 202(I) Task 'plate_128x128' placed successfully.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4e0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
POETS> 10:53:03.56: 504(E) Mothership: Error decoding MPI message with key '0xd4a0000': Expected non-empty string in field 1. Ignoring message.
```
Sometimes it complains about key `0xd4e0000`.
This is the batch script:
```
task /path = "/home/gmb/Orchestrator/application_staging/xml"
topology /load = "/local/orchestrator-common/single-box.uif"
task /load = "plate_128x128.xml"
place /constrain = "MaxDevicesPerThread",4
link /link = "plate_128x128"
task /deploy = "plate_128x128"
task /init = "plate_128x128"
task /run = "plate_128x128"
```
EDIT: whoops, I forgot the build line.
I just got 6 seconds with bucket filling so that seems good.
This changeset introduces the designed Placer structure and interface to the Orchestrator. It maintains feature parity with the existing Orchestrator, and introduces some experimental features, including a simulated-annealing placer, a random placer, and the ability to save and load placement information (experimental here means that it either doesn't work end-to-end, or is so nuanced as to be unusable by anyone apart from @mvousden).
There are obvious components of this changeset that are not finished. Our intention is to introduce a working set of components that plays nicely with the rest of the Orchestrator before adding further features (i.e., review/merge as early and often as possible).
This changeset has been tested as follows (all on Ayres):
- The unit tests (which all pass). Note the lack of placement unit tests; I plan to add these when the V4 parser is introduced, to avoid writing the same tests multiple times.
- The Mothership test examples in the orchestrator-examples repository.
- A 128x128 heated plate, placed on 32, 16, 8, and 4 devices per thread. The former three complete in 104, 44, and 19 seconds respectively. The latter causes the Orchestrator to complain that there is not enough space to place the task (as it should). I've also tested 40x40 and 3x3 heat plates in various configurations. All plates were tested using bucket filling.
A more fine-grained overview of the changes:
- Removes the old placement and constraints system completely, and replaces it with a new system. Threads no longer hold devices (and vice versa); this relationship is encapsulated by the placement logic. Unlinking is no longer managed by devices, and is instead also managed by the placement logic.
- Introduces "core pairs" (`P_core::pair`) into the hardware model and deployers, to better support shared instruction memory behaviour.
- Moves some placement conveniences (like `P_engine::get_boxes_for_task`) out of the hardware model and into the placement system.
- `OrchBaseLink.cpp` is now `OrchBasePlace.cpp`, and `place` operator commands now exist in place of `link` operator commands (supporting backwards compatibility for bucket filling).
- Integration of the new placement system into `OrchBaseTask.cpp`, including in initialisation and deployment subcommands.
- Integration of the new placement system into `OrchBaseTopo.cpp`. Loading a new hardware model irrecoverably clobbers placement information.
- `P_builder`. Good luck with that.
- Lots of (accidental) trailing whitespace removal (sorry).
- Adds a series of flags for GDB and Valgrind executables in the launcher. Purely a convenience mechanism.
- Disables contracted floating point operations when compiling the Softswitch. I thought this was done elsewhere...
- Adds a 1-box test for Dialect 3 (also used in testing).
This PR complements the PR at https://github.com/POETSII/orchestrator-documentation/pull/4. When both of these PRs are approved, I will:
- Merge them.
- Close (now redundant) PR #118.
Edit 2020-09-07: Grammar