"Place /tfill" putting all devices of same type on one thread

m8pple commented 2 years ago

When placing a graph with 126 devices, it puts 1 device of one type on the first thread, and the remaining 125 of the same type on the next thread. This makes it extremely slow to execute, and slightly larger graphs will fail to compose because too many devices are on one thread.

This is partially related to #302, but even if tfill is not a good default I would still not expect this behaviour for tfill, as it is not dishing devices out to the threads one by one.

Documentation read beforehand:

user_guide.md
- "placement /tfill: Given a typelinked application graph (or multiple), places it onto the hardware by filling each thread in sequence."
placement.md
- "tfill: A thread-filling placement, where the threads in the hardware model are filled in sequence. This placement mechanism is device-type aware."

I couldn't find anything else about the expected behaviour of tfill in the documentation, but I assume it is not supposed to load nearly every device onto a single thread.

Setup:

Orchestrator : e74e6ee1e353935e074e4f9a522020aa029f9351 / FEATURE-0242-HardwareIdle
- No modifications made
Hardware : jennings
- Using stock tinsel in /local/tinsel

Input file: water.xml.zip

Orchestrator commands:

    load /app = "/home/dt10/poets-dpd/water.xml"
    tlink /app = *
    placement /tfill = *
    place /dump = *
    compose /app = *
    deploy /app = *
    initialise /app = *
    run /app = *

Placement log:

    [core]
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2,125
    [mailbox]
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00,126

Partial thread placement log:

    O_..blurble.c_0_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_1_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_2_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_3_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_4_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_0_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_1_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_2_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_3_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_4_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_0_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_1_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_2_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_3_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_4_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_0_3_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_1_3_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_2_3_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_3_3_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    ...
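
For anyone wanting to check a thread placement log for this kind of imbalance, a minimal sketch follows. It assumes each line has the form `device,thread`, which is inferred only from the excerpt above; `devices_per_thread` is a hypothetical helper, not part of the Orchestrator.

```python
from collections import Counter

def devices_per_thread(lines):
    """Count devices per thread from 'device,thread' lines.

    The line format is an assumption inferred from the log excerpt.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blanks, section headers such as "[core]", and elision markers.
        if not line or line.startswith("[") or line == "...":
            continue
        device, _, thread = line.rpartition(",")
        if not device:
            continue  # no comma on this line, so it is not a placement entry
        counts[thread] += 1
    return counts

sample = [
    "O_..blurble.c_0_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00",
    "O_..blurble.c_1_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00",
    "...",
]
counts = devices_per_thread(sample)
# List the busiest threads first.
for thread, n in counts.most_common():
    print(f"{n:4d}  {thread}")
```

Run over the full log, a single thread with a count close to the total number of devices is the symptom described in this issue.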

m8pple commented 2 years ago

Moving to "spread" worked as expected, with evenly spread device placement:

    [core]
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M10.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M10.C1,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M10.C2,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M10.C3,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M20.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M20.C1,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M20.C2,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M20.C3,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M30.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M30.C1,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M30.C2,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M30.C3,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M01.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M01.C1,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M01.C2,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M01.C3,1
    ...

Plus it dropped graph run-time from 3 minutes down to 3 seconds.
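
A toy model of why an even spread helps: if devices are dealt out to threads one at a time, the busiest thread holds only a small share of the graph. This is a round-robin sketch, not the Orchestrator's actual spread algorithm (which this thread does not describe), and the thread count of 16 is a hypothetical figure.

```python
def spread(devices, threads):
    """Deal devices out to threads one at a time (round-robin)."""
    placement = {t: [] for t in threads}
    for i, dev in enumerate(devices):
        placement[threads[i % len(threads)]].append(dev)
    return placement

cells = [f"c_{i}" for i in range(125)]
threads = [f"T{i:02d}" for i in range(16)]  # hypothetical thread count

placement = spread(cells, threads)
loads = [len(devs) for devs in placement.values()]
print(max(loads), min(loads))  # prints "8 7"
```

In this model the busiest thread handles 8 devices instead of 125.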

mvousden commented 2 years ago

To me, this looks like expected behaviour. The graph contains 1 device of type reaper, and 125 devices of type cell, so it places one reaper on the "first" thread in the "first" core pair, and the remaining devices on the "first" thread in the next core pair.

tfill (the thread-filling algorithm) fills up each thread in sequence, which I think is what placement.md explains.
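
The behaviour described above can be sketched as follows. This is a simplified model, not the Orchestrator source: the hardware layout (4 core pairs of 8 threads), the thread names, and the `max_per_thread` parameter standing in for MaxDevicesPerThread are all assumptions made for illustration, and 1000 is used only as a cap large enough to never bind.

```python
def tfill(devices_by_type, core_pairs, max_per_thread):
    """Fill threads in sequence; each device type starts on a fresh core pair."""
    placement = {}
    pair_idx = 0
    for dev_type, devices in devices_by_type.items():
        remaining = list(devices)
        while remaining:
            if pair_idx >= len(core_pairs):
                raise RuntimeError(f"no core pairs left for type {dev_type!r}")
            for thread in core_pairs[pair_idx]:
                if not remaining:
                    break
                # Put as many devices on this thread as the cap allows.
                placement[thread] = remaining[:max_per_thread]
                remaining = remaining[max_per_thread:]
            pair_idx += 1  # the next type (or overflow) uses the next core pair
    return placement

# Hypothetical hardware: 4 core pairs with 8 threads each.
core_pairs = [[f"P{p}.T{t}" for t in range(8)] for p in range(4)]
graph = {"reaper": ["reaper"], "cell": [f"c_{i}" for i in range(125)]}

uncapped = tfill(graph, core_pairs, max_per_thread=1000)
capped = tfill(graph, core_pairs, max_per_thread=8)

print({t: len(d) for t, d in uncapped.items()})  # {'P0.T0': 1, 'P1.T0': 125}
print(len(capped), max(len(d) for d in capped.values()))  # 17 8
```

With a cap that never binds, the model reproduces the 1 and 125 split from the placement log. With a cap of 8, the same graph occupies 17 threads.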

Do you have any ideas for how I can change the documentation to make this clearer?

m8pple commented 2 years ago

I think the confusion comes from the idea of "filling a thread" - for the reader it feels like the placer has some idea about what it means to fill a thread, and will do something sensible with knowledge of per-thread capacity limits or something.

MaxDevicesPerThread then appears later, looking more like a tuning knob or a general hint, as it isn't mentioned in the default command flow. It does appear in the example in placement.md before tfill, but it isn't obvious that you should pretty much always set MaxDevicesPerThread if you are using tfill.

Overall I think that including tfill as the default is a bad choice anyway. Unless MaxDevicesPerThread is carefully calculated, for most users it is going to result in either:

- Under-utilisation of the available hardware, making the application much slower; or
- Failure to place.

In terms of experimenting with the orchestrator it makes sense, but for the user who wants to get a graph running using reasonable defaults it doesn't work.

Overall I would suggest:

- Remove all uses of tfill in the user_guide for the example flows, and replace with spread.
- On the documentation for tfill: strongly recommend that MaxDevicesPerThread is set if tfill is used, and note that the optimal value may require experimentation.
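
As a rough starting point for that experimentation, the trade-off between under-utilisation and failure to place comes down to simple arithmetic. The sketch below ignores device-type awareness, and the figure of 64 available threads is a hypothetical number, not a property of any real POETS box.

```python
import math

def threads_needed(n_devices, max_per_thread):
    """Threads occupied when each thread is filled up to the cap."""
    return math.ceil(n_devices / max_per_thread)

def smallest_cap_that_fits(n_devices, n_threads):
    """Smallest per-thread cap that still lets every device be placed."""
    return math.ceil(n_devices / n_threads)

n_devices, n_threads = 125, 64  # thread count is hypothetical
for cap in (1, 2, 8, 125):
    need = threads_needed(n_devices, cap)
    outcome = "fits" if need <= n_threads else "fails to place"
    print(f"cap={cap:3d}: {need:3d} of {n_threads} threads, {outcome}")

print("smallest cap that fits:", smallest_cap_that_fits(n_devices, n_threads))
```

Too small a cap fails to place, too large a cap leaves most threads idle, and the smallest cap that fits uses the most hardware. Whether that is also the fastest setting would still need measuring.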

POETSII / Orchestrator

"Place /tfill" putting all devices of same type on one thread #303