POETSII / Orchestrator

The Orchestrator is the configuration and run-time management system for POETS platforms.

"Place /tfill" putting all devices of same type on one thread #303

Open m8pple opened 2 years ago

m8pple commented 2 years ago

When placing a graph with 126 devices, tfill puts the 1 device of one type on the first thread, and the remaining 125 devices, all of one other type, on a single other thread. This makes execution extremely slow, and slightly larger graphs will fail to compose because too many devices are on one thread.

This is partially related to #302, but even if tfill is not a good default I would still not expect this behaviour for tfill, as it is not dishing devices out to the threads one by one.


Documentation read beforehand:

I couldn't find anything else about the expected behaviour of tfill in the documentation, but I assume it is not supposed to load up one thread.


Setup:

Input file: water.xml.zip

Orchestrator commands:

    load /app = "/home/dt10/poets-dpd/water.xml"
    tlink /app = *
    placement /tfill = *
    place /dump = *
    compose /app = *
    deploy /app = *
    initialise /app = *
    run /app = *

Placement log:

    [core]
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2,125
    [mailbox]
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00,126

Partial thread placement log:

    O_..blurble.c_0_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_1_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_2_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_3_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_4_0_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_0_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_1_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_2_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_3_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_4_1_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_0_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_1_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_2_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_3_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_4_2_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_0_3_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_1_3_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_2_3_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    O_..blurble.c_3_3_0,O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C2.T00
    ...

m8pple commented 2 years ago

Moving to "spread" worked as expected, with evenly spread device placement:

    [core]
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M00.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M10.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M10.C1,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M10.C2,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M10.C3,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M20.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M20.C1,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M20.C2,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M20.C3,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M30.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M30.C1,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M30.C2,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M30.C3,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M01.C0,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M01.C1,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M01.C2,1
    O_.POETSHardwareOneBox.ocfg.LoneBox.B00.M01.C3,1
    ...

It also dropped the graph run-time from 3 minutes to 3 seconds.

mvousden commented 2 years ago

To me, this looks like expected behaviour. The graph contains 1 device of type reaper, and 125 devices of type cell, so it places one reaper on the "first" thread in the "first" core pair, and the remaining devices on the "first" thread in the next core pair.

tfill (the thread-filling algorithm) fills up each thread in sequence, which I think is what placement.md explains.
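The behaviour described above can be illustrated with a toy model. This is only a sketch and not the Orchestrator's implementation: the function name `thread_fill`, its parameters, the value of 16 threads per core, and the rule "a new device type starts on the first thread of the next core pair" are all assumptions, taken from the description and the placement log in this thread. With no per-thread cap, a thread never counts as full, so every device of a type lands on one thread.

    from collections import Counter

    def thread_fill(devices, threads_per_core=16, cores_per_pair=2,
                    max_devices_per_thread=None):
        """Toy thread-filling placer (hypothetical, for illustration only).

        devices: list of (name, device_type) pairs, grouped by type.
        Returns a dict mapping device name to (core_index, thread_index).
        """
        placement = {}
        core, thread, used = 0, 0, 0
        current_type = None
        for name, dev_type in devices:
            if current_type is not None and dev_type != current_type:
                # Assumed rule: a new device type starts on the first
                # thread of the next core pair.
                core = (core // cores_per_pair + 1) * cores_per_pair
                thread, used = 0, 0
            elif (max_devices_per_thread is not None
                  and used >= max_devices_per_thread):
                # The current thread is "full": move to the next thread,
                # rolling over to the next core when the core is exhausted.
                thread, used = thread + 1, 0
                if thread == threads_per_core:
                    core, thread = core + 1, 0
            current_type = dev_type
            placement[name] = (core, thread)
            used += 1
        return placement

    # The graph from this issue: 1 reaper and 125 cells.
    graph = [("reaper", "reaper")] + [(f"c_{i}", "cell") for i in range(125)]

    uncapped = thread_fill(graph)
    per_core = sorted(Counter(c for c, _ in uncapped.values()).items())
    print(per_core)  # [(0, 1), (2, 125)]

    capped = thread_fill(graph, max_devices_per_thread=8)
    per_thread = Counter(capped.values())
    print(max(per_thread.values()))  # 8

Under these assumptions the uncapped run gives the same per-core counts as the placement log (1 device on C0, 125 on C2), while a cap of 8 spreads the 125 cells over 16 threads of the same core.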

Do you have any ideas for how I can change the documentation to make this clearer?

m8pple commented 2 years ago

I think the confusion comes from the idea of "filling a thread" - for the reader it feels like the placer has some idea about what it means to fill a thread, and will do something sensible with knowledge of per-thread capacity limits or something.

MaxDevicesPerThread then appears later as more of a tuning knob: it isn't mentioned in the default command flow, and it reads as a general hint. It does appear in the example in placement.md before tfill, but it isn't obvious that you should pretty much always set MaxDevicesPerThread if you are using tfill.
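As a rough illustration of why the cap matters, the smallest per-thread limit that still fits a given number of devices onto a given number of threads is a ceiling division. The helper name `min_devices_per_thread` and the thread counts used below are hypothetical and are not taken from the Orchestrator or its documentation.

    import math

    def min_devices_per_thread(device_count, thread_count):
        """Smallest per-thread cap that fits device_count devices
        onto thread_count threads (hypothetical helper)."""
        if thread_count <= 0:
            raise ValueError("thread_count must be positive")
        return math.ceil(device_count / thread_count)

    # 125 cell devices over an assumed 16 threads, and over an assumed 64.
    cap_16 = min_devices_per_thread(125, 16)
    cap_64 = min_devices_per_thread(125, 64)
    print(cap_16, cap_64)  # 8 2

A cap chosen this way is a lower bound for spreading the load evenly; any larger value lets a filling placer pack more devices onto the first threads and leave later ones empty.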

Overall I think that including tfill as the default is a bad choice anyway. Unless it is carefully calculated, for most users it is going to result in either:

Overall I would suggest: