Guidance on default placement method in user documentation

m8pple commented 2 years ago

Documentation read beforehand:

- user_guide.md
- placement.md
- Orch_Vol_I, release 1.4.0
  - Couldn't find anything related to placement.
- Orch_Vol_III.pdf, release 1.4.0
  - "4.4 Placement subsystem": this was empty.

The default placement method in the user guide is "tfill", and it has a number of aliases as well, so it comes across as the suggested default. I'm not sure what a good default placement method would be, but "tfill" seems like a bad default choice given the hyper-threading within each core in tinsel.

If there are concerns about the run-time cost of "sa", would "random" be a better default, as it might cause more even loading of cores and DRAM channels? Or possibly "spread" rather than "tfill" (although spread is also thread-based, so not sure).

If "tfill" is considered the best default at the moment it would be useful to have a note saying "recommended" or something in section 5.9 of the user guide to indicate that it is a good default choice, as "sa" seems more attractive but is then not recommended.

m8pple commented 2 years ago

Actually, in placement.pdf I see the comment:

> To support different algorithms, selectable at run time by the Orchestrator operator. In early iterations, thread (bucket) filling and simulated annealing is sufficient

so maybe users should start with tfill and move on to sa. Or try both?

m8pple commented 2 years ago

This issue was prompted by the very slow run-times caused by tfill (#303). At least at the moment, spread is drastically better than tfill:

- "Spread" can fit large graphs, while "tfill" can't do graphs with 200-ish devices.
- "Spread" is about two orders of magnitude faster at run-time, even for 100-ish node graphs.

mvousden commented 2 years ago

I think spread is more generally useful. The tfill default is a holdover from when we had bucket filling as the default (and we didn't want to break existing scripts). I'm quite happy to change this, if you think it's wise.

heliosfa commented 2 years ago

tfill being the default is definitely a holdover from when it was the only placement method, and it hangs on to the idea that a thread hosts a set number of devices (initially 1024). It can be useful for getting compact placements if properly constrained in the number of devices per thread and the number of threads per core. Unless you have a very large problem that would be run at 256 devices/thread, it is going to be slow.
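To make the slowness concrete, here is a minimal sketch of thread (bucket) filling under a simplified model: a single device type, with each thread filled to its cap before the next thread is used. The function name is hypothetical and is not part of the Orchestrator.

```python
import math


def tfill_threads_used(num_devices: int, max_devices_per_thread: int) -> int:
    """Threads occupied when each thread is filled to its cap in turn.

    Simplified model (an assumption): one device type, no per-core limits.
    """
    return math.ceil(num_devices / max_devices_per_thread)


# A 200-device graph at the default cap of 256 lands on a single thread.
small_graph_threads = tfill_threads_used(200, 256)

# Even 10,000 devices occupy only a few dozen threads at that cap.
large_graph_threads = tfill_threads_used(10_000, 256)

# The original cap of 1024 packs the same graph onto fewer threads still.
large_graph_threads_old_cap = tfill_threads_used(10_000, 1024)

print(small_graph_threads, large_graph_threads, large_graph_threads_old_cap)
```

Under this model, any graph with fewer devices than the cap is serialised onto one thread, which is consistent with the slow run-times reported above.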

If we are going to make spread the default, then we need to take care with how it interacts with the default cluster sizes and the enhancements made in #296. Basically, to get a valid and sensible spread placement, the user/operator needs to start the Orchestrator with the correct cluster size definition and load the correct hardware description file. If either of these is wrong, the outcome will be a mapping that uses either too little hardware (so runs slowly) or too much (so won't run at all).

m8pple commented 2 years ago

> If we are going to make spread the default, then we need to take care with how it interacts with the default cluster sizes and the enhancements made in #296. Basically, to get a valid and sensible spread placement, the user/operator needs to start the Orchestrator with the correct cluster size definition and load the correct hardware description file - if either of these are wrong, the outcome will either be a mapping that uses too little (so runs slow) or too much (so won't run at all) hardware.

I think that is my understanding of what tfill does: it either under-utilises the hardware, or doesn't place.

With both tfill and spread (any placement method?), users have to choose the correct hardware description file, but with tfill they also have to calculate a reasonable MaxDevicesPerThread.

heliosfa commented 2 years ago

> I think that is my understanding of what tfill does: it either under-utilises the hardware, or doesn't place.

I agree that it under-utilises for a lot of problems but it should always place successfully, unless you have too many devices... By default it does 256 devices per thread so you need a lot of devices for it to fall over.

> but with tfill they also have to calculate a reasonable MaxDevicesPerThread.

I guess it comes down to what is more desirable: a default behaviour that is faster and makes better use of the hardware but is more likely to fail, or a default that is almost guaranteed to work, albeit a bit slowly.

I do think that we should lower the default number of devices per thread for tfill though - the more we do with the Orchestrator, the more it becomes apparent that 256 devices per thread (let alone the original 1024) is a bit steep for most problems.

m8pple commented 2 years ago

> > I think that is my understanding of what tfill does: it either under-utilises the hardware, or doesn't place.

> I agree that it under-utilises for a lot of problems but it should always place successfully, unless you have too many devices... By default it does 256 devices per thread so you need a lot of devices for it to fall over.

Sorry, there was an implicit "if you calculate MaxDevicesPerThread" there. As it stands, tfill won't work for end users unless they set MaxDevicesPerThread. If they try to set MaxDevicesPerThread = ceil( Devices / Threads ) then placement is likely to fail if they have more than one device type. Or they can be conservative and go for MaxDevicesPerThread = ceil( 2 * Devices / Threads ) and probably end up with the front FPGAs overloaded and the back FPGAs idle.
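The two candidate settings can be compared with a small worked example. This is a sketch under stated assumptions: each core hosts only one device type (the one-device-type-per-core approach), every thread on a core takes up to MaxDevicesPerThread devices, and the hardware size, device counts, and function name are invented for illustration.

```python
import math


def cores_needed(devices_by_type: dict, max_devices_per_thread: int,
                 threads_per_core: int) -> int:
    """Cores required when each core may host only one device type."""
    per_core = max_devices_per_thread * threads_per_core
    return sum(math.ceil(n / per_core) for n in devices_by_type.values())


# Hypothetical problem: 2 cores of 4 threads each, two device types.
devices_by_type = {"A": 15, "B": 1}
cores, threads_per_core = 2, 4
total_devices = sum(devices_by_type.values())
total_threads = cores * threads_per_core

# Tight setting: MaxDevicesPerThread = ceil(Devices / Threads)
tight = math.ceil(total_devices / total_threads)
# Conservative setting: MaxDevicesPerThread = ceil(2 * Devices / Threads)
loose = math.ceil(2 * total_devices / total_threads)

tight_cores = cores_needed(devices_by_type, tight, threads_per_core)
loose_cores = cores_needed(devices_by_type, loose, threads_per_core)

print(f"tight cap {tight}: needs {tight_cores} of {cores} cores")
print(f"loose cap {loose}: needs {loose_cores} of {cores} cores")
```

In this example the tight cap needs three cores because type B claims a whole core for a single device, so placement does not fit on two cores. The conservative cap fits, but it concentrates the type A devices on one core.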

> I guess it comes down to what is more desirable - a default behaviour that is faster and makes better use of the hardware, but is more likely to fail. Or a default that is almost guaranteed to work, albeit a bit slowly.

I think users are likely to expect all three: reasonably fast placement (though it might scale quasi-linearly with device and thread count), decent utilisation of hardware (most threads are mostly used), and guaranteed success (if it fits, I sits). The alternative POETS run-times have always taken this approach, so in trying to attract users to switch to the Orchestrator that's really what we want to be offering.

I realise the one-device-type-per-core approach makes this a more complex problem, but from the user's point of view the Orchestrator is just another run-time, so really they just want it to work.

A possible approach would be to provide an auto-tfill setting:

1. Set MaxDevicesPerThread = ceil( Devices / Threads ).
2. Attempt placement.
3. If placement failed, double MaxDevicesPerThread and go back to step 2.
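The steps above can be sketched as a retry loop. This is a minimal illustration, not Orchestrator code: `auto_tfill` and the `try_place` callback are hypothetical names, with `try_place` standing in for whatever runs tfill at a given MaxDevicesPerThread and reports success. The stop condition (give up once the cap covers every device) is also an assumption, added so that the loop terminates.

```python
import math
from typing import Callable


def auto_tfill(num_devices: int, num_threads: int,
               try_place: Callable[[int], bool]) -> int:
    """Return the first MaxDevicesPerThread at which placement succeeds.

    Starts at ceil(Devices / Threads) and doubles after each failure.
    """
    cap = max(1, math.ceil(num_devices / num_threads))
    while True:
        if try_place(cap):
            return cap
        if cap >= num_devices:
            # Every device would already fit on one thread; doubling
            # further cannot help, so report failure.
            raise RuntimeError("placement failed at every cap tried")
        cap *= 2


# Stand-in placer (hypothetical): succeeds once the cap reaches 4.
chosen = auto_tfill(16, 8, lambda cap: cap >= 4)
print(chosen)
```

Doubling keeps the number of placement attempts logarithmic in the device count, at the cost of overshooting the smallest workable cap by up to a factor of two.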

I'm not sure if spread does something like this automatically. So far I don't see placement failures when using it, but that might be luck.

POETSII / Orchestrator

Guidance on default placement method in user documentation #302