Open m8pple opened 2 years ago
Actually, in placement.pdf I see the comment:
To support different algorithms, selectable at run time by the Orchestrator operator. In early iterations, thread (bucket) filling and simulated annealing is sufficient
so maybe users should start with tfill and move on to sa. Or try both?
This problem was initiated by very slow run-times caused by tfill (#303), and at least at the moment spread is drastically better than tfill:
I think `spread` is more generally useful. The `tfill` default is a holdover from when we had bucket filling as the default (and we didn't want to break existing scripts). I'm quite happy to change this, if you think it's wise.
`tfill` being the default is definitely a holdover from when it was the only placement method, and it hangs onto the idea that a thread hosts a set number of devices (initially 1024). It can be useful for getting compact placements if properly constrained in the number of devices per thread and the number of threads per core. Unless you have a very large problem that would be run at 256 devices/thread, it is going to be slow.
If we are going to make `spread` the default, then we need to take care with how it interacts with the default cluster sizes and the enhancements made in #296. Basically, to get a valid and sensible `spread` placement, the user/operator needs to start the Orchestrator with the correct cluster size definition and load the correct hardware description file - if either of these are wrong, the outcome will either be a mapping that uses too little (so runs slow) or too much (so won't run at all) hardware.
> If we are going to make `spread` the default, then we need to take care with how it interacts with the default cluster sizes and the enhancements made in #296. Basically, to get a valid and sensible `spread` placement, the user/operator needs to start the Orchestrator with the correct cluster size definition and load the correct hardware description file - if either of these are wrong, the outcome will either be a mapping that uses too little (so runs slow) or too much (so won't run at all) hardware.
I think that is my understanding of what tfill does: it either under-utilises the hardware, or doesn't place.
For both tfill and spread (any placement method?) they have to choose the correct hardware description file, but with tfill they also have to calculate a reasonable MaxDevicesPerThread.
> I think that is my understanding of what tfill does: it either under-utilises the hardware, or doesn't place.
I agree that it under-utilises for a lot of problems but it should always place successfully, unless you have too many devices... By default it does 256 devices per thread so you need a lot of devices for it to fall over.
> but with tfill they also have to calculate a reasonable MaxDevicesPerThread.

I guess it comes down to what is more desirable - a default behaviour that is faster and makes better use of the hardware, but is more likely to fail. Or a default that is almost guaranteed to work, albeit a bit slowly.
I do think that we should lower the default number of devices per thread for tfill though - the more we do with the Orchestrator, the more it becomes apparent that 256 devices per thread (let alone the original 1024) is a bit steep for most problems.
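To put rough numbers on the devices-per-thread point, here is a minimal Python sketch of thread filling as described in this thread: each thread is filled up to the limit before the next one is used. It is a simplified model, not the Orchestrator's implementation; it ignores device types and cores, and the function names and the `total_threads` value in the usage note are hypothetical.

```python
def threads_used(num_devices: int, max_devices_per_thread: int) -> int:
    """Threads occupied when every thread is filled to its limit in turn.

    Simplified model (assumption): no device-type or per-core constraints.
    """
    if num_devices < 0 or max_devices_per_thread <= 0:
        raise ValueError("need num_devices >= 0 and max_devices_per_thread > 0")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-num_devices // max_devices_per_thread)


def utilisation(num_devices: int, max_devices_per_thread: int, total_threads: int):
    """Fraction of hardware threads used, or None if the devices do not fit."""
    used = threads_used(num_devices, max_devices_per_thread)
    if used > total_threads:
        return None  # more threads needed than the hardware provides
    return used / total_threads
```

For example, with a hypothetical 1024 hardware threads, 10,000 devices occupy 40 threads at 256 devices per thread and only 10 threads at 1024 devices per thread, so most of the hardware sits idle in both cases. Lowering the limit spreads the same devices over more threads.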
> > I think that is my understanding of what tfill does: it either under-utilises the hardware, or doesn't place.
>
> I agree that it under-utilises for a lot of problems but it should always place successfully, unless you have too many devices... By default it does 256 devices per thread so you need a lot of devices for it to fall over.
Sorry, there was an implicit "if you calculate MaxDevicesPerThread" there. As it stands, tfill won't work for end users unless they set MaxDevicesPerThread. If they try to set `MaxDevicesPerThread = ceil( Devices / Threads )` then placement is likely to fail if they have more than one device type. Or they can be conservative and go for `MaxDevicesPerThread = ceil( 2 * Devices / Threads )` and probably end up with the front FPGAs overloaded and the back FPGAs idle.
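The two estimates above can be compared with a small Python sketch. It assumes a simplified model of the one-device-type-per-core rule mentioned in this thread: each core hosts threads of a single device type, and threads are filled to the limit in turn. The function names, the device counts, and the hardware sizes in the usage note are all hypothetical, and the real tfill behaviour may differ.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def max_devices_per_thread(total_devices: int, total_threads: int, factor: int = 1) -> int:
    """ceil(factor * Devices / Threads); factor=2 gives the conservative estimate."""
    return ceil_div(factor * total_devices, total_threads)


def cores_needed(device_counts: dict, limit: int, threads_per_core: int) -> int:
    """Cores required if each core may only hold one device type (assumed model)."""
    total = 0
    for count in device_counts.values():
        threads = ceil_div(count, limit)              # threads for this type
        total += ceil_div(threads, threads_per_core)  # whole cores for this type
    return total


def fits(device_counts: dict, limit: int, threads_per_core: int, num_cores: int) -> bool:
    """True if the placement fits on the available cores under this model."""
    return cores_needed(device_counts, limit, threads_per_core) <= num_cores
```

As a usage example, take 1000 devices of one type and 10 of another on a hypothetical 4 cores with 16 threads each (64 threads). The tight estimate is `ceil(1010 / 64) = 16`, which needs 4 cores for the first type plus 1 for the second, so it does not fit. The conservative estimate is `ceil(2 * 1010 / 64) = 32`, which needs 2 + 1 = 3 cores, so it fits but leaves one core unused.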
> I guess it comes down to what is more desirable - a default behaviour that is faster and makes better use of the hardware, but is more likely to fail. Or a default that is almost guaranteed to work, albeit a bit slowly.

I think users are likely to expect all three: reasonably fast placement (though it might scale quasi-linearly with device and thread count), decent utilisation of hardware (most threads are mostly used), and guaranteed success (if it fits, I sits). The alternative POETS run-times have always taken this approach, so in trying to attract users to switch to the Orchestrator that's really what we want to be offering.
I realise the one-device-type-per-core approach makes this a more complex problem, but from the user's point of view the Orchestrator is just another run-time, so really they just want it to work.
A possible approach would be to provide an auto-tfill setting:
I'm not sure if spread does something like this automatically. So far I don't see placement failures when using it, but that might be luck.
Documentation read before-hand:
The default placement method in the user guide is "tfill", and there are a number of aliases to that as well so it seems like a good suggested default. I'm not sure what a good default placement method would be, but "tfill" seems like a bad default choice given the hyper-threading per core in tinsel.
If there are concerns about the run-time cost of "sa", would "random" be a better default, as it might cause more even loading of cores and DRAM channels? Or possibly "spread" rather than "tfill" (although spread is also thread-based, so not sure).
If "tfill" is considered the best default at the moment it would be useful to have a note saying "recommended" or something in section 5.9 of the user guide to indicate that it is a good default choice, as "sa" seems more attractive but is then not recommended.