Open rowleya opened 6 years ago
how would you detect a failure in this case?
OK, so when you power on a board, a number of things can happen:
We can detect these things quite easily during a power-on command. A periodic test would therefore have to power the boards on periodically.
ok, so we're not thinking sdrams, new dead cores etc?
No - to be clear, this is just to stop repeated attempts to allocate the same board from producing the same server error every time (as would currently happen if there is a board error). As spalloc-server only talks to the BMP, that is enough. This can detect transient errors, i.e. things that can be fixed by manual intervention (either someone pressing the reset button or re-flashing a board).
An extension of this is to run Luis's tests periodically as well to ensure the boards are tested, but I don't believe this is necessary.
It should cover anything where the spalloc server should not allocate the board again until at least a manual check.
Currently, if a board is failing, it will keep being assigned to other users, where it may continue to be a source of failure. The server should keep a "blacklist" of boards which should be avoided. How the blacklist is updated is open to debate, e.g. it could count failures and blacklist a board once a threshold is reached, or it could periodically probe the boards (or possibly a combination of these).
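
To make the two update strategies concrete, here is a minimal sketch of how they might be combined. It is not spalloc-server code: `BoardBlacklist`, its methods, the `threshold` default and the `power_on` callback are all hypothetical names chosen for illustration. It assumes that failures are counted consecutively (a successful power-on resets the count) and that a blacklisted board stays blacklisted until someone clears it after a manual check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Iterable, Set


@dataclass
class BoardBlacklist:
    """Hypothetical blacklist keyed by any hashable board identifier."""

    # Assumed policy: blacklist after this many consecutive failures.
    threshold: int = 3
    _failures: Dict[Hashable, int] = field(default_factory=dict)
    _blacklisted: Set[Hashable] = field(default_factory=set)

    def record_failure(self, board: Hashable) -> bool:
        """Count a failed power-on; return True if the board is now blacklisted."""
        count = self._failures.get(board, 0) + 1
        self._failures[board] = count
        if count >= self.threshold:
            self._blacklisted.add(board)
        return board in self._blacklisted

    def record_success(self, board: Hashable) -> None:
        """A successful power-on resets the consecutive-failure count."""
        self._failures.pop(board, None)

    def clear(self, board: Hashable) -> None:
        """Re-enable a board, e.g. after a manual reset or re-flash."""
        self._failures.pop(board, None)
        self._blacklisted.discard(board)

    def is_allowed(self, board: Hashable) -> bool:
        """True if the allocator may hand this board out."""
        return board not in self._blacklisted

    def probe(self, boards: Iterable[Hashable],
              power_on: Callable[[Hashable], bool]) -> None:
        """Periodic check: try to power on each non-blacklisted board.

        `power_on` is a stand-in for whatever issues the real power-on
        command; it should return True on success.
        """
        for board in boards:
            if board in self._blacklisted:
                continue  # leave it alone until a manual check clears it
            if power_on(board):
                self.record_success(board)
            else:
                self.record_failure(board)
```

The allocator would call `is_allowed` before handing out a board, `record_failure` or `record_success` after each power-on it performs for a job, and `probe` from a timer for boards that are otherwise idle. Whether a success should reset the count, and what the threshold should be, are exactly the points left open above.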