Keep track of failing boards and don't use them to allocate

SpiNNakerManchester / spalloc_server

A SpiNNaker machine allocation and partitioning application.

Keep track of failing boards and don't use them to allocate #26

Open rowleya opened 6 years ago

rowleya commented 6 years ago

Currently, if a board is failing, it keeps being assigned to other users, where it may continue to cause failures. The server should keep a "blacklist" of boards to avoid. How the blacklist is updated is open to debate: it could count failures and blacklist a board after a threshold, it could periodically probe the boards, or it could combine the two.
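To make the threshold option concrete, here is a minimal sketch of a failure-counting blacklist. It is not taken from spalloc_server: the class name `BoardBlacklist`, its methods, and the default threshold of 3 are all assumptions for illustration, and boards are identified by any hashable key (a `(cabinet, frame, board)` tuple is used in the usage note).

```python
from collections import Counter


class BoardBlacklist:
    """Hypothetical helper: blacklist a board after repeated failures."""

    def __init__(self, threshold=3):
        # Number of recorded failures at which a board is blacklisted
        # (the value 3 is an arbitrary illustrative default).
        self.threshold = threshold
        self._failures = Counter()
        self._blacklisted = set()

    def record_failure(self, board):
        """Count one failure; blacklist the board once the threshold is hit."""
        self._failures[board] += 1
        if self._failures[board] >= self.threshold:
            self._blacklisted.add(board)

    def record_success(self, board):
        """Reset the failure count; an existing blacklisting is left in place."""
        self._failures.pop(board, None)

    def is_blacklisted(self, board):
        return board in self._blacklisted
```

For example, with `BoardBlacklist(threshold=2)`, calling `record_failure((0, 0, 3))` twice would blacklist that board. In this sketch a later success does not lift the blacklisting; whether it should is one of the open design questions.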

alan-stokes commented 6 years ago

How would you detect a failure in this case?

rowleya commented 6 years ago

OK, so when you power on a board, a number of things can happen:

The board doesn't respond - this indicates a BMP failure.
The board doesn't report the correct FPGA IDs - this could mean that the flash of the board is broken.

We can detect these things quite easily during a power-on command. A periodic test would therefore have to power boards on periodically.
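The two failure cases above could be distinguished roughly as follows. This is only a sketch: `classify_power_on`, the `PowerOnResult` enum, and the idea that the caller already knows whether the BMP responded and which FPGA IDs were read back are all assumptions, not part of the existing spalloc_server code.

```python
from enum import Enum


class PowerOnResult(Enum):
    """Hypothetical outcomes of a power-on attempt."""
    OK = "ok"
    BMP_FAILURE = "bmp_failure"
    BAD_FPGA_IDS = "bad_fpga_ids"


def classify_power_on(responded, fpga_ids, expected_fpga_ids):
    """Classify a power-on attempt from what the BMP reported.

    responded: whether the BMP answered the power-on command at all.
    fpga_ids: the IDs read back from the board (ignored if no response).
    expected_fpga_ids: the IDs a healthy board should report.
    """
    if not responded:
        # No answer at all points at the BMP itself.
        return PowerOnResult.BMP_FAILURE
    if tuple(fpga_ids) != tuple(expected_fpga_ids):
        # Wrong IDs suggest the board's flash may be broken.
        return PowerOnResult.BAD_FPGA_IDS
    return PowerOnResult.OK
```

Any result other than `PowerOnResult.OK` would then count as a failure for the purposes of the blacklist.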

alan-stokes commented 6 years ago

OK, so we're not thinking about SDRAM faults, new dead cores, etc.?

rowleya commented 6 years ago

No - to be clear, this is just to stop repeated attempts to allocate the same board from producing the same server error each time (as currently happens when there is a board error). As spalloc-server only talks to the BMP, that is enough. This can detect transient errors, i.e. things that can be fixed by manual intervention (either someone pressing the reset button or re-flashing a board).

An extension of this would be to run Luis's tests periodically as well, to ensure the boards are tested, but I don't believe this is necessary.

Christian-B commented 6 years ago

It should cover anything where the spalloc server should not allocate the board again until at least a manual check.
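One way to read "until at least a manual check" is that blacklisted boards are skipped by the allocator and only an explicit operator action makes them available again. The sketch below illustrates that reading; `ManualClearBlacklist` and its methods are hypothetical names, not an existing spalloc_server API.

```python
class ManualClearBlacklist:
    """Hypothetical blacklist that only an explicit manual check can clear."""

    def __init__(self):
        # Maps each blacklisted board to the reason it was blacklisted.
        self._reasons = {}

    def add(self, board, reason):
        """Blacklist a board, remembering why for the operator."""
        self._reasons[board] = reason

    def clear_after_manual_check(self, board):
        """Un-blacklist a board; returns True if it was blacklisted."""
        return self._reasons.pop(board, None) is not None

    def usable(self, candidates):
        """Filter candidate boards, dropping any that are blacklisted."""
        return [b for b in candidates if b not in self._reasons]
```

An allocator would pass its candidate boards through `usable()` before choosing, so a board added with, say, `add((0, 0, 3), "no BMP response")` is skipped until someone calls `clear_after_manual_check((0, 0, 3))`.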