SpiNNakerManchester / JavaSpiNNaker

Implementation of the SpiNNaker host software in Java
Apache License 2.0

Job hangs in QUEUED when BMP fails #1155

Open Christian-B opened 2 months ago

Christian-B commented 2 months ago

We had boards/cabinets where the BMP command failed.

Jobs get allocated to these boards, but the BMP request then fails with BMPSendTimedOutException (see log).

The job then hangs in QUEUED.

Found in /home/spalloc/spalloc.log on https://spinnaker.cs.man.ac.uk/

2024-05-05 07:11:33.787 INFO 1176 --- [ThreadPoolTaskScheduler16] u.a.m.s.a.a.AllocatorTask : Job 452535 changes resulted in errors.
2024-05-05 07:11:36.799 ERROR 1176 --- [ThreadPoolTaskScheduler-8] u.a.m.s.a.b.BMPController : Requests failed on BMP 357
uk.ac.manchester.spinnaker.transceiver.ProcessException: when sending to 0:0:13, received exception: uk.ac.manchester.spinnaker.transceiver.BMPSendTimedOutException with message: Operation CMD_VER (GetBMPVersion(command=CMD_VER, sequence=51774, argument1=0, argument2=0, argument3=0)) timed out after 0.750000 seconds
    at uk.ac.manchester.spinnaker.transceiver.ProcessException.makeInstance(ProcessException.java:116) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
    at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess$RequestPipeline.finish(BMPCommandProcess.java:464) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
    at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess.call(BMPCommandProcess.java:164) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
    at uk.ac.manchester.spinnaker.transceiver.Transceiver.get(Transceiver.java:1725) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
    at uk.ac.manchester.spinnaker.transceiver.Transceiver.readBMPVersion(Transceiver.java:1839) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
    at uk.ac.manchester.spinnaker.transceiver.BMPTransceiverInterface.readBMPVersion(BMPTransceiverInterface.java:859) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.SpiNNaker1.canBoardManageFPGAs(SpiNNaker1.java:212) ~[classes!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.SpiNNaker1.setLinkOff(SpiNNaker1.java:228) ~[classes!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$PowerRequest.changeBoardPowerState(BMPController.java:502) ~[classes!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$PowerRequest.lambda$tryProcessRequest$10(BMPController.java:621) ~[classes!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$Request.bmpAction(BMPController.java:279) ~[classes!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$PowerRequest.tryProcessRequest(BMPController.java:620) ~[classes!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$Request.processRequest(BMPController.java:384) ~[classes!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.BMPController$Worker.run(BMPController.java:1079) ~[classes!/:?]
    at uk.ac.manchester.spinnaker.alloc.bmp.BMPController.lambda$triggerSearch$4(BMPController.java:226) ~[classes!/:?]
    at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.3.30.jar!/:5.3.30]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: uk.ac.manchester.spinnaker.transceiver.BMPSendTimedOutException: Operation CMD_VER (GetBMPVersion(command=CMD_VER, sequence=51774, argument1=0, argument2=0, argument3=0)) timed out after 0.750000 seconds
    at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess$RequestPipeline.resend(BMPCommandProcess.java:530) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
    at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess$RequestPipeline.handleReceiveTimeout(BMPCommandProcess.java:515) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]
    at uk.ac.manchester.spinnaker.transceiver.BMPCommandProcess$RequestPipeline.finish(BMPCommandProcess.java:456) ~[SpiNNaker-comms-7.1.0-SNAPSHOT.jar!/:?]

rowleya commented 2 months ago

Note that the job appears to be "stuck in queued" because, after the failure, the same board is tried again, and this repeats. Ideally, a board that is attempted and fails would be marked as having been allocated, to avoid this repetition. Even better, the board would be disabled after a number of failures, and an admin emailed to evaluate it.
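
To make the "disable after a number of failures" idea concrete, here is a rough sketch of the bookkeeping it would need. Everything in it is hypothetical: `BoardFailureTracker`, its method names, the integer board ids and the threshold are assumptions for illustration, not existing JavaSpiNNaker classes, and the real allocator would presumably keep this state in its database rather than in memory.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper (not part of JavaSpiNNaker): counts consecutive BMP
 * failures per board and disables a board once a threshold is reached.
 */
final class BoardFailureTracker {
    /** Assumed threshold; the right value would need discussion. */
    private final int maxFailures;

    private final Map<Integer, Integer> failures = new HashMap<>();

    private final Set<Integer> disabled = new HashSet<>();

    BoardFailureTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    /**
     * Record a failed BMP request for a board.
     *
     * @return true only when this failure pushed the board over the
     *         threshold, i.e. the moment to notify an admin.
     */
    synchronized boolean recordFailure(int boardId) {
        if (disabled.contains(boardId)) {
            return false; // already disabled; do not notify again
        }
        int count = failures.merge(boardId, 1, Integer::sum);
        if (count >= maxFailures) {
            disabled.add(boardId);
            failures.remove(boardId);
            return true;
        }
        return false;
    }

    /** A successful request resets the consecutive-failure count. */
    synchronized void recordSuccess(int boardId) {
        failures.remove(boardId);
    }

    /** The allocator would skip boards for which this returns false. */
    synchronized boolean isAllocatable(int boardId) {
        return !disabled.contains(boardId);
    }

    /** For an admin to call after the board has been checked. */
    synchronized void reenable(int boardId) {
        disabled.remove(boardId);
    }
}
```

The caller would invoke `recordFailure` wherever the BMP request failure is currently caught, and send the admin email only when it returns `true`, so a board that keeps failing produces one notification rather than one per retry. Sending the email and wiring this into the allocator's board selection are left out on purpose, since they depend on code not shown in this issue.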