SpiNNakerManchester / MarkovChainMonteCarlo

Markov Chain Monte Carlo Simulations on SpiNNaker

Coordinator fails to send messages within certain board configurations with spalloc #10

Closed andrewgait closed 4 years ago

andrewgait commented 4 years ago

Submitting a job with n_boards=24 and n_samples=1000 to spalloc currently gives me the machine with IP address 10.11.198.1. When I run on this machine, the coordinator at (28,32,1) fails to send its data correctly. Because I currently have an io_printf for every (re)try of the send, this core eventually hits a run-time exception (RTE):

2019-11-19 10:16:11 ERROR: 28, 32, 1: RUN_TIME_EXCEPTION (IOBUF) mcmc_coordinato
2019-11-19 10:16:11 ERROR: r0=0x0000000E r1=0x00000001 r2=0x00004018 r3=0x00000000
2019-11-19 10:16:11 ERROR: r4=0xE5007F00 r5=0x0040001C r6=0x67FFB760 r7=0x000AB072
2019-11-19 10:16:11 ERROR: PSR=0x0000001F SR=0x0040FBB8 LR=0x000012F7
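When comparing dumps like the one above across runs, it can help to pull the register values into a dictionary. The following is a minimal Python sketch, not part of any SpiNNaker tool: the function name `parse_rte_registers` is hypothetical, and it assumes only the `name=0xVALUE` layout visible in the log.

```python
import re

# Matches register fields such as "r0=0x0000000E" or "PSR=0x0000001F".
_REGISTER = re.compile(r"\b(r\d+|PSR|SR|LR)=0x([0-9A-Fa-f]+)")


def parse_rte_registers(log_text):
    """Return {register name: integer value} for every field found."""
    return {name: int(value, 16) for name, value in _REGISTER.findall(log_text)}


# The register lines from the dump above, joined into one string.
log = (
    "2019-11-19 10:16:11 ERROR: r0=0x0000000E r1=0x00000001 "
    "r2=0x00004018 r3=0x00000000 "
    "2019-11-19 10:16:11 ERROR: r4=0xE5007F00 r5=0x0040001C "
    "r6=0x67FFB760 r7=0x000AB072 "
    "2019-11-19 10:16:11 ERROR: PSR=0x0000001F SR=0x0040FBB8 LR=0x000012F7"
)
registers = parse_rte_registers(log)
print(hex(registers["LR"]))  # link register value at the time of the RTE
```

The regular expression works whether the dump arrives on one line or several, since it only looks for the individual `name=0x...` fields.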

The output on the core shows that some of the data gets sent on the first try, but then it gets stuck afterwards:

send_timeout 10 data_size 50 window_size 1024
timed_out 1, sending again
timed_out, data_size 40
send_timeout 20 data_size 40 window_size 1024
timed_out 1, sending again
timed_out, data_size 40
send_timeout 30 data_size 40 window_size 1024
timed_out 1, sending again
timed_out, data_size 40
send_timeout 40 data_size 40 window_size 1024
timed_out 1, sending again
timed_out, data_size 40
send_timeout 50 data_size 40 window_size 1024
timed_out 1, sending again
timed_out, data_size 40
... ... etc etc

In this instance, every other coordinator vertex (core 1 on an "Ethernet chip") manages to send its data on the first try. The retry process itself does not appear to be at fault, though, as I have seen it work correctly in other scenarios.
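The pattern in the core output above (a send_timeout that grows by 10 on each attempt while data_size stays at 40) can be modelled with a small Python sketch. This is not the mcmc_coordinator source: the function `simulate_sends`, its parameters, and the assumption that the timeout grows linearly are all inferred from the log and are for illustration only.

```python
def simulate_sends(data_size, acked_per_attempt, timeout_step=10, max_attempts=5):
    """Model a sender that retries unacknowledged items with a growing timeout.

    acked_per_attempt[i] is how many items the receiver acknowledges on
    attempt i (hypothetical input; missing entries count as 0).
    Returns a list of (send_timeout, data_size_before_attempt) pairs.
    """
    history = []
    timeout = 0
    for attempt in range(max_attempts):
        if data_size == 0:
            break  # everything was acknowledged, nothing left to retry
        timeout += timeout_step
        history.append((timeout, data_size))
        acked = acked_per_attempt[attempt] if attempt < len(acked_per_attempt) else 0
        data_size -= min(acked, data_size)
    return history


# A healthy link: all 50 items are acknowledged on the first try.
healthy = simulate_sends(50, [50])
# A link that delivers 10 items and then stops, as in the output above.
stuck = simulate_sends(50, [10])
print(healthy)
print(stuck)
```

In the second case data_size never drops below 40 however long the sender keeps retrying, which is what one would expect if the fault lies outside the retry logic.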

This may be related to the issues that @rowleya is currently investigating in scamp, but I'll keep looking at it when I have time.

andrewgait commented 4 years ago

A closer look showed that the problem was actually with the chip (or its router) at (28,32) on machine 10.11.198.1. Once this chip was blacklisted, the problem went away.