Submitting a job with n_boards=24 n_samples=1000 to spalloc currently gives me the machine with IP address 10.11.198.1. When the job runs, the coordinator at (28,32,1) fails to send its data correctly. As I currently call io_printf every time it (re)tries to send data, this core eventually RTEs.
In this instance, every other coordinator vertex (core 1 on an "ethernet chip") manages to send its data on the first try. However, the retry process itself does not appear to be at fault, as I have seen it work correctly in other scenarios.
It's possible that this is related to the issues that @rowleya is currently investigating within scamp, but I'll keep looking at it when I have time.
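The retry-and-log pattern described above could be bounded along the following lines. This is only a minimal sketch of the policy, modelled in Python for brevity rather than in the on-core C code; `send_with_retries`, `try_send`, `max_retries` and `log_every` are hypothetical names, not part of the actual application, and whether per-retry logging is what leads to the RTE is not established here.

```python
from typing import Callable, List, Tuple


def send_with_retries(
    try_send: Callable[[int], bool],
    max_retries: int = 8,
    log_every: int = 4,
) -> Tuple[bool, int, List[str]]:
    """Call try_send(attempt) until it succeeds or the retries run out.

    try_send is a hypothetical stand-in for the coordinator's send routine;
    it receives the zero-based attempt number and returns True on success.
    Returns (succeeded, attempts_made, log_lines).
    """
    log: List[str] = []
    for attempt in range(max_retries + 1):
        if try_send(attempt):
            return True, attempt + 1, log
        # Log the first failure and then only every log_every-th failure,
        # so a stuck sender does not emit one message per retry.
        if attempt % log_every == 0:
            log.append(f"send failed, attempt {attempt + 1}")
    # Give up explicitly instead of retrying forever.
    log.append(f"giving up after {max_retries + 1} attempts")
    return False, max_retries + 1, log
```

With a sender that never succeeds, this makes nine attempts with the default settings and records four log lines in total, rather than one per retry.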
A more detailed look showed that the problem was actually with the chip (or its router) at (28,32) on machine 10.11.198.1. Once this chip was blacklisted, the problem went away.
The output on the core at (28,32,1) showed that some of the data was sent on the first try, but the core then got stuck.