Last week, just out of curiosity, I decided to try the NaCl client on some of my high-thread-count systems: 88, 112 and 128 threads. I already had it working at the minimum level on 48-thread hosts and had also checked that 72 threads works OK. All of them run Linux (Ubuntu).
They started with 9xxx units, which crunched fine, so I left the systems working. In the evening I found the clients stopped with a "fatal error" or similar message. I restarted them and noticed that whenever the unit was a 142xx the client immediately asked for another one. It looked like a download problem to me, and I did not realize that it kept going that way until a 9xxx was downloaded and crunched normally. I now think they were erroring tons of units until a 9xxx arrived or the client stopped. I'm afraid this lasted maybe two or three days.
Then I noticed that I no longer received the bonus for the GPU WUs or for the several SMP ones I tried. That caught my attention :) so I investigated a bit more, learned about the NaCl console, and was able to reproduce the error and copy the message:
"DEBUG: There is no domain decomposition for 96 nodes that is compatible with the given box and a minimum cell size of 1.37225 nm
Change the number of nodes or mdrun option -rcon or -dds or your LINCS settings"
The figure of 96 appears on the 128-thread system; it is 84 on the 112-thread system and 72 on the 88-thread system.
So the purpose of this long post (sorry :() is to report the error (it could be just on my side) and also to ask for the bonus issue to be checked: to drop below the 80% ratio of non-errored units I would have had to send some thousands of bad units, which seems like a lot.
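To put a rough number on that, here is a small sketch of the arithmetic. It assumes the ratio is simply good / (good + bad) over whatever units are counted; I do not know how the servers actually compute it, and the good-unit counts below are made-up examples, not my real totals. The function name is just something I picked for illustration.

```python
from fractions import Fraction


def bad_units_to_drop_below(good_units, threshold=Fraction(80, 100)):
    """Smallest number of errored units that pushes
    good / (good + bad) strictly below `threshold`.

    Assumption: the ratio is a plain good/(good+bad) fraction.
    """
    # good/(good+bad) < t   <=>   bad > good * (1 - t) / t
    limit = Fraction(good_units) * (1 - threshold) / threshold
    # Smallest integer strictly greater than `limit`.
    return int(limit) + 1


# Hypothetical counts of successfully returned units.
for good in (1000, 4000, 10000):
    print(good, "good units ->", bad_units_to_drop_below(good), "bad units needed")
```

With an 80% threshold this works out to a bit more than one bad unit for every four good ones, so several thousand good units would indeed need well over a thousand errored ones to lose the bonus.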
Quoting a report from FoldingForum