POETSII / tinsel

Manythread RISC-V overlay for FPGA clusters

Avoids two bugs related to readyToSend modulation #107

Closed by m8pple 3 years ago

m8pple commented 3 years ago
  1. The send handler can get called when readyToSend=No
  2. The ready-to-send queue can overflow and corrupt memory

The problem occurs when a device changes readyToSend multiple times between sends, which happens in things like physics simulations where a device only wants to send while some local condition holds. Incoming packets can change whether the local condition holds, and some set of devices will often be right on the edge of the condition.

Problem 1:

// First packet arrives: the device asks to send
recv_handler(id) {
   *readyToSend=Pin(0);
}
*(sendersTop++) = id;   // thread loop pushes id onto the senders array

// Second packet arrives: the device no longer wants to send
recv_handler(id) {
   *readyToSend=No;
}
// ...but id is still on the senders array

// Thread loop later pops id and calls the send handler anyway
send_handler(id) {
   assert(*readyToSend != No);  // Fails
}

This means the send handler still gets called even though the device no longer wants to send.

Problem 2:

// First packet: the device asks to send
recv_handler(id) {
   *readyToSend=Pin(0);
}
*(sendersTop++) = id;   // thread loop pushes id onto the senders array

// Second packet: the device retracts the request
recv_handler(id) {
   *readyToSend=No;
}

// Third packet: the device asks to send again
recv_handler(id) {
   *readyToSend=Pin(0);
}
*(sendersTop++) = id;   // thread loop pushes id again
// id is now on the senders array twice

This failure mode is more subtle, and only cropped up with lots of devices per thread. Eventually the senders array overflows, and errors caused by the resulting memory corruption start popping up.

I wrapped the send queue up in some helper functions, as that makes it a bit clearer when things can enter and exit the queue. It also gives some flexibility for things like switching to a fair queue, or optimisations when there is only one device per thread.
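Roughly, the helpers look something like this (a sketch rather than the exact code in the PR; the names sendersPush, sendersPop, onSendQueue and the MaxDevices bound are illustrative):

#include <stdint.h>
#include <assert.h>

const uint32_t MaxDevices = 1024;       // assumed per-thread device limit
uint16_t senders[MaxDevices];           // pending-sender queue
uint32_t sendersTop = 0;                // number of queued device ids
uint8_t  onSendQueue[MaxDevices] = {};  // one tracking byte per device

// Enqueue a device at most once, however often its readyToSend flips.
inline void sendersPush(uint16_t id) {
  if (!onSendQueue[id]) {
    onSendQueue[id] = 1;
    assert(sendersTop < MaxDevices);    // duplicates can no longer overflow this
    senders[sendersTop++] = id;
  }
}

// Pop a candidate; the caller must still re-check readyToSend before
// calling the send handler, since a later recv may have cleared it.
inline bool sendersPop(uint16_t* id) {
  if (sendersTop == 0) return false;
  *id = senders[--sendersTop];
  onSendQueue[*id] = 0;
  return true;
}

With helpers of this shape, the tracking byte suppresses the duplicate entry from Problem 2, and the thread loop avoids Problem 1 by skipping any popped id whose readyToSend has gone back to No rather than calling the send handler.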

mn416 commented 3 years ago

Thanks, this is a nice improvement. I was aware of the limitation but thought (wrongly it seems) that it wasn't important. As you've shown, there's very little cost to doing it properly. And it's nice to abstract over the senders array like this.

mn416 commented 3 years ago

I observe a 12% performance regression on heat-grid-sync on Ayres with this change. Not immediately sure why, looking at the code...

m8pple commented 3 years ago

I noticed a perf change too, though in my case it was the difference between slightly slower (~1-3%) and broken. But yes, there is a noticeable cost.

It looks like heat-grid-sync has ~3 instructions in the receive path (load, add, store), so an extra few instructions on the surrounding RTS management could be quite noticeable even just in terms of instructions per receive. The extra tracking byte also affects the memory pressure a little, though I would be surprised if that matters here.
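For reference, the receive path in question is roughly this shape (a sketch, not the actual heat-grid-sync source; the type and field names are made up):

struct HeatMsg   { float val; };        // neighbour temperature
struct HeatState { float acc; };        // accumulator updated on each receive

inline void recv(HeatState* dev, const HeatMsg* msg) {
  dev->acc += msg->val;                 // load, add, store: the ~3 instructions
}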

I wondered about having the application communicate to the PDevice that it would never invalidate readyToSend, but this felt like a much bigger change. There are lots of things that an application could tell POLite/PDevice, and also things that the topology/mapper could tell it too (e.g. devices_per_thread={0..1,1,0..32,1..32,0..inf}), but it probably needs some kind of larger re-think. Ultimately it would move away from the idea of POLite being simple, as this is the kind of deeper stuff that the orchestrator is supposed to do.
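Purely as an illustration of the shape such a hint could take (this is not an existing POLite feature, and the flag name is invented), a compile-time promise on the device type would let the thread loop drop the re-check:

struct MyDevice {
  // Hypothetical promise: recv never changes readyToSend back to No.
  static constexpr bool NeverRetractsSend = true;
  // ... handlers as usual ...
};

template <typename Device>
inline bool mustRecheckReadyToSend() {
  return !Device::NeverRetractsSend;    // compiles to a constant
}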

The unfair POLite sender list is another example: the per-recv and per-send cost is a fair bit (ha-ha!) lower than for a fair sender list, but it can drastically reduce convergence rates for certain algorithms approaching a fixed point (e.g. relaxation problems). But if the application is correct then it will eventually converge, and it is not really the fault of the run-time fabric if it doesn't happen to choose the most optimal path. I played around with various fair sender lists for POLite, and the cost is quite high for most applications, while providing a large perf improvement for a smaller sub-set. But then a randomised sender list was even better for yet other applications: the per-send cost was high, but the app converged much faster...
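As a rough sketch of the randomised variant (not any of the implementations actually tried; the xorshift RNG and the names are assumptions), the extra per-send cost is one RNG call plus a swap-with-last:

#include <stdint.h>

uint16_t senders[1024];                 // same layout as the sketch further up
uint32_t sendersTop = 0;
uint32_t rngState = 0x12345678u;        // assumed per-thread xorshift state

inline uint32_t xorshift32() {
  rngState ^= rngState << 13;
  rngState ^= rngState >> 17;
  rngState ^= rngState << 5;
  return rngState;
}

inline bool randomPop(uint16_t* id) {
  if (sendersTop == 0) return false;
  uint32_t i = xorshift32() % sendersTop;   // pick a random queued slot
  *id = senders[i];
  senders[i] = senders[--sendersTop];       // swap-with-last, then shrink
  return true;
}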

mn416 commented 3 years ago

Interesting re the queue design space. Even more reason for abstracting its representation.