Open ferdinandhubbard981 opened 2 years ago
Currently, the broker holds the state of the whole 2D world. When workers are primed or reinitialised, it slices the world according to workSize and the worker count and distributes the slices to the workers. Each worker evolves its slice by one iteration, then sends the flipped cells back to the broker.
On turns where the workers have not been reinitialised, they build on their internal state and use the halos given by the broker.
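To make the slicing step concrete, here is a minimal sketch of how the broker could split rows between workers and pick the halo rows for each slice. The names `Slice`, `splitRows` and `halos` are made up for illustration and are not from our codebase. It assumes an even split by worker count (workSize is ignored) and a world that wraps around at the top and bottom edges.

```go
package main

// Slice describes the rows [Start, End) of the world owned by one worker.
// Hypothetical type, for illustration only.
type Slice struct {
	Start, End int
}

// splitRows divides height rows among n workers as evenly as possible.
// The first height%n workers receive one extra row. Assumes 0 < n <= height.
func splitRows(height, n int) []Slice {
	slices := make([]Slice, 0, n)
	base, extra := height/n, height%n
	start := 0
	for i := 0; i < n; i++ {
		size := base
		if i < extra {
			size++
		}
		slices = append(slices, Slice{Start: start, End: start + size})
		start += size
	}
	return slices
}

// halos returns copies of the row directly above and the row directly below
// a slice. Row indices wrap around the world edges (assumed toroidal world).
func halos(world [][]uint8, s Slice) (top, bottom []uint8) {
	h := len(world)
	top = append([]uint8(nil), world[(s.Start-1+h)%h]...)
	bottom = append([]uint8(nil), world[s.End%h]...)
	return top, bottom
}
```

For example, `splitRows(10, 3)` gives the row ranges 0-4, 4-7 and 7-10, and the broker would call `halos` once per slice before each turn that does not reprime the workers.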
You're suggesting that the halo is sent from worker -> broker -> newWorker
The readme suggests sending it directly from worker -> worker, which would obviously be better. The broker still needs to receive the world every turn, so each worker can send its slice of the world to the broker without waiting for any response. The broker then assembles the slices into a world and updates its internal state.
I will update the flowchart to try and explain what I mean, and we'll talk about it when we're ready to start implementing it (tomorrow).
What I don't like about halo exchange is that it requires the IP address of the next worker. If the next worker disconnects then it's going to error, causing the one sending to that to error - a chain reaction. The work would also need to be redistributed evenly between workers
We can probably have a branch with halo exchange and another for fault tolerance. Alternatively, we might have the workers poll for the IP address of the ones to listen to and get them from there. But this is communication overhead.
> What I don't like about halo exchange is that it requires the IP address of the next worker. If the next worker disconnects then it's going to error, causing the one sending to that to error - a chain reaction.
If a worker disconnects:
- the broker assigns a new worker to that slice
- the broker updates the IP address held by the workers neighbouring the replacement worker
- a new RPC connection is made between these workers
> The work would also need to be redistributed evenly between workers
I don't think work distribution would be affected by this. Each worker is assigned a slice of x rows, nothing has changed.
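A rough sketch of the bookkeeping the broker would need for this, assuming it keeps worker addresses in slice order. The `ring` type and its methods are hypothetical, and the actual RPC re-dial is left out.

```go
package main

// ring holds worker addresses in slice order: worker i owns slice i.
// Hypothetical type, for illustration only.
type ring struct {
	addrs []string
}

// neighbours returns the addresses of the workers above and below worker i,
// wrapping around at the ends.
func (r *ring) neighbours(i int) (above, below string) {
	n := len(r.addrs)
	return r.addrs[(i-1+n)%n], r.addrs[(i+1)%n]
}

// replace assigns newAddr to slice i and returns the indices of the workers
// that must be told about the new address. Slice boundaries are untouched,
// so no work is redistributed.
func (r *ring) replace(i int, newAddr string) []int {
	r.addrs[i] = newAddr
	n := len(r.addrs)
	if n == 1 {
		return nil // a single worker has no other neighbour to notify
	}
	above, below := (i-1+n)%n, (i+1)%n
	if above == below {
		return []int{above} // two workers: the same one is on both sides
	}
	return []int{above, below}
}
```

After `replace`, the broker would send each returned worker its updated `neighbours` pair, and those workers would open a fresh connection to the replacement (for instance with `rpc.Dial` from `net/rpc`).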
> We can probably have a branch with halo exchange and another for fault tolerance.
If we start work on halo exchange before we finish fault tolerance and steps 2 & 4, the merge conflicts would be terrible, if not impossible, and would require a lot of code rewriting. I think it's best to do the tasks in sequential order.
> Alternatively, we might have the workers poll for the IP address of the ones to listen to and get them from there. But this is communication overhead.
Are you talking about when a worker disconnects? If so, I agree, but I don't think the communication overhead would outweigh the increased execution speed of halo exchange.
Here is how I will implement it: the broker will have a goroutine constantly receiving flippedCells paired with a turn. If pairedTurn == b.turn+1 and len(flippedCellBatches) == numOfWorkers, it applies the flipped cells to the world and increments b.turn.
Meanwhile, each worker carries on processing the next turn as soon as it receives the halos from its adjacent workers.
If a worker breaks: reprime all workers and restart from b.currentWorld and b.turn.
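Roughly what I have in mind for the receiving goroutine, as a sketch rather than the final code. `Cell`, `Batch`, `Broker` and `collect` are placeholder names, cells are assumed to be stored as 0 or 1, and batches for later turns are buffered because workers can run ahead of the broker.

```go
package main

// Cell is the coordinate of a cell whose state flipped during a turn.
type Cell struct{ X, Y int }

// Batch is one worker's flipped cells, paired with the turn they belong to.
type Batch struct {
	Turn  int
	Cells []Cell
}

// Broker holds the authoritative world and the last fully applied turn.
// Placeholder type: field names are assumptions, not the real struct.
type Broker struct {
	world      [][]uint8       // assumed encoding: 0 = dead, 1 = alive
	turn       int             // last turn applied to world
	numWorkers int             // batches expected per turn
	pending    map[int][]Batch // batches received so far, keyed by turn
}

// collect receives batches until the channel is closed. A turn is applied
// only once every worker has reported its flipped cells for b.turn+1.
func (b *Broker) collect(in <-chan Batch) {
	for batch := range in {
		b.pending[batch.Turn] = append(b.pending[batch.Turn], batch)
		// Workers may be ahead, so apply every turn that is now complete.
		for b.numWorkers > 0 && len(b.pending[b.turn+1]) == b.numWorkers {
			for _, wb := range b.pending[b.turn+1] {
				for _, c := range wb.Cells {
					b.world[c.Y][c.X] ^= 1 // flip dead <-> alive
				}
			}
			delete(b.pending, b.turn+1)
			b.turn++
		}
	}
}
```

The broker would start this with `go b.collect(ch)`. Only this goroutine touches `world` and `turn` here, so anything else that reads them would need a mutex, and on reprime `pending` should be cleared so stale batches from the broken run are not applied.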
This discussion continues in https://github.com/MathsPsychopath/GameOfLife/issues/13