Actual layer-sequence prediction is difficult to make differentiable, so for now we'll just use naive shuffling: applying the layers in a fresh random order on every pass. There is research suggesting that random layer shuffling during training and inference can lead to more robust distributed architectures, which is exactly the property we need here. We can improve on this later if necessary.
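To make "naive shuffling" concrete, here is a minimal sketch of what that could look like, assuming a PyTorch-style stack of layers that all share the same input/output shape so any ordering is valid. The names (`ShuffledStack`, `hidden_dim`, `num_layers`) are illustrative placeholders, not anything from the actual codebase.

```python
import random
import torch
import torch.nn as nn

class ShuffledStack(nn.Module):
    """Illustrative stack that applies its layers in a random order each call."""

    def __init__(self, num_layers: int = 4, hidden_dim: int = 64):
        super().__init__()
        # Every layer maps hidden_dim -> hidden_dim, so any permutation composes.
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
             for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Naive shuffling: pick a new random layer order on every forward pass,
        # during both training and inference.
        order = list(range(len(self.layers)))
        random.shuffle(order)
        for i in order:
            x = self.layers[i](x)
        return x

# Usage: each forward pass sees a different layer ordering.
stack = ShuffledStack()
out = stack(torch.randn(8, 64))
```

The key constraint is that the layers have to be interchangeable (same input and output shapes); beyond that, the shuffle is just a permutation drawn per pass, with no learned or differentiable ordering involved.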