With delayed activation forwarding, we have two buffers, shift and prev_outputs, each of length stages, so we can hold 2 * stages microbatches of activations before needing additional (circular) storage.
This PR fixes this check - previously the factor of 2 was on the wrong side of the comparison.
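
As a rough illustration of the intended condition, here is a minimal sketch. The function name, its parameters, and the behaviour without delayed forwarding (a single buffer of length stages) are assumptions for illustration only, not the actual code touched by this PR.

```python
def needs_circular_storage(num_microbatches: int,
                           num_stages: int,
                           delay_activation_forwarding: bool) -> bool:
    """Return True when the microbatches exceed the buffer capacity.

    Hypothetical helper: all names here are illustrative.
    """
    if delay_activation_forwarding:
        # shift and prev_outputs each hold num_stages microbatches,
        # so the factor of 2 multiplies the number of stages.
        capacity = 2 * num_stages
    else:
        # Assumption: without the delay there is one buffer of length num_stages.
        capacity = num_stages
    # Exactly `capacity` microbatches still fit; only more than that spills over.
    return num_microbatches > capacity
```

For example, with 4 stages and delayed forwarding, 8 microbatches fit in the two buffers, while a 9th would need circular storage.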