I did a quick implementation in PyTorch but found no improvement in training efficiency compared to training with expanded blocks.
While the collapsed block saves forward time during training, the forward pass takes only 1-2% of the time spent on backpropagation, so the overall time saving is insignificant. Has anyone noticed a similar issue in the TF version?
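For reference, here is a minimal sketch of the kind of timing comparison I mean. The layer shapes and the two-stacked-linear "expanded" block are illustrative placeholders, not the exact blocks from the paper; it just splits per-iteration training time into forward and backward portions:

```python
import time
import torch
import torch.nn as nn

# Illustrative only: an "expanded" block of two linear layers with no
# nonlinearity in between, and its mathematically equivalent collapsed layer.
expanded = nn.Sequential(nn.Linear(512, 2048), nn.Linear(2048, 512))
collapsed = nn.Linear(512, 512)

x = torch.randn(256, 512)

def time_fwd_bwd(model, x, iters=100):
    # Warm-up so one-time allocations don't skew the measurement.
    for _ in range(5):
        model(x).sum().backward()
    fwd = bwd = 0.0
    for _ in range(iters):
        t0 = time.perf_counter()
        out = model(x)          # forward pass
        t1 = time.perf_counter()
        out.sum().backward()    # backward pass
        t2 = time.perf_counter()
        fwd += t1 - t0
        bwd += t2 - t1
    return fwd / iters, bwd / iters

for name, m in [("expanded", expanded), ("collapsed", collapsed)]:
    f, b = time_fwd_bwd(m, x)
    print(f"{name}: forward {f * 1e3:.3f} ms, backward {b * 1e3:.3f} ms")
```

In my runs the backward portion dominates for both variants, which is why collapsing the forward pass alone barely moves the total iteration time.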