Convolutions with stride > 2 are handled a bit differently (this can be corrected).
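A common source of this kind of mismatch (an assumption here, not stated in the notes, and assuming the second framework is Keras/TensorFlow) is the padding convention: Keras `padding='same'` pads asymmetrically for strided convolutions, while PyTorch's explicit `padding=1` pads symmetrically. A minimal sketch of one way to correct it on the Keras side, by padding explicitly and using `'valid'`:

```python
from tensorflow.keras import Input, layers

inputs = Input(shape=(32, 32, 16))
# Symmetric 1-pixel padding, as nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1) would do
x = layers.ZeroPadding2D(padding=1)(inputs)
x = layers.Conv2D(16, kernel_size=3, strides=2, padding='valid', use_bias=False)(x)
```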
Weight initialization is different (for both conv and dense layers).
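The framework defaults do differ: PyTorch's `nn.Conv2d`/`nn.Linear` use Kaiming uniform (with `a=sqrt(5)`), while Keras' `Conv2D`/`Dense` use Glorot uniform with zero biases (again assuming the second framework is Keras). A minimal sketch of re-initialising the PyTorch layers to the Keras defaults; the helper name is hypothetical:

```python
import torch.nn as nn

def keras_style_init(module):
    # Glorot/Xavier uniform weights and zero biases (the Keras defaults),
    # instead of PyTorch's default Kaiming uniform with a=sqrt(5).
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(keras_style_init)
```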
Empirically: SGD works much better for PyTorch (reaching 95.5%).
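For reference, a standard PyTorch SGD setup of the kind typically used here; the hyperparameters below are illustrative placeholders, not the ones behind the 95.5% result:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # stand-in for the actual ResNet
num_epochs = 100            # illustrative

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
```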
Batch norm seems to be implemented differently in the two frameworks: it causes a difference in the loss at training time. However, we could not verify this when comparing the batch-norm implementations in isolation.
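The defaults at least are known to differ (assuming the second framework is Keras): the epsilons are 1e-5 (PyTorch) vs 1e-3 (Keras), and the momentum arguments use opposite conventions (PyTorch's 0.1 is the weight of the new batch statistics, Keras' 0.99 is the decay of the running statistics). Since training-mode batch norm normalizes with batch statistics, the epsilon is the more plausible cause of a training-loss difference; the momentum only affects the running stats used at eval time. A sketch of matching the Keras layer to the PyTorch one:

```python
import torch.nn as nn
from tensorflow.keras import layers

# PyTorch defaults: eps=1e-5, momentum=0.1 (weight given to the new batch stats).
bn_torch = nn.BatchNorm2d(64)
# Keras defaults: epsilon=1e-3, momentum=0.99 (decay of the running stats).
# Matching bn_torch therefore means epsilon=1e-5 and momentum = 1 - 0.1 = 0.9.
bn_keras = layers.BatchNormalization(epsilon=1e-5, momentum=0.9)
```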
ResNet implementation: