when computing the gradient with respect to the parameters of a model canonized with MergeBatchNorm, the gradient was previously not computed with respect to the original parameters, since they were detached from the graph or overwritten
now, the parameters of the linear module are explicitly set to tensors which depend on the original parameters, so that the gradients are computed correctly
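a minimal sketch of the idea, not the actual canonizer code; the helper `merge_batch_norm` and the Linear/BatchNorm1d pair are only illustrative assumptions:

```python
import torch
from torch import nn

def merge_batch_norm(linear: nn.Linear, bn: nn.BatchNorm1d):
    # merged values are plain tensors computed *from* the original Parameters;
    # detaching them or wrapping them in torch.nn.Parameter would cut the graph
    scale = bn.weight / (bn.running_var + bn.eps).sqrt()
    merged_weight = linear.weight * scale[:, None]
    merged_bias = (linear.bias - bn.running_mean) * scale + bn.bias

    # bypass nn.Module.__setattr__, which refuses to replace a registered
    # Parameter with a plain tensor
    object.__setattr__(linear, 'weight', merged_weight)
    object.__setattr__(linear, 'bias', merged_bias)

linear, bn = nn.Linear(4, 4), nn.BatchNorm1d(4)
original_weight = linear.weight          # keep a handle on the original Parameter

merge_batch_norm(linear, bn)
linear(torch.randn(8, 4)).sum().backward()
print(original_weight.grad is not None)  # True: the original parameter receives a gradient
```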
set the batch-norm's eps to zero and store the old value for later restoration, which fixes the slightly different output values caused by the non-zero eps
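a minimal sketch of the eps issue; a freshly initialized BatchNorm1d in eval mode already has identity affine parameters and running statistics, which is what the merged batch norm looks like after canonization:

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(4).eval()
x = torch.randn(8, 4)

# running_mean=0, running_var=1, weight=1, bias=0: the module is an identity
# except for eps in the denominator, i.e. it computes x / sqrt(1 + eps)
print((bn(x) - x).abs().max())   # small but non-zero: the "slightly different values"

stored_eps, bn.eps = bn.eps, 0.  # store the old value and set eps to zero
print((bn(x) - x).abs().max())   # zero: the identity is now exact

bn.eps = stored_eps              # the stored value can be restored on removal
```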
ParamMod: use object.__setattr__ to overwrite the module attributes instead of assigning a new torch.nn.Parameter, so that the gradient is correctly tracked with respect to the original parameter
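a minimal sketch of the difference; the clamp modification is only a stand-in for whatever parameter modification is applied:

```python
import torch
from torch import nn

linear = nn.Linear(4, 4)
original_weight = linear.weight  # the Parameter whose gradient we care about

# previously (broken): wrapping the modified tensor in a new torch.nn.Parameter
# creates a fresh leaf, so the gradient never reaches original_weight
# linear.weight = nn.Parameter(original_weight.clamp(min=0.))

# now (fixed): bypass nn.Module.__setattr__ and assign a plain tensor that is
# computed from the original Parameter, keeping it in the autograd graph
object.__setattr__(linear, 'weight', original_weight.clamp(min=0.))

linear(torch.randn(2, 4)).sum().backward()
print(original_weight.grad is not None)  # True: gradient w.r.t. the original parameter
```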