Collecting PyTorch -> Flux migration notes

I'm in the process of moving a model from PyTorch to Flux, and I'm going to catalog the challenges I've found in migrating, perhaps for a doc page or to improve docs generally. If others wish to add anything, please do!

Weights initialization: In PyTorch, the default weight init method is kaiming_uniform aka He initialization. In Flux, the default weight init method is glorot_uniform aka Xavier initialization.

PyTorch chooses a gain for the init function based on the type of nonlinearity specified, which defaults to leaky_relu, and uses this:

a = √5 # argument passed into PyTorch's `kaiming_uniform_` function, the "negative slope" of the rectifier
gain = √(2 / (1 + a ^ 2))

Flux defaults the kaiming_uniform gain to √2, which is what PyTorch would use if relu was specified rather than leaky_relu.

To replicate the PyTorch default, kaiming_uniform(gain = √(2 / (1 + a ^ 2))) can be provided for the init keyword argument of the layer constructor.

Bias initialization: PyTorch initializes bias parameters with uniformly random values between +/- 1 / √(fan_in), where fan_in in Flux is first(nfan(filter..., cin÷groups, cout)) for Conv layers. For Dense layers, last(nfan(out, in)) instead. Flux initializes them all to zero.

Layers: In PyTorch, there are separate objects for different dimensionality (e.g. conv1d, conv2d, conv3d). In Flux, the dimensionality is specified by the tuple provided for the kernel of Conv.

In PyTorch, activation functions are inserted as separate steps in the chain, as equals to layers. In Flux, they are provided as an argument to the layer constructor.

Sequential => Chain

Linear => Dense

Upsample in Flux (via NNlib) is equivalent to align_corners=True with PyTorch's Upsample, but the default there is False. Note that this makes the gradients depend on image size.

When building a custom layer, the (::MyLayer)(input) = method is the equivalent of def forward(self, input):

Often if PyTorch has a method, Flux (many via MLUtils.jl) has the same method. e.g. unsqueeze to insert a 1-length dimension, or erf to compute the error function. Note that the inverse of unsqueeze is actually Base.dropdims.

Training: A single step in Flux is simply gradient followed by update!. In PyTorch there are more steps: the optimizer's gradients must be zeroed with .zero_grad(), then the loss is calculated, then the tensor returned from the loss function is backward-propagated with .backward(), and finally the optimizer is stepped forward with .step(). In Flux, both parts can be combined with the train! function, and can also be used to iterate over a set of paired training inputs and outputs.

In Flux, an optimizer state object is first obtained by setup, and this state is passed to the training loop. In PyTorch, the optimizer object itself is manipulated in the training loop.

FluxML / Flux.jl

Collecting PyTorch -> Flux migration notes #2410