aask1357 / hilcodec

High fidelity, lightweight, end-to-end, streaming, convolution-based neural audio codec
MIT License

Order of activation layer and weight normalization of conv layer. #4

Closed yukara-ikemiya closed 3 days ago

yukara-ikemiya commented 4 days ago

Thank you for publishing this interesting work on neural codecs. I'd like to ask the following questions regarding the implementation.

1. Do we need an activation function in SpecBlock? In the paper, it is written that an ELU activation is inserted before every convolution layer, but in the actual implementation, it seems that no activation function is used in the SpecBlock. https://github.com/aask1357/hilcodec/blob/98bd530028470d0c4519de19ebce23e66ec6556d/models/hilcodec/modules/seanet.py#L181 I'd like to know whether this is intentional, and if so, could you explain the reason for not using an activation function?

2. Order of the activation function in downsample/upsample blocks. From my understanding, the nonlinearity argument switches between the LeCun / He initializations explained in Appendix C. For He initialization, I think the activation function should be applied after the Conv layer, but in the implementation, He initialization is applied to the 1x1 Conv layer, which doesn't have a subsequent activation function here. https://github.com/aask1357/hilcodec/blob/98bd530028470d0c4519de19ebce23e66ec6556d/models/hilcodec/modules/seanet.py#L329 To me, it would be natural to implement this block in the following order:

downsample = nn.Sequential(
                scale_layer,
                SConv1d(..., nonlinearity='relu'),    # He init: this conv is followed by an activation
                act(inplace=True, **activation_params),
                SConv1d(..., nonlinearity='linear'),  # LeCun init: no activation afterwards
            )

But I might be misunderstanding something. Could you explain how this implementation with a preceding activation function works?

3. Gap between the actual activation function and He initialization. In the implementation, He initialization is applied with the nonlinearity="relu" setting, but the actual activation function is ELU rather than ReLU. Is it possible that this gap could negatively affect the learning process?

aask1357 commented 3 days ago

Before we start, we want to note that our model adopts a pre-activation ordering ("Act -> Conv") instead of a post-activation ordering ("Conv -> Act"). This is just an architectural choice; you may try the post-activation ordering instead. If you are interested, you can read "Identity Mappings in Deep Residual Networks": https://arxiv.org/pdf/1603.05027
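For illustration, here is a minimal sketch of the two orderings in PyTorch. It is not code from this repository; the channel count and kernel size are placeholders.

    import torch.nn as nn

    # Pre-activation ordering ("Act -> Conv").
    def pre_act_block(channels=64, kernel_size=3):
        return nn.Sequential(
            nn.ELU(),
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
        )

    # Post-activation ordering ("Conv -> Act").
    def post_act_block(channels=64, kernel_size=3):
        return nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.ELU(),
        )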

1. Do we need an activation function in SpecBlock? As you said, there's no activation function in the SpecBlock (STFT -> Log -> normalize -> 1x1Conv). There's no reason to insert an activation before the 1x1Conv. If our model followed a post-activation ordering, it would be natural to add an activation after the 1x1Conv in the SpecBlock.
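As a rough sketch of that pipeline (STFT -> Log -> normalize -> 1x1Conv), not the repository's actual SpecBlock, and with placeholder normalization and constants:

    import torch
    import torch.nn as nn

    class SpecBlockSketch(nn.Module):
        """Illustrative only: STFT -> Log -> normalize -> 1x1 Conv, no activation."""

        def __init__(self, n_fft=1024, hop_length=256, out_channels=64):
            super().__init__()
            self.n_fft = n_fft
            self.hop_length = hop_length
            # 1x1 conv applied directly to the normalized log-spectrogram,
            # with no activation in front of it.
            self.proj = nn.Conv1d(n_fft // 2 + 1, out_channels, kernel_size=1)

        def forward(self, x):                       # x: (batch, samples)
            window = torch.hann_window(self.n_fft, device=x.device)
            spec = torch.stft(x, self.n_fft, self.hop_length,
                              window=window, return_complex=True)
            log_mag = torch.log(spec.abs() + 1e-5)  # Log
            log_mag = (log_mag - log_mag.mean()) / (log_mag.std() + 1e-5)  # normalize (placeholder)
            return self.proj(log_mag)               # 1x1 Conv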

2. Order of activation function in downsample/upsample blocks. You're right: the nonlinearity argument switches between LeCun / He initialization. The position of the activation doesn't matter; we describe the reason below.

A pre-activation block with an input $X$, an output $Z$, and a weight matrix $W$ can be written as

(1) $$Z = WY = W\,\mathbf{ReLU}(X).$$

Let $x$, $y$, and $z$ represent the random variables of each element in $X$, $Y$, and $Z$. Assume $x \sim N(0, 1)$, $w \sim N(0, \sigma_1^2)$, $y$ and $w$ are independent, and $y^2$ and $w^2$ are independent. Then, as in the He init paper,

(2) $$\mathbf{Var}[z]=n\,\mathbf{Var}[w \cdot y] = n\big[(\mathbf{Var}[w]+\mathbf{E}[w]^2)(\mathbf{Var}[y]+\mathbf{E}[y]^2)-\mathbf{E}[w]^2\mathbf{E}[y]^2\big] = n\big[(\mathbf{Var}[w]+0^2)\,\mathbf{E}[y^2]-0^2\,\mathbf{E}[y]^2\big] = n\,\mathbf{Var}[w]\,\mathbf{E}[y^2],$$

where $n$ denotes the number of filter connections (kernel_size x input_channels). Since we use ReLU as an activation function,

(3) $$\mathbf{E}[y^2]=\frac{1}{2}\mathbf{E}[x^2]=\frac{1}{2}.$$

Therefore,

(4) $$\mathbf{Var}[z]=\frac{n}{2}\sigma_1^2.$$

If we insert another conv layer after $Z$ whose weight has zero mean and variance $\sigma_2^2$, the variance of the output will be

(5) $$\frac{n}{2}\sigma_1^2\sigma_2^2.$$
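As a sanity check, here is a quick Monte-Carlo verification of equations (3) and (4); the values of $n$ and $\sigma_1$ are arbitrary.

    import torch

    torch.manual_seed(0)
    n, sigma1 = 512, 0.05                  # fan-in and weight std (arbitrary)
    x = torch.randn(200_000, n)            # x ~ N(0, 1)
    w = torch.randn(n, 1) * sigma1         # w ~ N(0, sigma1^2)

    y = torch.relu(x)                      # y = ReLU(x)
    z = y @ w                              # z = W y  (pre-activation block)

    print("E[y^2] (expected 0.5):", y.pow(2).mean().item())
    print("Var[z] (expected n/2 * sigma1^2 = {:.4f}):".format(n / 2 * sigma1 ** 2),
          z.var().item())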

If the model is a post-activation block as follows:

(6) $$Z=\mathbf{ReLU}(WX),$$

equation (4) in this comment remains unchanged. You may insert the activation wherever you want; in our case, we just follow the pre-activation ordering.

3. Gap between the actual activation function and He initialization. We think the effect of the misalignment is negligible. As we state in Appendix C, it is hard to follow equation (6) in our paper exactly. Instead, since our model is not ultra-deep, we approximately achieve the linear variance growth, which we found to be sufficient to ensure proper signal propagation at the beginning of training.

Actually, even if we used the ReLU activation function, the variance would not exactly follow equation (6) in our paper. This is because equation (2) of this comment assumes that $\mathbf{E}[w]^2=0$. However, since the sample mean of $w$ is not exactly $0$, there is a discrepancy. For exactness, you should:

  1. Calculate the standard deviation of the output of the activation function (for ReLU on a unit-variance Gaussian input, the variance is $\frac{1}{2}(1-\frac{1}{\pi})$, so the standard deviation is its square root; for ELU, you should calculate it analytically or empirically). Then multiply either the weight, input, or output by the inverse of that standard deviation (see the sketch after this list).
  2. Make sure that the sample mean of $w$ equals zero.
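For example, a short sketch of both steps (the sample size and weight shape are arbitrary):

    import math
    import torch

    torch.manual_seed(0)
    x = torch.randn(10_000_000)             # unit-variance Gaussian input

    # Step 1: standard deviation of the activation output.
    relu_std = torch.relu(x).std().item()
    elu_std = torch.nn.functional.elu(x).std().item()
    print("ReLU std, analytic :", math.sqrt(0.5 * (1 - 1 / math.pi)))  # sqrt of Var = (1/2)(1 - 1/pi)
    print("ReLU std, empirical:", relu_std)
    print("ELU  std, empirical:", elu_std)
    # Multiplying the weight, input, or output by 1/std keeps the signal at
    # unit variance after the activation.

    # Step 2: force the sample mean of the weight to be exactly zero.
    w = torch.randn(512) * 0.05
    w = w - w.mean()
    print("sample mean of w:", w.mean().item())   # ~0 up to float precision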

If you are interested, you can read the paper https://arxiv.org/pdf/2101.08692

yukara-ikemiya commented 3 days ago

Thank you for the detailed explanation! I got the point and now understand the background of each implementation choice.