Closed: lunixbochs closed this issue 3 years ago
These are the shapes from a TDS block in wav2letter
tds input shape [ 111 60 15]
fc input shape [ 111 60 15]
Reorder (2,1,0,3)
shape [ 15 60 111]
View (900 -1 1 0)
shape [ 900 111]
Linear (900->900) (with bias)
shape [ 900 111]
ReLU
shape [ 900 111]
Dropout (0.100000)
shape [ 900 111]
Linear (900->900) (with bias)
shape [ 900 111]
View (15 60 -1 0)
shape [ 15 60 111]
Reorder (2,1,0,3)
shape [ 111 60 15]
Dropout (0.100000)
shape [ 111 60 15]
tds output shape [ 111 60 15]
vs my pytorch forward pass (on different audio)
[+] running forward pass
input: torch.Size([1, 60, 198])
[-] layer View([0, 1, 60, -1])
output: torch.Size([1, 1, 60, 198])
[-] layer ConstantPad1d(padding=[5, 3], value=0)
output: torch.Size([1, 1, 60, 206])
[-] layer Conv2d(1, 15, kernel_size=(1, 10), stride=(1, 2), padding=(1, 0))
output: torch.Size([1, 15, 62, 99])
[-] layer ReLU()
output: torch.Size([1, 15, 62, 99])
[-] layer Dropout(p=0.1, inplace=False)
output: torch.Size([1, 15, 62, 99])
[-] layer ConstantPad2d(padding=(7, 1, 0, 0), value=0)
output: torch.Size([1, 15, 62, 107])
[-] layer Conv2d(15, 15, kernel_size=(1, 9), stride=(1, 1))
output: torch.Size([1, 15, 62, 99])
[-] layer ReLU()
output: torch.Size([1, 15, 62, 99])
[-] layer Dropout(p=0.1, inplace=False)
output: torch.Size([1, 15, 62, 99])
[-] layer Reorder(1, 2, 3, 0)
output: torch.Size([15, 62, 99, 1])
[-] layer View([0, 1, -1, 900])
Traceback (most recent call last):
File "model.py", line 385, in <module>
emissions = w2l.forward(frames)
File "model.py", line 305, in forward
input = layer(input)
File "/Users/aegis/Library/Python/3.7/lib/python/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "model.py", line 21, in forward
return input.view(*shape)
RuntimeError: shape '[15, 1, -1, 900]' is invalid for input of size 92070
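A quick sanity check on why that view fails: the tensor entering `View([0, 1, -1, 900])` has 15 * 62 * 99 = 92070 elements, which is not divisible by 900. With the intended height of 60 (i.e. without the stray `padding=(1, 0)` on the earlier Conv2d inflating 60 to 62), the element count would be divisible and `-1` could be inferred:

```python
# 15 channels * 62 freq * 99 time frames, as in the traceback
assert 15 * 62 * 99 == 92070
assert 92070 % 900 != 0            # so View([0, 1, -1, 900]) cannot infer -1

# with the intended H=60, the count is divisible (60 * 15 = 900)
assert (15 * 60 * 99) % 900 == 0   # -1 would be inferred as 99
```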
Hey @lunixbochs,
- Dimension ordering:
Regarding conv GLU layers with weight norm: I guess this will be helpful for you, https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/utilities/convlm_serializer (how we imported a trained fairseq conv GLU model into the w2l bin), e.g. https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/utilities/convlm_serializer/Utils.cpp#L107 and https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/utilities/convlm_serializer/Utils.cpp#L114.
The thing you need to keep in mind is that a saved af::array is column-major. So if you have a tensor in pytorch with shape HxWxCxN in row-major ordering, then it will be loaded into arrayfire as an NxCxWxH tensor, and vice versa. Let me know if this is what you need to solve the issue.
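This axis reversal can be demonstrated with a small numpy sketch: a row-major (H, W, C, N) buffer reinterpreted in column-major order has shape (N, C, W, H), and no bytes move:

```python
import numpy as np

# A row-major (H, W, C, N) tensor...
H, W, C, N = 2, 3, 4, 5
a = np.arange(H * W * C * N).reshape(H, W, C, N)

# ...whose flat buffer is reinterpreted column-major (as arrayfire would)
# comes out with the axes reversed: shape (N, C, W, H).
b = a.ravel().reshape((N, C, W, H), order='F')

# Reversing the axes of the original recovers the same tensor.
assert np.array_equal(b, a.transpose(3, 2, 1, 0))
```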
- LayerNorm
LN 0 1 2 will compute normalization along all axes except batch. The thing is that pytorch has a different ordering of input tensors for the same operations; for example, you can implement a linear layer in two ways: y = Ax or y = xA. This is one of the differences between pytorch and flashlight. So you should check the input tensors for each operation (linear, conv, etc.). What you need to do is keep pytorch tensors but perform the same operations; the results will be the same, you just need to transpose/reorder things properly to get the same computations. For example, to apply layer norm in TDS (in pytorch I guess it is done over the last axes for efficiency), you just need to call pytorch layer norm with the input in format BxHxWxC (or any permutation of the last 3 axes) with normalization along the 3 last axes. This will be equivalent to the flashlight operation.
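The permutation-invariance claim is easy to verify: mean and variance over a fixed set of elements don't depend on the order of the axes they span, so normalizing over any permutation of the three non-batch axes gives identical results. A numpy sketch (plain normalization, no affine parameters):

```python
import numpy as np

def layer_norm(x, axes, eps=1e-5):
    # normalize over the given axes, per remaining (batch) index
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 15, 60, 7))  # (B, C, H, W)

# normalize all non-batch axes directly...
y1 = layer_norm(x, (1, 2, 3))
# ...or permute to (B, H, W, C) first, normalize, and permute back
y2 = layer_norm(x.transpose(0, 2, 3, 1), (1, 2, 3)).transpose(0, 3, 1, 2)

assert np.allclose(y1, y2)
```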
About the TDS block view operation: let me recheck this, I will come back on it soon.
So for arrayfire TDS:
tds input shape [ 111 60 15]
fc input shape [ 111 60 15]
Reorder (2,1,0,3)
shape [ 15 60 111]
View (900 -1 1 0)
shape [ 900 111]
Linear (900->900) (with bias)
shape [ 900 111]
ReLU
shape [ 900 111]
Dropout (0.100000)
shape [ 900 111]
Linear (900->900) (with bias)
shape [ 900 111]
View (15 60 -1 0)
shape [ 15 60 111]
Reorder (2,1,0,3)
shape [ 111 60 15]
Dropout (0.100000)
shape [ 111 60 15]
tds output shape [ 111 60 15]
The weights are going in as HWCN, so
H=111 W=60 C=15 N=1
Reorder (2 1 0 3) swaps height and channel
H=15 W=60 C=111 N=1
View (900 -1 1 0) makes (H=900, C=1, N=N), and W = total / (H*C*N)
H=900 W=111 C=1 N=1
Then there's a Linear 900->inner->900 block which preserves the H size.
View (15 60 -1 0) makes (H=15 W=60 N=N) and C = total / (H*W*N)
H=15 W=60 C=111 N=1
Reorder (2,1,0,3) swaps height and channel again
H=111 W=60 C=15 N=1
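The shape arithmetic above can be checked with a small helper mimicking (what I understand to be) flashlight's View semantics: 0 keeps the input dim at that position and -1 is inferred from the remaining element count. Shapes here are in arrayfire dim order (d0 d1 d2 d3):

```python
def af_view(shape, spec):
    # total number of elements must be preserved by the view
    total = 1
    for d in shape:
        total *= d
    # 0 means "keep the input dim at this position"
    out = [shape[i] if s == 0 else s for i, s in enumerate(spec)]
    # -1 is inferred from whatever elements remain
    known = 1
    for d in out:
        if d != -1:
            known *= d
    return [total // known if d == -1 else d for d in out]

# the two FC-block views from the trace above:
assert af_view([15, 60, 111, 1], [900, -1, 1, 0]) == [900, 111, 1, 1]
assert af_view([900, 111, 1, 1], [15, 60, -1, 0]) == [15, 60, 111, 1]
```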
and pytorch (NCHW) rough shape notes:
input: torch.Size([1, 15, 62, 99])
# at this point, H and W appear to be switched relative to flashlight
# also H is incorrectly 62 instead of 60 (I think due to accidental padding on the previous layer)
N=1 C=15 H=62 W=99
Reorder (1 2 3 0) # seems wrong, should be 0 2 1 3 to swap height and channel or 0 3 2 1 to swap W/C
N=1 C=99 H=62 W=15
View (0 1 -1 900) # H/W are swapped, should make (H=900, C=1, N=N), and W = total / (H*C*N)
N=1 C=1 H=900 W=~102
Linear 900 -> inner -> 900
View (0 -1 60 15) # H/W are swapped, should make (H=15, W=60, N=N), and C = total / (H*W*N)
N=1 C=99 H=60 W=15
Reorder (1, 2, 3, 0) # seems wrong, should be 0 2 1 3 to swap height and channel or 0 3 2 1 to swap W/C
N=1 C=15 H=60 W=99
Oh! Is row-major "width" in pytorch the same axis as column-major "height" in arrayfire? That would explain some of my confusion.
Ok, I got LayerNorm working, however the fully connected section of TDS is broken with my current import strategy.
The input shape into this TDS block is torch.Size([1, 15, 60, 187])
This is what a TDS block looks like:
TDS 15 9 60 0.100000 0 1 0
Sequential(
(0): Sequential(
(0): ConstantPad2d(padding=(7, 1, 0, 0), value=0)
(1): Conv2d(15, 15, kernel_size=(1, 9), stride=(1, 1))
(2): ReLU()
(3): Dropout(p=0.1, inplace=False)
)
(1): InnerLayerNorm()
(2): Sequential(
(0): Reorder(0, 3, 1, 2)
(1): View([0, 1, -1, 900])
(2): Linear(in_features=900, out_features=900, bias=True)
(3): ReLU()
(4): Dropout(p=0.1, inplace=False)
(5): Linear(in_features=900, out_features=900, bias=True)
(6): ReLU()
(7): View([0, -1, 15, 60])
(8): Reorder(0, 2, 3, 1)
(9): Dropout(p=0.1, inplace=False)
)
(3): InnerLayerNorm()
)
My activations match wav2letter until the FC block inside the TDS (layer 2 in the outer sequential shown here), after which they diverge, so I'd guess my reorder/view step is wrong.
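For reference, here is a numpy sketch of the reorder/view I would expect as the row-major equivalent of the FC block, under the assumption that an AF (T, 60, 15, N) tensor corresponds to a pytorch (N, 15, 60, T) tensor. The permute (0, 3, 2, 1) and the flatten to (N, 1, T, 900) are my guesses from the axis-reversal rule, not confirmed against flashlight:

```python
import numpy as np

# Hypothetical row-major equivalent of AF Reorder(2,1,0,3) + View(900 -1 1 0),
# assuming AF dims (T, W=60, C=15, N) map to pytorch dims (N, C=15, H=60, W=T).
N, C, H, T = 1, 15, 60, 7
x = np.arange(N * C * H * T).reshape(N, C, H, T)

# permute to (N, T, H, C), then flatten the trailing (H, C) pair into 900
y = x.transpose(0, 3, 2, 1).reshape(N, 1, T, H * C)

# ...the Linear(900 -> 900) layers would act on the last axis here...

# hypothetical inverse of AF View(15 60 -1 0) + Reorder(2,1,0,3)
z = y.reshape(N, T, H, C).transpose(0, 3, 2, 1)

# the round trip is lossless, so the view pair at least composes correctly
assert np.array_equal(x, z)
```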
I got the Linear to work by transposing the weights, but LayerNorm is definitely broken. The first couple of LayerNorms seem to match, using this as the module forward pass:
tmp = input.permute(0, 3, 2, 1)
shape = tmp.shape[2:]
tmp = F.layer_norm(tmp, shape, self.weight.expand(shape), self.bias.expand(shape), self.eps)
return tmp.permute(0, 3, 2, 1)
But after the first TDS finishes, the LayerNorm at the start of the next TDS block stops matching wav2letter. I implemented a basic LN 1 2 by hand and it doesn't seem to help:
class DimLayerNorm(nn.Module):
    def __init__(self, *dims: int, eps=1e-5):
        super().__init__()
        self.dims = tuple(dims)
        self.eps = eps
        self.weight = nn.Parameter(torch.Tensor(1))
        self.bias = nn.Parameter(torch.Tensor(1))
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.ones_(self.weight)
        nn.init.zeros_(self.bias)  # was zeros_(self.weight), which undid ones_

    def forward(self, input):
        print('ln', input.shape, self.dims, self.weight, self.bias)
        mean = input.mean(self.dims, keepdim=True)
        # biased variance, as layer norm uses; torch's var() is unbiased by default
        var = input.var(self.dims, keepdim=True, unbiased=False)
        out = (input - mean) / (var + self.eps).sqrt()
        return out * self.weight + self.bias

    def __repr__(self):
        return 'DimLayerNorm({})'.format(self.dims)
Do you have insight into what the flashlight layernorm (which calls BatchNorm and MKL internally) might be doing differently from this? Or is this close enough I'm probably failing to notice a problem somewhere else?
Ryan, could you post the sizes you have in w2l and in pytorch before and at the point they stop matching?
For example, in the sota models we do LN 0 1 2: https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/sota/2019/am_arch/am_tds_s2s.arch#L6.
Your by-hand implementation looks good to me. cc @vineelpratap to recheck this.
I've been doing more investigation; I think the issue is somewhere besides LayerNorm, since my hand-rolled version roughly matches. I suspect LayerNorm averaging the activations is spreading a divergence from somewhere in the middle out to the edges where it's more visible, which is why I thought LayerNorm was causing it. My next step is to diff the activations between w2l and pytorch layer by layer and see where they diverge, so I don't expect any help until I have that done.
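The layer-by-layer diff I have in mind is roughly this (how the activations get dumped from each runtime is left out; the helper and its threshold are my own sketch):

```python
import numpy as np

def first_divergence(acts_a, acts_b, atol=1e-4):
    # walk two parallel lists of per-layer activations and report the
    # index of the first layer whose outputs disagree (or None if all match)
    for i, (a, b) in enumerate(zip(acts_a, acts_b)):
        if a.shape != b.shape or not np.allclose(a, b, atol=atol):
            return i
    return None

# toy example: the two runs agree on layer 0 and diverge at layer 1
a = [np.zeros((2, 3)), np.ones((2, 3))]
b = [np.zeros((2, 3)), np.full((2, 3), 2.0)]
assert first_divergence(a, b) == 1
```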
@lunixbochs - Here is a PyTorch implementation of TDS that you can refer to https://gist.github.com/vineelpratap/e9c030d488c5f2b804215c547d573932
Regarding LayerNorm, there are two things you might want to take note of:
I'm working on a pytorch loader for my model format (https://github.com/facebookresearch/wav2letter/issues/718). I have conv_glu models working by trial/error, but I'm confused on TDS, which has an inner view/reorder. Any help would be appreciated! (The arch I'm working with, the first few pytorch model layers, and a forward pass are quoted above.)