Implementation Comparison on conv1d and conv2d for SM and MP-CNN

Impavidity commented 6 years ago

I think there is something we need to discuss about the implementation of the SM-CNN and MP-CNN models. For the convolution layers, these two models use nn.Conv1d and pass word_dim as the first argument. However, according to the API, the first argument is the number of input channels. In these two models, this argument should be set to 1, and nn.Conv2d should be used instead.

Take the SM-CNN model as an example. The printed model is as follows:

QAModel (
  (conv_q): Sequential (
    (0): Conv1d(50, 100, kernel_size=(5,), stride=(1,), padding=(4,))
    (1): Tanh ()
  )
  (conv_a): Sequential (
    (0): Conv1d(50, 100, kernel_size=(5,), stride=(1,), padding=(4,))
    (1): Tanh ()
  )
  (combined_feature_vector): Linear (204 -> 204)
  (combined_features_activation): Tanh ()
  (dropout): Dropout (p = 0.5)
  (hidden): Linear (204 -> 2)
  (logsoftmax): LogSoftmax ()
)

In the forward pass, the input has the following size:

input size: torch.Size([1, 50, 20])

The first dimension is the batch size, the second is the word dimension, and the third is the sentence length. The tensor is transposed when it is constructed and is then passed to the Conv1d declared above.

Now, let's look at the example in the PyTorch documentation.

import torch
import torch.nn as nn
from torch import autograd

m = nn.Conv1d(16, 33, 3, stride=2)
input = autograd.Variable(torch.randn(20, 16, 50))
output = m(input)

API says:

Input: (N,C_in,L_in)
Output: (N,C_out,L_out)

In the input format, N is the batch size, C_in is the number of input channels, and L_in is the length. In the output format, N is the batch size, C_out is the number of output channels, and L_out is calculated from L_in, the kernel size, the padding, and the stride.
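As a rough check of how L_out follows from those parameters, here is a minimal sketch in plain Python (no PyTorch needed). The helper name conv1d_output_length is made up for illustration; the formula is the one given in the PyTorch Conv1d documentation.

```python
def conv1d_output_length(l_in, kernel_size, stride=1, padding=0, dilation=1):
    """Length of the last dimension after a 1D convolution (floor division)."""
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Documentation example: Conv1d(16, 33, 3, stride=2) on an input of shape (20, 16, 50).
doc_shape = (20, 33, conv1d_output_length(50, kernel_size=3, stride=2))

# SM-CNN layer printed above: Conv1d(50, 100, kernel_size=5, padding=4) on (1, 50, 20).
sm_shape = (1, 100, conv1d_output_length(20, kernel_size=5, padding=4))

print(doc_shape)  # (20, 33, 24)
print(sm_shape)   # (1, 100, 24)
```

Both cases happen to give a length of 24, but for different reasons: the documentation example halves the length through stride 2, while the SM-CNN layer grows it through padding 4.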

According to this, the SM model's input shape (1, 50, 20) might not match the API: the SM model treats the number of input channels as 50, which is wrong.

After this conv1d, we get a tensor with the following shape:

after conv 1d sequential: torch.Size([1, 100, 24])

According to the output format, the batch size is 1, the number of output channels is 100, and the output length is 24, which is calculated from the input length 20, stride 1, kernel size 5, and padding 4 (20 + 2*4 - 5 + 1 = 24). After max pooling, the SM model takes the max value along the third dimension, and the result is then reshaped:

after max pool 1d: torch.Size([1, 100, 1])
after reshape: torch.Size([1, 100])
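The pooling and reshape step can be sketched in plain Python on a toy tensor. This is only an illustration of max-over-time pooling, not the model's actual code; max_over_time and the small sizes (3 channels, length 4) are stand-ins for the real (1, 100, 24) tensor.

```python
def max_over_time(feature_maps):
    """feature_maps: nested lists shaped (batch, channels, length).

    Returns nested lists shaped (batch, channels): one max per channel.
    """
    return [[max(channel) for channel in sample] for sample in feature_maps]

# Toy tensor: batch = 1, channels = 3, length = 4.
features = [[[0.1, 0.7, -0.2, 0.3],
             [-0.5, -0.1, -0.9, -0.4],
             [0.0, 0.2, 0.9, 0.6]]]

pooled = max_over_time(features)
print(pooled)  # [[0.7, -0.1, 0.9]]
```

Each output channel keeps a single value regardless of sentence length, which is why the length dimension can be dropped afterwards.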

In this case, we treat each dimension of the word embedding as a channel. But according to the original paper, the kernel size should be (width, word_dim). So my suggestion is:

nn.Conv2d(in_channels=1, out_channels=100, kernel_size=(5, words_dim), padding=(4, 0))

and the input size should be

(1, 1, 20, 50)

where batch = 1, number of input channels = 1, sentence length = 20, and word_dim = 50. After the conv layer (and squeezing the trailing dimension of size 1), we get a tensor of size

(1, 100, 24)

which is exactly the same shape as before but with a different meaning, I guess. The MP-CNN model seems to have a similar issue. Which approach is better depends on experiments and datasets.
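To double-check the shape claimed for the Conv2d variant, here is a small sketch in plain Python. The helper names are hypothetical, and stride 1 and dilation 1 are assumed, as in the suggestion above.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output size along one spatial dimension (dilation assumed to be 1)."""
    return (size + 2 * padding - (kernel - 1) - 1) // stride + 1

def conv2d_output_shape(input_shape, out_channels, kernel_size, padding=(0, 0)):
    """input_shape is (N, C_in, H, W); returns (N, C_out, H_out, W_out)."""
    n, _, h, w = input_shape
    h_out = conv_output_size(h, kernel_size[0], padding=padding[0])
    w_out = conv_output_size(w, kernel_size[1], padding=padding[1])
    return (n, out_channels, h_out, w_out)

words_dim = 50
shape = conv2d_output_shape((1, 1, 20, words_dim), 100, (5, words_dim), padding=(4, 0))
print(shape)      # (1, 100, 24, 1)
print(shape[:3])  # (1, 100, 24) once the width dimension of size 1 is dropped
```

Because the kernel spans the full word_dim, the width collapses to 1, and the remaining dimensions match the Conv1d output.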

tuzhucheng commented 6 years ago

Great observation. I have one question: in NLP, can each dimension in the sentence embedding be referred to as a channel? If so, does this imply we can use Conv1d? However, I'm aware that many tutorials by reputable people use Conv2d for text CNNs.

In Yoav Goldberg's A Primer on Neural Network Models for Natural Language Processing he writes:

The main idea behind a convolution and pooling architecture for language tasks is to apply a non-linear (learned) function over each instantiation of a k-word sliding window over the sentence. This function (also called “filter”) transforms a window of k words into a d dimensional vector that captures important properties of the words in the window (each dimension is sometimes referred to in the literature as a “channel”)

Here are tutorials that use Conv2d:

Impavidity commented 6 years ago

As far as I know, most people follow the convention of using Conv2d for text CNNs. But if you find that Conv1d is better on many tasks (fine-tuning the parameters of these two models may be needed), that would be another good observation, and we might find something here.

rosequ commented 6 years ago

On a general note, when you say multiple channels, you have the same conv filters running over the input space. The word embeddings for each word, though individually vectors, compose the input matrix, and the convolution should be run over this input matrix rather than over the embeddings of individual words. In the current code, we convolve over individual vectors and not the matrix. Although this gives comparable numbers, this is not exactly what was done by S&M. As a simple example, in this figure the filter is 2D, but in the code we have it in 1D.

tuzhucheng commented 6 years ago

In computer vision, each image can have multiple channels (RGB as opposed to intensity). Each channel is a 2D tensor of numbers. This means each image is a 3D tensor and, combined with batching, the input must be a 4D tensor. In NLP, the word embeddings of a sentence form a 2D tensor / matrix. We can say it has 1 channel, or as many channels as the word vector dimension (if I understand Yoav's paragraph above correctly), and depending on this we can have a 4D or 3D tensor as input.

Most text CNNs I find online do use Conv2d. Here is one from the official Keras examples that uses 1D: https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py#L131

The underlying implementations of Conv1d and Conv2d are very similar (both use ConvNd), so I do not expect a big difference in performance. However, this is just my hypothesis; maybe we need more experiments to find out.

I think it is more about convention - currently most people use Conv2d as opposed to Conv1d.

Impavidity commented 6 years ago

Sorry for the misunderstanding. I understand the idea behind the Conv1d implementation now. It is fine in both cases. We can directly regard the Conv1d implementation as a specific case of the Conv2d implementation. Closing this issue.
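For anyone landing here later, a minimal sketch of why the two views agree, written in plain Python with toy sizes rather than with the real PyTorch layers. All names and sizes below are illustrative assumptions; bias is omitted, and integer values are used so the comparison is exact.

```python
import random

random.seed(0)
D, L, K, O, PAD = 4, 6, 3, 2, 2  # word_dim, sentence length, kernel width, filters, padding

# Sentence matrix: L rows (words), D columns (embedding dimensions).
sentence = [[random.randint(-3, 3) for _ in range(D)] for _ in range(L)]
# Conv1d-style weights, indexed [out_channel][in_channel (embedding dim)][kernel position].
w1 = [[[random.randint(-3, 3) for _ in range(K)] for _ in range(D)] for _ in range(O)]
# The same weights laid out Conv2d-style for one input channel: [out_channel][row][column].
w2 = [[[w1[o][d][k] for d in range(D)] for k in range(K)] for o in range(O)]

def conv1d(x_dl, w):
    """x_dl is (D, L): each embedding dimension is a channel; zero-pad along length."""
    padded = [[0] * PAD + row + [0] * PAD for row in x_dl]
    l_out = L + 2 * PAD - K + 1
    return [[sum(w[o][d][k] * padded[d][t + k] for d in range(D) for k in range(K))
             for t in range(l_out)] for o in range(O)]

def conv2d_full_width(x_ld, w):
    """x_ld is (L, D): one channel; the kernel spans all D columns, so output width is 1."""
    zero = [0] * D
    padded = [zero] * PAD + x_ld + [zero] * PAD
    l_out = L + 2 * PAD - K + 1
    return [[sum(w[o][k][d] * padded[t + k][d] for k in range(K) for d in range(D))
             for t in range(l_out)] for o in range(O)]

transposed = [list(col) for col in zip(*sentence)]  # (L, D) -> (D, L)
out1 = conv1d(transposed, w1)
out2 = conv2d_full_width(sentence, w2)
print(out1 == out2)  # True
```

Both functions sum the same products, only in a different loop order, so a Conv1d with word_dim input channels and a single-channel Conv2d whose kernel covers the full embedding width compute the same feature maps when their weights correspond.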

castorini / castor

Implementation Comparison on conv1d and conv2d for SM and MP-CNN #49