Closed Ninja16180 closed 4 years ago
There shouldn't be a need to do a +1 anywhere. I am not sure why that solves the issue you linked; I believe that's some other bug in their code.
The issue here comes from the positional embedding, not the token embedding. The positional embedding has a "vocabulary size" of 100, which means the longest sequence that can be passed to it is 100 tokens, because of the way the position indices are created:
pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
If you pass a sequence that is >100 tokens long, the pos tensor will contain indices that are greater than or equal to the number of rows in your embedding matrix, hence you will get an index out of range error.
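A minimal sketch of the failure, assuming a positional embedding with 100 rows as in the tutorial. The value hid_dim = 8 and the helper name position_lookup are arbitrary choices for illustration, not part of the tutorial code. Depending on the PyTorch version, the out-of-range lookup on CPU surfaces as an IndexError or a RuntimeError, so the sketch catches both:

```python
import torch
import torch.nn as nn

max_length = 100   # "vocabulary size" of the positional embedding
hid_dim = 8        # arbitrary small size, chosen only for this sketch
pos_embedding = nn.Embedding(max_length, hid_dim)

def position_lookup(batch_size, seq_len):
    # Same construction as in the model: one row of 0..seq_len-1 per example.
    pos = torch.arange(0, seq_len).unsqueeze(0).repeat(batch_size, 1)
    return pos_embedding(pos)

# 100 tokens: the largest index is 99, which is a valid row.
ok = position_lookup(batch_size=2, seq_len=100)
print(ok.shape)

# 101 tokens: index 100 does not exist in a 100-row table.
try:
    position_lookup(batch_size=2, seq_len=101)
    failed = False
except (IndexError, RuntimeError) as err:
    failed = True
    print("lookup failed:", err)
```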
I should have added this to the tutorials: there needs to be a step where you cut your sentences to a maximum length, which should equal the "vocab size" of your positional embeddings, i.e. the max_length argument passed to the Encoder and Decoder.
One way to double check that this is the cause is to print the sizes of the tensors passed to your model by adding print(src.shape) at the start of the encoder's forward method and print(trg.shape) at the start of the decoder's. I predict you will see a shape with a dimension >100 printed just before the error is thrown.
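Instead of a bare print, the same check can be made to fail early with a readable message. This is only a sketch: check_seq_len is a hypothetical helper, not part of the tutorial, and it assumes the model receives tensors shaped [batch_size, seq_len].

```python
import torch

def check_seq_len(tokens, max_length=100):
    # Assumes tokens is shaped [batch_size, seq_len].
    batch_size, seq_len = tokens.shape
    if seq_len > max_length:
        raise ValueError(
            f"seq_len={seq_len} exceeds max_length={max_length} "
            f"(tensor shape {tuple(tokens.shape)})"
        )
    return seq_len

# A batch of 4 sequences with 20 tokens each passes the check.
short = torch.zeros(4, 20, dtype=torch.long)
print(check_seq_len(short))

# A tensor whose second dimension is >100 is rejected with a clear message.
too_long = torch.zeros(33, 128, dtype=torch.long)
try:
    check_seq_len(too_long)
except ValueError as err:
    message = str(err)
    print(message)
```

Calling it as the first line of the encoder's and decoder's forward methods would report the offending shape before the embedding lookup fails.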
One way to solve this is by setting a max length argument in the tokenizers:
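def tokenize_de(text, max_length=100):
    # keep at most max_length-2 tokens, leaving room for <sos> and <eos>
    return [tok.text for tok in spacy_de.tokenizer(text)][:max_length-2]

def tokenize_en(text, max_length=100):
    # keep at most max_length-2 tokens, leaving room for <sos> and <eos>
    return [tok.text for tok in spacy_en.tokenizer(text)][:max_length-2]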
Note that we have to use max_length-2 because our source and target sentences have <sos> and <eos> tokens added to them.
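To see why the slice uses max_length-2, here is a small sketch that swaps spaCy for a plain whitespace split (an assumption made only to keep the example self-contained) and adds the two special tokens by hand, roughly mimicking what the Field does with init_token and eos_token. The function names are illustrative:

```python
def tokenize(text, max_length=100):
    # Whitespace split stands in for the spaCy tokenizer in this sketch.
    return text.split()[:max_length - 2]

def add_special_tokens(tokens):
    # Mimics the Field's init_token / eos_token behaviour.
    return ["<sos>"] + tokens + ["<eos>"]

long_sentence = " ".join(["wort"] * 150)   # 150 tokens, longer than max_length
tokens = tokenize(long_sentence)
padded = add_special_tokens(tokens)

print(len(tokens))   # 98
print(len(padded))   # 100
```

The final sequence has exactly 100 tokens, so its largest position index is 99, which still fits a 100-row positional embedding.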
Let me know if this solves the issue.
Thanks a lot Ben for your prompt and elaborate response.
Yes, as you rightly predicted, the input tensor size is >100 (as given below): torch.Size([33, 128])
However, I set the max_length argument in the tokenizer functions as you suggested, but unfortunately the same runtime error still occurred:
RuntimeError Traceback (most recent call last)
OK, I've found the error.
The transformer in this tutorial expects the data to be [batch_size, seq_len]; note that the batch is the first dimension. In PyTorch, RNNs by default expect the batch to be the second dimension and, because RNNs are so commonly used in NLP, TorchText by default returns batches shaped [seq_len, batch_size].
The issue here is that your batch and sequence length dimensions are flipped (you have the batch second, which is the default, but the transformer should be fed examples with the batch first). As you're using a batch size >100, the model treats the batch dimension as the sequence length and you get the index out of range error.
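To make the mix-up concrete, here is a short sketch using a dummy tensor with the shape reported above; the variable names are illustrative. Read as [batch_size, seq_len], the batch appears to hold 33 sequences of 128 tokens, which overflows a 100-row positional embedding even though the real sentences are only 33 tokens long. Transposing the tensor is one way to see the intended layout:

```python
import torch

max_length = 100

# Dummy batch as TorchText returns it by default: [seq_len, batch_size].
batch = torch.zeros(33, 128, dtype=torch.long)

# The model reads dim 0 as the batch and dim 1 as the sequence length.
batch_size, seq_len = batch.shape
print(batch_size, seq_len, seq_len > max_length)   # 33 128 True

# With the batch dimension first, the sequence length is the real one.
batch_first = batch.t().contiguous()
batch_size, seq_len = batch_first.shape
print(batch_size, seq_len, seq_len > max_length)   # 128 33 False
```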
Luckily, it's an easy fix. Where you define your fields you need to set batch_first = True, so they should look like:
SRC = Field(tokenize = tokenize_de,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True,
            batch_first = True) #added this

TRG = Field(tokenize = tokenize_en,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True,
            batch_first = True) #added this
With this, you don't actually need the tokenizer to cut things down to max_length-2 anymore, as there are no examples in the Multi30k dataset that are >100 tokens long.
I'll try and stress further in the tutorials that the batch needs to be first for the transformer model.
Thanks a mil! This solution worked!
I am thus closing this issue
Hi Ben, thanks for these awesome tutorials on PyTorch. While following the last one of the series, 6 - Attention is All You Need, I came across an nn.Embedding-related runtime error in the positional encoding layer during training.
I am executing the code in Google Colab.
Detailed error description:
RuntimeError Traceback (most recent call last)