bentrevett / pytorch-seq2seq

Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
MIT License

Need help with 6th tutorial (transformer): assert error at train time when changing dataset #57

Closed GorkaUrbizu closed 4 years ago

GorkaUrbizu commented 4 years ago

Hi,

I have an issue when I train the transformer model on another dataset.

I have 2 datasets for a different task:

- dataset A: 15K examples (gold data)
- dataset B: 500K examples (semi-supervised)

Everything works perfectly with both datasets when I use the RNN seq2seq code from the 4th tutorial.

When I train the transformer with dataset A, I don't have any problem either, but when I train the transformer with dataset B I get the following assert error:

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-30-a1cac82f9e9b> in <module>()
     11     start_time = time.time()
     12 
---> 13     train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
     14     valid_loss = evaluate(model, valid_iterator, criterion)
     15 

2 frames

<ipython-input-26-cbf35e313be4> in train(model, iterator, optimizer, criterion, clip)
     25         loss = criterion(output, trg)
     26 
---> 27         loss.backward()
     28 
     29         torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    164                 products. Defaults to ``False``.
    165         """
--> 166         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    167 
    168     def register_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: transform: failed to synchronize: cudaErrorAssert: device-side assert triggered

I didn't change the code much: I read my dataset and set the following hyperparameters (but I get the same error with any other combination):

batch size = 32 

hid_dim = 128
n_layers = 2
n_heads = 2
pf_dim = 512
dropout = 0.1

I limited the vocab and length of sentences to avoid OOM issues.

I don't know if anyone could help me find what the issue is here. I don't know whether the transformer implementation has any limitation on vocab/length/dataset/encoding that doesn't support dataset B, or whether I'm making some other mistake.

Gorka

bentrevett commented 4 years ago

This error is usually due to your labels not being in the range of [0, n_classes-1].

I'll have to see more of your code to see if we can figure out the issue.
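
In the meantime, a quick sanity check you could run (a minimal sketch, assuming you've kept the tutorial's TRG field and train_iterator names) is to confirm every target index is inside the output vocabulary:

# every target token index fed to CrossEntropyLoss must lie in [0, len(TRG.vocab) - 1];
# an out-of-range index triggers exactly this kind of device-side assert on the GPU
output_dim = len(TRG.vocab)

for batch in train_iterator:
    trg = batch.trg  # [batch size, trg sent len]
    assert trg.min().item() >= 0, f"negative target index: {trg.min().item()}"
    assert trg.max().item() < output_dim, (
        f"target index {trg.max().item()} out of range for vocab size {output_dim}"
    )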

GorkaUrbizu commented 4 years ago

Here is the code, thank you in advance.

"""
Coreference Resolution for Basque with a Transformer Sequence-to-Sequence Model
*******************************************************************************

**Author**: [Gorka Urbizu](https://github.com/gorka96)

(based on the code of [bentrevett](https://github.com/bentrevett/pytorch-seq2seq))

This notebook trains a transformer model for coreference resolution in Basque.

After training the model in this notebook, you will be able to input a Basque sentence:

*   **"Garailea Europan ariko da heldu den sasoian ."**

and get back its predicted coreference clusters:

*   **"(1) (2) _ _ (3 _ 3) _"**

... to varying degrees of success.
"""

# Load the Drive helper and mount
from google.colab import drive
# This will prompt for authorization.
drive.mount('/content/drive')

import torch
import torch.nn as nn
import torch.optim as optim

import torchtext
from torchtext.data import Field, BucketIterator, TabularDataset

#import spacy
import csv

import random
import math
import os
import time

SEED = 1

random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

def load_data(file):
    with open(file) as f:
        data = []
        for line in f.readlines():
            data.append(line.strip("\n"))
    return data

def save2tsv(file1, file2, fname):
    with open(fname, 'w', newline='') as f_out:
        tsv_out = csv.writer(f_out, delimiter='\t')
        tsv_out.writerow(["src", "trg"])
        # keep an example only if its *source* is at most 100 tokens
        # (note: the target length is not checked here)
        for i in range(len(file1)):
            if len(file1[i].split()) <= 100:
                tsv_out.writerow([file1[i], file2[i]])

path = "drive/My Drive/Colab Notebooks/seq2seq-coref/"
d_path = path+"data/"

train_src_data = load_data(d_path+"goldLazyTrain.src")
train_trg_data = load_data(d_path+"goldLazyTrain.trg")
dev_src_data = load_data(d_path+"dev.src")
dev_trg_data = load_data(d_path+"dev.trg")
test_src_data = dev_src_data
test_trg_data = dev_trg_data

# Take a look at some sentences in data
for i in range(3):
  print(train_src_data[i].strip("\n"))
  print(train_trg_data[i].strip("\n"))  

save2tsv(train_src_data, train_trg_data, "train.tsv")
save2tsv(dev_src_data, dev_trg_data, "val.tsv")  
save2tsv(test_src_data, test_trg_data, "test.tsv") 

!echo "example from test:"
!head -3 test.tsv
!echo "number of examples for training:"
!wc -l train.tsv

SRC = Field(init_token='<sos>', eos_token='<eos>', batch_first=True)
TRG = Field(init_token='<sos>', eos_token='<eos>', batch_first=True)

train_data, valid_data, test_data =  TabularDataset.splits(path='./',
                                      format='tsv', train='train.tsv',
                                      validation='val.tsv', test='test.tsv', 
                                      fields=[('src', SRC), ('trg', TRG)], 
                                      skip_header=False)

SRC.build_vocab(train_data, min_freq=25)
TRG.build_vocab(train_data, min_freq=1)

print(f"Unique tokens in source (text) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (coref) vocabulary: {len(TRG.vocab)}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 8

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size=BATCH_SIZE,
     sort_within_batch = True,
     sort_key = lambda x : len(x.src),
     device=device)

class Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, n_layers, n_heads, pf_dim, encoder_layer, self_attention, positionwise_feedforward, dropout, device):
        super().__init__()

        self.input_dim = input_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.pf_dim = pf_dim
        self.encoder_layer = encoder_layer
        self.self_attention = self_attention
        self.positionwise_feedforward = positionwise_feedforward
        self.dropout = dropout
        self.device = device

        self.tok_embedding = nn.Embedding(input_dim, hid_dim)
        self.pos_embedding = nn.Embedding(1000, hid_dim)

        self.layers = nn.ModuleList([encoder_layer(hid_dim, n_heads, pf_dim, self_attention, positionwise_feedforward, dropout, device) 
                                     for _ in range(n_layers)])

        self.do = nn.Dropout(dropout)

        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)

    def forward(self, src, src_mask):

        #src = [batch size, src sent len]
        #src_mask = [batch size, src sent len]

        pos = torch.arange(0, src.shape[1]).unsqueeze(0).repeat(src.shape[0], 1).to(self.device)

        src = self.do((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))

        #src = [batch size, src sent len, hid dim]

        for layer in self.layers:
            src = layer(src, src_mask)

        return src

class EncoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, self_attention, positionwise_feedforward, dropout, device):
        super().__init__()

        self.ln = nn.LayerNorm(hid_dim)
        self.sa = self_attention(hid_dim, n_heads, dropout, device)
        self.pf = positionwise_feedforward(hid_dim, pf_dim, dropout)
        self.do = nn.Dropout(dropout)

    def forward(self, src, src_mask):

        #src = [batch size, src sent len, hid dim]
        #src_mask = [batch size, src sent len]

        src = self.ln(src + self.do(self.sa(src, src, src, src_mask)))

        src = self.ln(src + self.do(self.pf(src)))

        return src

class SelfAttention(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()

        self.hid_dim = hid_dim
        self.n_heads = n_heads

        assert hid_dim % n_heads == 0

        self.w_q = nn.Linear(hid_dim, hid_dim)
        self.w_k = nn.Linear(hid_dim, hid_dim)
        self.w_v = nn.Linear(hid_dim, hid_dim)

        self.fc = nn.Linear(hid_dim, hid_dim)

        self.do = nn.Dropout(dropout)

        self.scale = torch.sqrt(torch.FloatTensor([hid_dim // n_heads])).to(device)

    def forward(self, query, key, value, mask=None):

        bsz = query.shape[0]

        #query = key = value [batch size, sent len, hid dim]

        Q = self.w_q(query)
        K = self.w_k(key)
        V = self.w_v(value)

        #Q, K, V = [batch size, sent len, hid dim]

        Q = Q.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        K = K.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        V = V.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)

        #Q, K, V = [batch size, n heads, sent len, hid dim // n heads]

        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale

        #energy = [batch size, n heads, sent len, sent len]

        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)

        attention = self.do(torch.softmax(energy, dim=-1))

        #attention = [batch size, n heads, sent len, sent len]

        x = torch.matmul(attention, V)

        #x = [batch size, n heads, sent len, hid dim // n heads]

        x = x.permute(0, 2, 1, 3).contiguous()

        #x = [batch size, sent len, n heads, hid dim // n heads]

        x = x.view(bsz, -1, self.n_heads * (self.hid_dim // self.n_heads))

        #x = [batch size, src sent len, hid dim]

        x = self.fc(x)

        #x = [batch size, sent len, hid dim]

        return x

class PositionwiseFeedforward(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        self.pf_dim = pf_dim

        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.fc_2 = nn.Linear(pf_dim, hid_dim)

        self.do = nn.Dropout(dropout)

    def forward(self, x):

        #x = [batch size, sent len, hid dim]

        x = self.do(torch.relu(self.fc_1(x)))

        #x = [batch size, sent len, pf dim]

        x = self.fc_2(x)

        #x = [batch size, sent len, hid dim]

        return x

class Decoder(nn.Module):
    def __init__(self, output_dim, hid_dim, n_layers, n_heads, pf_dim, decoder_layer, self_attention, positionwise_feedforward, dropout, device):
        super().__init__()

        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.pf_dim = pf_dim
        self.decoder_layer = decoder_layer
        self.self_attention = self_attention
        self.positionwise_feedforward = positionwise_feedforward
        self.dropout = dropout
        self.device = device

        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(1000, hid_dim)

        self.layers = nn.ModuleList([decoder_layer(hid_dim, n_heads, pf_dim, self_attention, positionwise_feedforward, dropout, device)
                                     for _ in range(n_layers)])

        self.fc = nn.Linear(hid_dim, output_dim)

        self.do = nn.Dropout(dropout)

        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)

    def forward(self, trg, src, trg_mask, src_mask):

        #trg = [batch_size, trg sent len]
        #src = [batch_size, src sent len]
        #trg_mask = [batch size, trg sent len]
        #src_mask = [batch size, src sent len]

        pos = torch.arange(0, trg.shape[1]).unsqueeze(0).repeat(trg.shape[0], 1).to(self.device)

        trg = self.do((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))

        #trg = [batch size, trg sent len, hid dim]

        for layer in self.layers:
            trg = layer(trg, src, trg_mask, src_mask)

        return self.fc(trg)

class DecoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, self_attention, positionwise_feedforward, dropout, device):
        super().__init__()

        self.ln = nn.LayerNorm(hid_dim)
        self.sa = self_attention(hid_dim, n_heads, dropout, device)
        self.ea = self_attention(hid_dim, n_heads, dropout, device)
        self.pf = positionwise_feedforward(hid_dim, pf_dim, dropout)
        self.do = nn.Dropout(dropout)

    def forward(self, trg, src, trg_mask, src_mask):

        #trg = [batch size, trg sent len, hid dim]
        #src = [batch size, src sent len, hid dim]
        #trg_mask = [batch size, trg sent len]
        #src_mask = [batch size, src sent len]

        trg = self.ln(trg + self.do(self.sa(trg, trg, trg, trg_mask)))

        trg = self.ln(trg + self.do(self.ea(trg, src, src, src_mask)))

        trg = self.ln(trg + self.do(self.pf(trg)))

        return trg

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, sos_idx, pad_idx, device, maxlen=100):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.sos_idx = sos_idx
        self.pad_idx = pad_idx
        self.device = device
        self.maxlen = maxlen

    def make_masks(self, src, trg):

        #src = [batch size, src sent len]
        #trg = [batch size, trg sent len]

        src_mask = (src != self.pad_idx).unsqueeze(1).unsqueeze(2)

        trg_pad_mask = (trg != self.pad_idx).unsqueeze(1).unsqueeze(3)

        #src_mask = [batch size, 1, 1, src sent len]
        #trg_pad_mask = [batch size, 1, trg sent len, 1]

        trg_len = trg.shape[1]

        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device=self.device)).bool()

        #trg_sub_mask = [trg sent len, trg sent len]

        trg_mask = trg_pad_mask & trg_sub_mask

        #trg_mask = [batch size, 1, trg sent len, trg sent len]

        return src_mask, trg_mask

    def forward(self, src, trg):

        #src = [batch size, src sent len]
        #trg = [batch size, trg sent len]

        src_mask, trg_mask = self.make_masks(src, trg)

        enc_src = self.encoder(src, src_mask)

        #enc_src = [batch size, src sent len, hid dim]

        out = self.decoder(trg, enc_src, trg_mask, src_mask)

        #out = [batch size, trg sent len, output dim]

        return out

    def translate_sequences(self, src):
        #src = [batch size, src sent len]

        batch_size, src_len = src.shape
        trg = src.new_full((batch_size, 1), self.sos_idx)
        #trg = [batch size, 1]
        src_mask, trg_mask = self.make_masks(src, trg)

        enc_src = self.encoder(src, src_mask)

        #enc_src = [batch size, src sent len, hid dim]

        translation_step = 0
        while translation_step < src_len-1: #self.maxlen
            out = self.decoder(trg, enc_src, trg_mask, src_mask)
            # out - [batch size, trg sent len, output dim]
            out = torch.argmax(out[:, -1], dim=1) # batch size
            out = out.unsqueeze(1) # batch size, 1
            trg = torch.cat((trg, out), dim=1)
            # trg - [batch size, trg sent len]
            src_mask, trg_mask = self.make_masks(src, trg)
            translation_step += 1
        return trg

input_dim = len(SRC.vocab)

hid_dim = 128
n_layers = 2
n_heads = 2
pf_dim = 512
dropout = 0.1

enc = Encoder(input_dim, hid_dim, n_layers, n_heads, pf_dim, EncoderLayer, SelfAttention, PositionwiseFeedforward, dropout, device)

output_dim = len(TRG.vocab)

hid_dim = 128
n_layers = 2
n_heads = 2
pf_dim = 512
dropout = 0.1

dec = Decoder(output_dim, hid_dim, n_layers, n_heads, pf_dim, DecoderLayer, SelfAttention, PositionwiseFeedforward, dropout, device)

pad_idx = SRC.vocab.stoi['<pad>']
sos_idx = SRC.vocab.stoi['<sos>']

model = Seq2Seq(enc, dec, sos_idx, pad_idx, device).to(device)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step = None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
            min(step ** (-0.5), step * self.warmup ** (-1.5)))

    def zero_grad(self):
        self.optimizer.zero_grad()

optimizer = NoamOpt(hid_dim, 1, 2000,
            torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

def train(model, iterator, optimizer, criterion, clip):

    model.train()

    epoch_loss = 0

    for i, batch in enumerate(iterator):

        src = batch.src
        trg = batch.trg

        optimizer.zero_grad()

        output = model(src, trg[:,:-1])

        #output = [batch size, trg sent len - 1, output dim]
        #trg = [batch size, trg sent len]

        output = output.contiguous().view(-1, output.shape[-1])
        trg = trg[:,1:].contiguous().view(-1)

        #output = [batch size * trg sent len - 1, output dim]
        #trg = [batch size * trg sent len - 1]

        loss = criterion(output, trg)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):

    model.eval()

    epoch_loss = 0

    with torch.no_grad():

        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg[:,:-1])

            #output = [batch size, trg sent len - 1, output dim]
            #trg = [batch size, trg sent len]

            output = output.contiguous().view(-1, output.shape[-1])
            trg = trg[:,1:].contiguous().view(-1)

            #output = [batch size * trg sent len - 1, output dim]
            #trg = [batch size * trg sent len - 1]

            loss = criterion(output, trg)

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

!nvidia-smi

N_EPOCHS = 100
CLIP = 1

in_ear_st = 5
early_stop = in_ear_st

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'transCR_model.pt')
        early_stop = in_ear_st
    else:    
        early_stop -= 1
        if early_stop == 0: 
            break

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

model.load_state_dict(torch.load('transCR_model.pt'))
model.eval()

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

bentrevett commented 4 years ago

The code seems to run fine when I use it with the Multi30K dataset. So there is an issue with your dataset B.

The weird error message is due to CUDA processing data asynchronously, hence the error trace gives an ambiguous message and usually doesn't point to the line containing the actual error.

Try running the model on the CPU by changing:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

to

device = torch.device('cpu')

Are you running this on Google Colab? It might also be some Colab error but I am not sure.
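
If you'd rather stay on the GPU, another option (not from the tutorial, just a standard PyTorch/CUDA debugging trick) is to force synchronous kernel launches so the traceback points at the operation that actually failed. It has to run before anything touches CUDA, e.g. in the very first cell:

import os

# force synchronous CUDA kernel launches so the error surfaces at the call that caused it;
# must run before the first CUDA operation (e.g. before moving the model to the GPU)
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'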

bentrevett commented 4 years ago

One quick thing to check is that none of your sequences are longer than 998 tokens. The positional embeddings are initialized by `self.pos_embedding = nn.Embedding(1000, hid_dim)`, which means they support a maximum length of 1000 positions.

However, we also append an `<sos>` and `<eos>` token to every tensor before putting it in the model, so if your sequence is longer than 998 tokens it'll be >1000 tokens long with those appended, and the positional embedding will throw an error.
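
A quick way to verify this is to print the longest tokenized sequence on each side; a rough sketch, reusing the train_data/valid_data/test_data objects from your notebook:

# longest tokenized source/target sequences per split; anything above 998 tokens
# will overflow the 1000-position nn.Embedding once <sos> and <eos> are appended
for name, dataset in [('train', train_data), ('valid', valid_data), ('test', test_data)]:
    max_src = max(len(ex.src) for ex in dataset.examples)
    max_trg = max(len(ex.trg) for ex in dataset.examples)
    print(f'{name}: longest src = {max_src} tokens, longest trg = {max_trg} tokens')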

Again, because CUDA processes things asynchronously (and has terrible error messages), you get the unhelpful error trace you do.

Try running the model on the CPU by changing device as I've mentioned above and it should give you a more reasonable error message.

GorkaUrbizu commented 4 years ago

Thanks Ben,

There isn't any problem with the length of the sequences, as I decided to limit the sentences to 100 tokens because I had some sequences longer than 1000 tokens.

And yes, I'm running this on Colab. I will run it on the CPU and report the result/error message that I get.

GorkaUrbizu commented 4 years ago

I got this error, as you pointed out, and as the error suggests, the problem was with the length...

I was checking only the source data length and not the target sequence length (they were supposed to be the same length, but I had an alignment problem). I changed that and everything runs perfectly. Thanks for your help and time!

RuntimeError: index out of range: Tried to access index 1000 out of table with 999 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
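
For reference, the fix boils down to filtering on both sides when writing the TSV files; a minimal sketch of the adjusted save2tsv (the exact code I ran may differ slightly):

def save2tsv(file1, file2, fname):
    # keep an example only if BOTH the source and the target are at most 100 tokens,
    # so neither side can overflow the positional embeddings
    with open(fname, 'w', newline='') as f_out:
        tsv_out = csv.writer(f_out, delimiter='\t')
        tsv_out.writerow(["src", "trg"])
        for src_line, trg_line in zip(file1, file2):
            if len(src_line.split()) <= 100 and len(trg_line.split()) <= 100:
                tsv_out.writerow([src_line, trg_line])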