allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

Finetuning XLM Longformer gets a lot of NaN values #115

Open pranav-ust opened 3 years ago

pranav-ust commented 3 years ago

I am trying to follow your code for building a custom Longformer for XLM models (specifically XLM-RoBERTa), but I get NaN values as soon as I start training the model on a downstream classification task, so I am confused about where I am going wrong.

I am using PyTorch 1.6 and Transformers 3.1.

Here's the basic structure of my code; note that the only change from your RoBERTa version is swapping in XLM-RoBERTa:

import logging

from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer
from transformers.modeling_longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)

class RobertaLongSelfAttention(LongformerSelfAttention):
    '''
    Wrapper around Longformer's self attention
    '''
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(
            hidden_states, 
            attention_mask=attention_mask, 
            output_attentions=output_attentions
        )

class RobertaLongForSequenceClassification(XLMRobertaForSequenceClassification):
    '''
    Change all layers to have longer attention
    '''
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)

def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base',
                                                  gradient_checkpointing=True)
    tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base',
                                                    model_max_length=max_pos,
                                                    gradient_checkpointing=True)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

Here's how I load the model once I've saved it to a path. Currently I am using a maximum position length of 1024:

logger.info(f'Converting xlm-roberta-base into xlm-roberta-base-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)

tokenizer = XLMRobertaTokenizer.from_pretrained(model_path, gradient_checkpointing=True)
model_roberta = RobertaLongForSequenceClassification.from_pretrained(model_path)

ibeltagy commented 3 years ago

NaNs are more difficult to debug, so I have a few questions to narrow it down a bit:

shanybarhom commented 3 years ago

Hi, I'm having a similar issue. In my case (fine-tuning, with or without additional pre-training), the NaN issue happens when I pass the model padded input and does not happen when the input is unpadded. I also tried pre-training the model according to the notebook and it works well (there the input is not padded).

I pad the input using the pad_to_max_length flag of the tokenizer:

input = tokenizer("Hello, my dog is cute", max_length=model_args.max_pos,
                  pad_to_max_length=True, return_tensors="pt")["input_ids"]
out = model(input)

ibeltagy commented 3 years ago

Interesting, I don't know how padding might cause NaNs. Do you get NaNs with all batches, or is it random? If you can reliably reproduce the NaN, it might be worth debugging to figure out where it is coming from.
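
One generic way to localize this kind of issue (a sketch, not something from this repo) is to register forward hooks that flag the first submodule whose output contains NaN or inf:

import torch

def add_nan_hooks(model):
    # Raise as soon as any submodule produces a NaN/inf output, naming the module.
    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite values in the output of {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

Calling add_nan_hooks(model) before the failing forward pass should point at the first offending layer.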

shanybarhom commented 3 years ago

Hi, regarding your question - I got NaNs with almost every batch.

I debugged the code and observed that the embedding layer produces NaN and inf values when it gets padded input (line 752 in modeling_bert.py, Transformers 3.0.2), but I didn't find the source of the problem.

This can be reproduced with the following code, which is the notebook code with RoBERTa replaced by XLM-R and two lines added at the end that trigger the NaNs. Note that the NaN problem does not occur in the pre-training part of the notebook.

import logging
import os
import math
import torch
from torch.utils.data.dataset import Dataset
from dataclasses import dataclass, field
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer,TextDataset ,DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

class XLMRobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        out = super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)
        return out

class XLMRobertaLongForMaskedLM(XLMRobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = XLMRobertaLongSelfAttention(config, layer_id=i)

def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-base')
    tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base', model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

def copy_proj_layers(model):
    for i, layer in enumerate(model.roberta.encoder.layer):
        layer.attention.self.query_global = layer.attention.self.query
        layer.attention.self.key_global = layer.attention.self.key
        layer.attention.self.value_global = layer.attention.self.value
    return model

def pretrain_and_evaluate(args, model, tokenizer, eval_only, model_path):
    logger.info(f'Loading the data from {args.val_datapath}')

    val_dataset = TextDataset(tokenizer=tokenizer,
                              file_path=args.val_datapath,
                              block_size=tokenizer.max_len)
    if eval_only:
        train_dataset = val_dataset
    else:
        logger.info(f'Loading and tokenizing training data is usually slow: {args.train_datapath}')
        train_dataset = TextDataset(tokenizer=tokenizer,
                                    file_path=args.train_datapath,
                                    block_size=tokenizer.max_len)

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=train_dataset, eval_dataset=val_dataset, prediction_loss_only=True,)

    eval_loss = trainer.evaluate()
    eval_loss = eval_loss['eval_loss']
    logger.info(f'Initial eval bpc: {eval_loss / math.log(2)}')

    if not eval_only:
        trainer.train(model_path=model_path)
        trainer.save_model()

        eval_loss = trainer.evaluate()
        eval_loss = eval_loss['eval_loss']
        logger.info(f'Eval bpc after pretraining: {eval_loss / math.log(2)}')

@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=1024, metadata={"help": "Maximum position"})

parser = HfArgumentParser((TrainingArguments, ModelArgs,))

training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    '--output_dir', 'tmp_test',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '500',
    '--logging_steps', '50',
    '--save_steps', '50',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '1',
    '--per_gpu_train_batch_size', '1',  # 32GB gpu with fp32
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'

# Choose GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# roberta_base = RobertaForMaskedLM.from_pretrained('roberta-base')
# roberta_base_tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
# logger.info('Evaluating roberta-base (seqlen: 512) for reference ...')
# pretrain_and_evaluate(training_args, roberta_base, roberta_base_tokenizer, eval_only=True, model_path=None)

model_path = f'{training_args.output_dir}/roberta-base-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting xlm-roberta-base into roberta-base-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)

logger.info(f'Loading the model from {model_path}')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
model = XLMRobertaLongForMaskedLM.from_pretrained(model_path)

logger.info(f'Pretraining xlm-roberta-base-{model_args.max_pos} ... ')

# training_args.max_steps = 5   ## <<<<<<<<<<<<<<<<<<<<<<<< REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<

pretrain_and_evaluate(training_args, model, tokenizer, eval_only=True, model_path=training_args.output_dir)

logger.info(f'Copying local projection layers into global projection layers ... ')
model = copy_proj_layers(model)
logger.info(f'Saving model to {model_path}')
model.save_pretrained(model_path)

logger.info(f'Loading the model from {model_path}')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
model = XLMRobertaLongForMaskedLM.from_pretrained(model_path)

input = tokenizer("Hello, my dog is cute", max_length=model_args.max_pos,
                  pad_to_max_length=True, return_tensors="pt")["input_ids"]
out = model(input)
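
To confirm that the non-finite values already appear in the embedding layer, a quick check could look like this (a sketch reusing model and input from the script above):

import torch

with torch.no_grad():
    emb = model.roberta.embeddings(input_ids=input)
    # prints False if the embedding output already contains NaN/inf
    print(torch.isfinite(emb).all().item())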

shanybarhom commented 3 years ago

@ibeltagy Do you have any idea how to solve this?

Any help would be appreciated :)

shanybarhom commented 3 years ago

@pranav-ust @ibeltagy I found a solution that works for me: replace

new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)

with

new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_zeros(max_pos, embed_size)

The idea is that new_empty happens to work for RoBERTa (the returned tensor happens to contain zeros), but for XLM-R it returns uninitialized memory containing garbage values (very large numbers). Since the copy loop starts at position 2, the first two position embeddings are never initialized, so the padding token's position embedding ends up with garbage, which produces the NaN values.
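
A minimal sketch of the corrected allocation inside create_long_model (same variable names as above), which additionally copies the two reserved positions so nothing is left uninitialized:

old_pos_embed = model.roberta.embeddings.position_embeddings.weight
# new_zeros gives zero-initialized storage instead of uninitialized memory
new_pos_embed = old_pos_embed.new_zeros(max_pos, embed_size)
# the copy loop in create_long_model starts at k = 2, so carry over the reserved positions 0 and 1 explicitly
new_pos_embed[:2] = old_pos_embed[:2]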

MarkusSagen commented 3 years ago

Nice! I solved this by lowering the gradient clipping from 5.0 to 1.0; otherwise I also got NaNs when training a full XLM-R model.
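
For reference, in the argument list above this corresponds to lowering --max_grad_norm (a sketch; either form should be equivalent):

# in the HfArgumentParser argument list:
#     '--max_grad_norm', '1.0',   # was '5.0'
# or directly on the parsed TrainingArguments:
training_args.max_grad_norm = 1.0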