allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

LongformerEncoderDecoder overshooting RAM: triggered OOM after training stably for 6-7 hours #204

Closed — kgarg8 closed this issue 3 years ago

kgarg8 commented 3 years ago

I am using Longformer in the following way:

from transformers.models.led.modeling_led import LEDForConditionalGeneration
from transformers.models.led.tokenization_led import LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained(config["embedding_path"])
tokenized_input = tokenizer.encode(input, truncation=True, max_length=16384)

model = LEDForConditionalGeneration.from_pretrained(config["embedding_path"], gradient_checkpointing=True, return_dict=True)

if not config['generate']:
    outputs = model(input_ids=tokenized_input,
                    labels=...,
                    use_cache=False,  # required when gradient_checkpointing is enabled
                    attention_mask=...,
                    decoder_attention_mask=...)

tokenized_input is on the order of ~5k-18k tokens, but I truncate to a maximum length of 16384.
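(Side note, in case it matters for memory: LED mixes local sliding-window attention with global attention, and the usual advice is to put global attention only on the first `<s>` token. A pure-Python sketch of building such a mask — the helper name is hypothetical, not from my code:)

```python
def led_global_attention_mask(input_ids):
    """Hypothetical helper: build a global_attention_mask for LED.

    Convention: 1 = global attention (first token only),
                0 = local sliding-window attention (everything else).
    """
    mask = [[0] * len(seq) for seq in input_ids]
    for row in mask:
        if row:          # guard against empty sequences
            row[0] = 1
    return mask
```

The result would be passed as `global_attention_mask=...` in the forward call, alongside the regular `attention_mask`.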

GPU: V100 with 16 GB of memory

RAM:

              total        used        free      shared  buff/cache   available
Mem:            59G         39G         10G        40M        9.4G         19G
Swap:            0B          0B          0B

Problem:

After training on ~5500-6000 batches of size 4, the process is automatically killed by the OOM killer.

An important observation from top is that RAM usage was steadily and gradually increasing over time; it was around 80% when I last checked, after around 3000 batches had been processed.

Note that the problem is most probably not CUDA running out of memory, since training had already run stably for 6-7 hours.

I tried smaller sequence lengths like 10k and 5k, but the problem remained.

I also checked my code for a memory leak, but I couldn't find one (to the best of my understanding).
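For what it's worth, a stdlib-only way to confirm whether host RAM really grows per batch is to log the process's peak RSS inside the loop. A sketch (the commented training-loop body is a placeholder, not my actual code); the classic cause of this growth pattern is keeping `loss` tensors — and with them their autograd graphs — alive instead of calling `loss.item()`:

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS);
    # assuming Linux here.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Sketch of instrumenting the training loop:
# for step, batch in enumerate(loader):
#     outputs = model(**batch)
#     running_loss += outputs.loss.item()  # .item() frees the graph;
#                                          # storing the tensor itself leaks
#     if step % 100 == 0:
#         print(f"step {step}: peak RSS {peak_rss_mb():.0f} MB")
```

If peak RSS climbs monotonically with the step count even at small sequence lengths, that points to something accumulating across iterations rather than a per-batch allocation being too large.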

Has anyone faced a similar issue before? Any directions to go from here?

wangyongjie-ntu commented 3 months ago

I also encountered this problem when I fine-tuned Longformer on a generated version of the IMDB dataset. Here is the error output:

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation. 
.....
torch.cuda.OutOfMemoryError: CUDA out of memory. 

I had set max_length in my code:

tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length = 1024)

I suspect the tokenizer has bugs in handling some special words or characters.