explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

OOM with a lot of memory untouched #9578

Open jakwisn opened 3 years ago

jakwisn commented 3 years ago

The problem

I am training a sentence classification model using a transformer and a pipeline based on the default config, on a custom dataset. When I start training I get:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 1.64 GiB already allocated; 0 bytes free; 1.73 GiB reserved in total by PyTorch)

The weird things are:

Can I specifically ask spaCy/torch to reserve more memory? There must be something wrong with memory allocation, or something is draining the memory.

How to reproduce the behavior

I am running with the DEFT definition dataset, and this is my base config:

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = "../../data/definition_data/train.spacy"
dev = "../../data/definition_data/dev.spacy"

[system]
gpu_allocator = "tensorflow"

[nlp]
lang = "en"
pipeline = ["transformer","textcat"]
batch_size = 256

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}

Your Environment

polm commented 3 years ago

Thanks for the report, sorry you're having trouble with this. I don't think we've seen this particular error before.

You already mention lowering the batch size, which seems to be the general PyTorch advice for dealing with this error. Based on this issue it looks like running out of CPU RAM can also be a problem; could that potentially be the cause in your case?

Separately this jumped out at me:

My trainset has 15k sentences but if I lower this to 12k it works properly

Do you maybe have an unusually long sentence somewhere in the 3k you omitted?
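
If you want to experiment, two knobs in the config you posted that typically affect peak GPU memory are the corpus reader's max_length (to skip unusually long documents during training) and the batcher's padded size. Untested, just to show where they live:

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
# documents longer than this are split on sentence boundaries or skipped during training (0 = no limit)
max_length = 500

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
# smaller padded-batch budget than the 2000 above
size = 1000
buffer = 256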

honnibal commented 3 years ago

I think the issue is:

gpu_allocator = "tensorflow"

We only support transformers on PyTorch currently, so you'll need to change this to pytorch.
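
i.e. in your config:

[system]
gpu_allocator = "pytorch"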

jakwisn commented 3 years ago

Hi, thanks for the pieces of advice!

Meanwhile I updated PyTorch to 1.10.0 and now my error looks like this:

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 6.00 GiB total capacity; 3.95 GiB already allocated; 0 bytes free; 4.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

(The memory amounts are different here because I ran a lot of experiments and this is from one of them.)

Maybe fragmentation could be an issue (source - section "Memory Management")? Can I somehow fix it in spaCy?
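
The error message suggests setting max_split_size_mb, and as far as I can tell that is controlled through PyTorch's PYTORCH_CUDA_ALLOC_CONF environment variable rather than through anything in the spaCy config, so I was going to try something like:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python -m spacy train config.cfg --gpu-id 0

Is that the right way to do it with spacy train, or is there a setting I am missing?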

polm commented 2 years ago

Just a note - based on discussion at the linked PyTorch issue, it looks like it's a problem with PyTorch rather than something in spaCy directly. We'll leave this issue open for now, feel free to comment here if you have trouble with this specifically in spaCy, though do check the linked issue first.

huiMM commented 2 years ago

In my case, the errors were caused by running out of CPU RAM.

mzettwitz commented 1 year ago

Any news on this? I experience the same issue on different machines, now with 24 GB VRAM and CUDA 11.6 on PyTorch 1.12.1.

EdgarPE-Corsearch commented 9 months ago

I have the same issue.