explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

OOM with a lot of memory untouched #9578

Open jakwisn opened 3 years ago

jakwisn commented 3 years ago

The problem

I am training a sentence classification model using a transformer and a pipeline based on the default config, on a custom dataset. When I start training I get:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 1.64 GiB already allocated; 0 bytes free; 1.73 GiB reserved in total by PyTorch)

The weird things are:

Can I specifically ask spaCy/torch to reserve more memory? There must be something wrong with memory allocation, or something is draining the memory.

How to reproduce the behavior

I am running with the DEFT definition dataset, and this is my base config:

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = "../../data/definition_data/train.spacy"
dev = "../../data/definition_data/dev.spacy"

[system]
gpu_allocator = "tensorflow"

[nlp]
lang = "en"
pipeline = ["transformer","textcat"]
batch_size = 256

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}

Your Environment

polm commented 3 years ago

Thanks for the report, sorry you're having trouble with this. I don't think we've seen this particular error before.

You already mention lowering the batch size, which seems to be the general PyTorch advice for dealing with this error. Based on this issue it looks like running out of CPU RAM can also be a problem; could that potentially be the cause in your case?

Separately this jumped out at me:

My trainset has 15k sentences but if I lower this to 12k it works properly

Do you maybe have an unusually long sentence somewhere in the 3k you omitted?
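
If you want to experiment, two knobs in the config you posted that typically affect peak GPU memory are the corpus reader's max_length (to skip unusually long documents during training) and the batcher's padded size. Untested, just to show where they live:

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
# documents longer than this are split on sentence boundaries or skipped during training (0 = no limit)
max_length = 500

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
# smaller padded-batch budget than the 2000 above
size = 1000
buffer = 256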

honnibal commented 3 years ago

I think the issue is:

gpu_allocator = "tensorflow"

We only support transformers on PyTorch currently, so you'll need to change this to pytorch.
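
i.e. in your config:

[system]
gpu_allocator = "pytorch"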

jakwisn commented 3 years ago

Hi, thanks for the pieces of advice!

Meanwhile I updated PyTorch to 1.10.0 and now my error looks like this:

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 6.00 GiB total capacity; 3.95 GiB already allocated; 0 bytes free; 4.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

(The memory amounts are different here because I ran a lot of experiments and this is from one of them.)

Maybe fragmentation could be an issue (source - section "Memory Management")? Can I somehow fix it in spaCy?
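
The error message suggests setting max_split_size_mb, and as far as I can tell that is controlled through PyTorch's PYTORCH_CUDA_ALLOC_CONF environment variable rather than through anything in the spaCy config, so I was going to try something like:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python -m spacy train config.cfg --gpu-id 0

Is that the right way to do it with spacy train, or is there a setting I am missing?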

polm commented 2 years ago

Just a note - based on discussion at the linked PyTorch issue, it looks like it's a problem with PyTorch rather than something in spaCy directly. We'll leave this issue open for now, feel free to comment here if you have trouble with this specifically in spaCy, though do check the linked issue first.

huiMM commented 2 years ago

In my case, the errors were caused by running out of CPU RAM.

mzettwitz commented 1 year ago

Any news on this? I experience the same issue on different machines, now with 24 GB VRAM and CUDA 11.6 on PyTorch 1.12.1.

EdgarPE-Corsearch commented 9 months ago

I have the same issue.