explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

MemoryError while reading a document using NLP.pipe(text, disable=["tagger", "parser"]) #12510

Closed kvkarthikvenkateshan closed 1 year ago

kvkarthikvenkateshan commented 1 year ago

In this use case I am trying to create noun chunks using spaCy, processing documents in batches (batch size can vary from 20 to 100); I need to process 2.8 million documents overall, using the large spaCy English model. In a for loop I run NLP on each batch of documents using NLP.pipe(text, disable=["tagger", "parser"]). Most of the time this works fine, but for some batches I start getting a MemoryError. What is the reason for this error? Is it an infrastructure issue, such as insufficient CPU or RAM while processing that batch, or is there a problem with the way I am using spaCy in my code?

How to reproduce the behaviour

texts is a Python list of large documents. Note: many of these documents are up to 5 million characters long; the average document is about 1.5 million characters.

import multiprocessing
from typing import Iterable, List, Sequence

import spacy

texts = data_df[CONTENTS].to_list()

# Note: the pool is created here but never used below; create_noun_chunks
# is called directly in the parent process.
with multiprocessing.Pool(processes=no_of_cores) as pool:
    noun_chunks = create_noun_chunks(texts)

NLP = None  # loaded lazily, once per process

def create_noun_chunks(text: Iterable[str]) -> List[Sequence[str]]:
    """
    Create noun chunks for the given texts, after removing entities.
    :param text: texts for which noun chunks are required
    :return: strings; each noun chunk in a string is delimited by \n
    """
    global NLP
    if NLP is None:
        NLP = spacy.load(settings.NLP_LIB)
        NLP.max_length = 5_000_000
    all_chunks = []
    for txt, doc in zip(text, NLP.pipe(text, disable=["tagger", "parser"])):
        ...  # rest of the loop body omitted in the report

It is on this for loop line that I get the MemoryError.

Is the MemoryError caused by each element of texts being a very large string of millions of characters, combined with there being 20 to 100 such elements in the list?
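One way to sanity-check that hypothesis is to log the size of each batch just before piping it and see whether the failing batches are the ones with the largest character volume. A minimal sketch, assuming texts is the batch list from above:

lengths = [len(t) for t in texts]
print(
    f"batch: {len(texts)} docs, "
    f"longest: {max(lengths):,} chars, "
    f"total: {sum(lengths):,} chars"
)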

Traceback:

File "/home/user/projects/my_project/10.6.12/src/core/utilities/text_utility.py", line 174, in create_noun_chunks
    for txt, doc in zip(text, NLP.pipe(text, disable=["tagger", "parser"])):
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/language.py", line 1583, in pipe
    for doc in docs:
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1611, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 230, in pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1560, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1611, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/pipe.pyx", line 53, in pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1611, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/pipe.pyx", line 53, in pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1611, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 79, in pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1630, in raise_error
    raise e
  File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/pipeline/tok2vec.py", line 125, in predict
    tokvecs = self.model.predict(docs)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 315, in predict
    return self._func(self, X, is_train=False)[0]
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/with_array.py", line 40, in forward
    return _list_forward(cast(Model[List2d, List2d], model), Xseq, is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/with_array.py", line 76, in _list_forward
    Yf, get_dXf = layer(Xf, is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/residual.py", line 40, in forward
    Y, backprop_layer = model.layers[0](X, is_train)
  File "/home/`File "/home/user/projects/my_project/10.6.12/src/core/utilities/text_utility.py", line 174, in create_noun_chunks
    for txt, doc in zip(text, NLP.pipe(text, disable=["tagger", "parser"])):
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/language.py", line 1583, in pipe
    for doc in docs:
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1611, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 230, in pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1560, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1611, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/pipe.pyx", line 53, in pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1611, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/pipe.pyx", line 53, in pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1611, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 79, in pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/util.py", line 1630, in raise_error
    raise e
  File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/spacy/pipeline/tok2vec.py", line 125, in predict
    tokvecs = self.model.predict(docs)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 315, in predict
    return self._func(self, X, is_train=False)[0]
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/with_array.py", line 40, in forward
    return _list_forward(cast(Model[List2d, List2d], model), Xseq, is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/with_array.py", line 76, in _list_forward
    Yf, get_dXf = layer(Xf, is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/residual.py", line 40, in forward
    Y, backprop_layer = model.layers[0](X, is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/ptexts = data_df[CONTENTS].to_list()

with multiprocessing.Pool(processes=no_of_cores) as pool:
    noun_chunks = create_noun_chunks(texts)ython3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/maxout.py", line 49, in forward
    Y = model.ops.gemm(X, W, trans2=True)
  File "thinc/backends/numpy_ops.pyx", line 93, in thinc.backends.numpy_ops.NumpyOps.gemm
  File "blis/py.pyx", line 72, in blis.py.gemm
MemoryErroruser/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/ptexts = data_df[CONTENTS].to_list()

with multiprocessing.Pool(processes=no_of_cores) as pool:
    noun_chunks = create_noun_chunks(texts)ython3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/user/anaconda3/envs/10.6.12/lib/python3.7/site-packages/thinc/layers/maxout.py", line 49, in forward
    Y = model.ops.gemm(X, W, trans2=True)
  File "thinc/backends/numpy_ops.pyx", line 93, in thinc.backends.numpy_ops.NumpyOps.gemm
  File "blis/py.pyx", line 72, in blis.py.gemm
MemoryError

Your Environment

shadeMe commented 1 year ago

Please reformat your post to use codeblocks for the sample code and the error messages to preserve line breaks and whitespace. It would also help to know the specifications of the host machine, particularly system and GPU RAM.

Generally speaking, it's advisable to split up very long text strings into smaller ones before passing them to Language.predict/pipe, as the components will not be able to make use of such large contexts anyway. And you are correct in your deduction that the memory error likely stems from both the lengths of the individual documents and the batch size.
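As a minimal sketch of that splitting approach: the newline-based splitting and the 100,000-character limit below are illustrative choices for the sketch, not spaCy requirements, while batch_size is a standard nlp.pipe argument.

import spacy

def split_long_text(text, max_chars=100_000):
    # Illustrative helper (not a spaCy API): yield pieces of at most
    # roughly max_chars characters, cutting on newlines so pieces stay
    # coherent. A single paragraph longer than max_chars would still
    # come through oversized; a production version might split on
    # sentence boundaries instead.
    piece, size = [], 0
    for para in text.split("\n"):
        if piece and size + len(para) + 1 > max_chars:
            yield "\n".join(piece)
            piece, size = [], 0
        piece.append(para)
        size += len(para) + 1
    if piece:
        yield "\n".join(piece)

nlp = spacy.load("en_core_web_lg")
pieces = [p for text in texts for p in split_long_text(text)]
# A smaller batch_size also caps how many documents are held in memory
# at once; noun_chunks requires the parser, so it is not disabled here.
for doc in nlp.pipe(pieces, batch_size=8):
    noun_chunks = list(doc.noun_chunks)

With pieces capped at ~100k characters, the tok2vec activations per document shrink by more than an order of magnitude compared to 1.5M-character inputs, which is where the traceback above shows the allocation failing.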