Closed · kvkarthikvenkateshan closed this issue 1 year ago
Please reformat your post to use codeblocks for the sample code and the error messages to preserve line breaks and whitespace. It would also help to know the specifications of the host machine, particularly system and GPU RAM.
Generally speaking, it's advisable to split up very long text strings into smaller ones before passing them to Language.predict/pipe, as the components will not be able to make use of such large contexts anyway. And you are correct in your deduction that the memory error likely stems from both the lengths of the individual documents and the batch size.
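For illustration, splitting could look something like the sketch below (a minimal example; the paragraph-based splitting, the 100,000-character limit, and the batch size are illustrative assumptions rather than recommendations from this thread):

```python
import spacy

def split_long_text(text, max_length=100_000):
    """Split a long document into pieces of at most ~max_length characters,
    breaking on paragraph boundaries where possible."""
    pieces, current, size = [], [], 0
    for para in text.split("\n\n"):
        if current and size + len(para) > max_length:
            pieces.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces

nlp = spacy.load("en_core_web_lg")

texts = ["..."]  # the original list of very long documents

# Split each long document into smaller pieces and process those instead.
chunks = [piece for text in texts for piece in split_long_text(text)]

# The parser is left enabled here, since doc.noun_chunks relies on the dependency parse.
for doc in nlp.pipe(chunks, batch_size=32):
    noun_chunks = [span.text for span in doc.noun_chunks]
```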
In this use case I am trying to create noun chunks with spaCy for batches of documents (batch size varies from 20 to 100); I need to process 2.8 million documents overall, using the large spaCy English model. In a for loop I run NLP on each batch using `nlp.pipe(text, disable=["tagger", "parser"])`. Most of the time it works fine, but for some batches I start getting MemoryError. What is the reason for this error? Is it an infrastructure issue, such as insufficient CPU/RAM while processing that batch, or is there a problem with the way I am using spaCy in my code?
How to reproduce the behaviour
texts = a Python list of large documents. Note: many of these documents are about 5 million characters long, and the average document is about 1.5 million characters.
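The loop looks roughly like the sketch below (reconstructed from the description, since the original snippet was not posted as a code block; variable names and the max_length setting are assumptions):

```python
import spacy

nlp = spacy.load("en_core_web_lg")

# Documents of up to ~5 million characters exceed spaCy's default limit of
# 1,000,000 characters, so max_length is raised here (an assumption; the
# report does not say how this was handled).
nlp.max_length = 6_000_000

texts = ["..."]  # 20 to 100 documents, averaging ~1.5 million characters each

for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # noun chunk extraction happens here
    ...
```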
It is while running this for loop that I get the memory error.
Is it because each element in texts is a very large string of millions of characters, and there are 20 to 100 such elements in the list, that I am running into the memory error?
Traceback:
Your Environment