JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Out of Memory error when using pretrained model explain_document_ml over more than 5000 records #977

Closed sahilsingh1123 closed 2 years ago

sahilsingh1123 commented 4 years ago

I was trying to use the explain_document_ml pretrained model for text analytics. It was working fine on a smaller dataset, but when I tried using it on a large dataset it started throwing an out-of-memory error.

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.HashSet$HashTrieSet.updated0(HashSet.scala:554)
    at scala.collection.immutable.HashSet$HashTrieSet.updated0(HashSet.scala:543)
    at scala.collection.immutable.HashSet$HashTrieSet.updated0(HashSet.scala:543)
    at scala.collection.immutable.HashSet$HashTrieSet.updated0(HashSet.scala:543)
    at scala.collection.immutable.HashSet.$plus(HashSet.scala:84)
    at scala.collection.immutable.HashSet.$plus(HashSet.scala:35)
    at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
    at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
    at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:49)
    at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.allWords$lzycompute(NorvigSweetingModel.scala:32)
    at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.allWords(NorvigSweetingModel.scala:31)
    at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.getSuggestion(NorvigSweetingModel.scala:93)
    at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.getBestSpellingSuggestion(NorvigSweetingModel.scala:74)
    at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.checkSpellWord(NorvigSweetingModel.scala:58)
    at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel$$anonfun$annotate$1.apply(NorvigSweetingModel.scala:44)
    at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel$$anonfun$annotate$1.apply(NorvigSweetingModel.scala:43)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.annotate(NorvigSweetingModel.scala:43)
    at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:35)
    at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:34)

maziyarpanahi commented 4 years ago

Hi,

Could you please provide the information required in the issue's template? In addition, how long is each text? (It may be 5000 records, but each record can be an entire Wikipedia page.) Also, how did you start the SparkSession, and what is the configuration, especially the memory for the driver/executors?

sahilsingh1123 commented 4 years ago

Spark NLP version: 2.4.3
Spark version: 2.4.5
Java: 8
Python: 3.6

Every row/record has around 100-200 words.

About the Spark configuration:

import pyspark
from pyspark.sql import SparkSession

# 2 GB for both executors and driver, 5 cores
config = pyspark.SparkConf().setAll([('spark.executor.memory', '2g'), ('spark.executor.cores', '5'), ('spark.cores.max', '5'),
                                     ('spark.driver.memory','2g'), ('spark.memory.storageFraction','0.35')])

# local mode, with the Spark NLP fat jar added to the classpath
spark = \
    SparkSession.builder.appName('DMXPredictiveAnalytics')\
        .config("spark.jars", "/home/fidel/cache_pretrained/sparknlpFATjar.jar")\
        .config(conf=config)\
        .master('local[*]').getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
maziyarpanahi commented 4 years ago

2G is very low for both the executor and the driver. Could you please increase them to 8G for a start?
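
For example, something along these lines (just a sketch of the same config from above with the memory raised; the exact values depend on your machine):

import pyspark
from pyspark.sql import SparkSession

# same settings as before, but with driver and executor memory raised to 8g
config = pyspark.SparkConf().setAll([('spark.executor.memory', '8g'), ('spark.executor.cores', '5'), ('spark.cores.max', '5'),
                                     ('spark.driver.memory', '8g'), ('spark.memory.storageFraction', '0.35')])

spark = \
    SparkSession.builder.appName('DMXPredictiveAnalytics')\
        .config("spark.jars", "/home/fidel/cache_pretrained/sparknlpFATjar.jar")\
        .config(conf=config)\
        .master('local[*]').getOrCreate()

Keep in mind that with master('local[*]') everything runs inside the driver JVM, so spark.driver.memory is the value that actually matters, and it has to be in place before the session is created (changing it on an already running session has no effect).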

sahilsingh1123 commented 4 years ago

Yes, I had already experimented with increasing the memory to 6G and 8G, but I still got the same error, and the flow always breaks at row 5096.

maziyarpanahi commented 4 years ago

Interesting, if there is enough memory then this shouldn't happen. How do you know at which row it crashes? Could you please share your full code, from the beginning all the way to the end?

sahilsingh1123 commented 4 years ago

I use one of your explain_document_dl pretrained pipelines for text processing; below is the code snippet. explainDocument = PretrainedPipeline("explain_document_dl", "en") only needs to be downloaded once, and documentPipeline is the saved explain_document_dl pretrained pipeline:

    documentPipeline = infoData.get(pc.DOCUMENTPRETRAINEDPIPELINE)

    dataset = dataset.withColumnRenamed(sentimentCol, "text")

    loadedDocumentPipeline = PipelineModel.load(documentPipeline)

    dataset = loadedDocumentPipeline.transform(dataset)

So once I get the transformed dataset, I try to write it back to HDFS, and that is where it throws the OOM error. To confirm this, I used a local iterator and looped over the whole dataset; in that loop I found that it reaches record 5096 and then throws the OOM error.
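
Roughly, the check looked like this (toLocalIterator is the standard PySpark call; the counter is only there to see where it stops):

count = 0
for row in dataset.toLocalIterator():  # pulls the transformed rows back to the driver one by one
    count += 1
print(count)  # in my runs this reaches about 5096 before the OOM is thrown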

maziyarpanahi commented 4 years ago

Thanks, one more question regarding your setup: where do you run this code? I see you assigned memory to the executors, but you are manually creating a SparkSession with no info on whether this is a master/slave setup, local, or a YARN cluster.

I would like to know the setup to be sure the memory is actually being set right in the config and not before. Also, the title says explain_document_ml but your example uses explain_document_dl; do they both have the same issue? Have you tried simply running the code locally on the same dataset in a Python console using sparknlp.start()?

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()

dataset = dataset.withColumnRenamed(sentimentCol, "text")
pipeline = PretrainedPipeline('explain_document_dl', lang='en')
dataset = pipeline.transform(dataset)
dataset.write.parquet(...)

I am interested to see what happens if you use this code locally with the same data.

sahilsingh1123 commented 4 years ago

1st: I run this code locally. 2nd: I tried running both explain_document_dl and explain_document_ml to see if either of them works, but I got the same issue with both. And I didn't try running the code in the console with sparknlp.start().

maziyarpanahi commented 4 years ago

Great, thanks. By locally, do you mean in Jupyter or an IDE?

Mec-iS commented 3 years ago

I have the same problem while testing a production service. The pre-trained pipeline is embedded in a gRPC handler; at the 468th call, with texts ranging from 500 to 11000 characters, the service responds with:

An error occurred while calling o321.annotateJava.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.List.$anonfun$flatMap$1(List.scala:345)
    at scala.collection.immutable.List.$anonfun$flatMap$1$adapted(List.scala:338)
    at scala.collection.immutable.List$$Lambda$369/912864190.apply(Unknown Source)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.immutable.List.flatMap(List.scala:338)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.$anonfun$cartesianProduct$1(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$$$Lambda$2913/616787763.apply(Unknown Source)
    at scala.collection.immutable.List.flatMap(List.scala:338)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.$anonfun$cartesianProduct$1(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$$$Lambda$2913/616787763.apply(Unknown Source)
    at scala.collection.immutable.List.flatMap(List.scala:338)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.$anonfun$cartesianProduct$1(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$$$Lambda$2913/616787763.apply(Unknown Source)
    at scala.collection.immutable.List.flatMap(List.scala:338)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.$anonfun$cartesianProduct$1(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$$$Lambda$2913/616787763.apply(Unknown Source)
    at scala.collection.immutable.List.flatMap(List.scala:338)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.$anonfun$cartesianProduct$1(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$$$Lambda$2913/616787763.apply(Unknown Source)
    at scala.collection.immutable.List.flatMap(List.scala:338)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.$anonfun$cartesianProduct$1(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$$$Lambda$2913/616787763.apply(Unknown Source)
    at scala.collection.immutable.List.flatMap(List.scala:338)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.$anonfun$cartesianProduct$1(Utilities.scala:46)
    at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$$$Lambda$2913/616787763.apply(Unknown Source)

Spark is started along with the Python app with:

import sparknlp
spark = sparknlp.start(
    memory="8G",
    real_time_output=True
)

inside a Docker container.

Versions used:

UPDATE: The same thing happens at the very same point in the request queue using:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(
        "Spark NLP"
    ).master("local[4]"
    ).config("spark.driver.memory","8G"
    ).config("spark.driver.maxResultSize", "0"
    ).config("spark.kryoserializer.buffer.max", "2000M"
    ).config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.3"
    ).getOrCreate()

and setting JAVA_TOOL_OPTIONS="-Xmx4G -Dorg.bytedeco.javacpp.maxBytes=0"

maziyarpanahi commented 3 years ago

Hi @Mec-iS

This error is very clear: your input dataset requires more memory than is being allocated to it. There is nothing to do unless:

PS: the number of rows is not an accurate unit for the required resources. In NLP it is more about how many sentences, tokens, etc. there are; each row can hold a document with more than 1000 sentences. Exploding them might help, or simply knowing the total number so that resources can be allocated accordingly.
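
For example, something along these lines gives a rough idea of the real workload, assuming the transformed DataFrame still has the usual sentence column produced by the pretrained pipelines:

from pyspark.sql import functions as F

# one row per detected sentence instead of one row per document
sentence_count = dataset.select(F.explode("sentence")).count()
print(sentence_count)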

Mec-iS commented 3 years ago

Hi. Thanks for replying.

In general, afaik Spark should in any case be able to handle dataframes larger than memory, because that is what it is designed for. But that does not really apply here, since I pass texts to the service for processing one by one; there is no dataframe involved. The service (a PySpark app wrapped in a gRPC interface) receives a single piece of text at a time (at most a few hundred KB of text). The problem arises after around 500 consecutive requests, which implies, afaik, that memory in the cluster is somehow not cleaned up between runs of the pipeline and keeps accumulating until it eventually runs out. I don't know whether this is related to the models hoarding cache or anything similar, but there is evidently a problem either in how the pipeline stack (Spark driver, cluster, Java, PySpark, Spark NLP ...) handles memory or in the settings required by Java or the Spark driver to avoid these crashes.

UPDATE: the problem was with the input text. Although the same input is processed correctly by another NLP library, with Spark NLP the service goes rogue and starts hoarding the host machine's resources until it crashes. The problem was, in particular, double-escaped strings.
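
For reference, the kind of normalisation that targets those double-escaped strings looks roughly like this; the exact escape sequences depend on the input source, so treat it as a sketch:

def normalize(text: str) -> str:
    # collapse literal "\n" / "\t" escape sequences back into real whitespace
    # and unescape quotes before handing the text to the pipeline
    return text.replace("\\n", "\n").replace("\\t", "\t").replace('\\"', '"')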

maziyarpanahi commented 3 years ago

~I am sorry, but we need more details; we have no code and no info to reproduce this. It seems some sort of licensed models are involved, which we don't support here since we have no knowledge of, nor access to, them. As for Spark handling a DataFrame larger than its memory, it depends on the task(s): you can run a million rows through a complicated NLP pipeline on 8G of memory, but that doesn't seem to be the case here. I suggest opening a new issue if and only if everything you are using is open-source Spark NLP and not licensed (for that you can ask on Slack #healthcare) and no third party is used in the middle. (How memory is handled by other libraries cannot be controlled by us.)~

It turns out one of the models in those pipelines had an issue with a memory leak caused by some unknown text. (This should be fixed in Spark NLP 3.3.0.)