Closed sahilsingh1123 closed 2 years ago
Hi,
Could you please provide the information required by the issue template? In addition, how long is each text? (It may be 5,000 rows, but each record could be an entire Wikipedia page.) Also, how did you start the SparkSession, and what are the configs, especially the memory for the driver/executors?
spark-nlp version: 2.4.3
Spark version: 2.4.5
Java: 8
Python: 3.6
Every row/record has around 100-200 words.
About the Spark configuration:
import pyspark
from pyspark.sql import SparkSession
config = pyspark.SparkConf().setAll([('spark.executor.memory', '2g'), ('spark.executor.cores', '5'), ('spark.cores.max', '5'),
('spark.driver.memory','2g'), ('spark.memory.storageFraction','0.35')])
spark = \
SparkSession.builder.appName('DMXPredictiveAnalytics')\
.config("spark.jars", "/home/fidel/cache_pretrained/sparknlpFATjar.jar")\
.config(conf=config)\
.master('local[*]').getOrCreate()
spark.sparkContext.setLogLevel('ERROR')
2G is very low for both the executor and the driver. Could you please increase them to 8G for a start?
Yes, I had already experimented with increasing the memory to 6G and 8G, but I still got the same error, and the flow always breaks at row 5096.
Interesting; if there is enough memory, this shouldn't happen. How do you know at which row it crashes? Could you please share your full code, from the beginning all the way to the end?
I used one of your pretrained models, explain_document_dl, for text processing. Below is the code snippet. documentPipeline holds the path of the saved explain_document_dl pretrained pipeline, which only needs to be downloaded once via explainDocument = PretrainedPipeline("explain_document_dl", "en").
documentPipeline = infoData.get(pc.DOCUMENTPRETRAINEDPIPELINE)
dataset = dataset.withColumnRenamed(sentimentCol, "text")
loadedDocumentPipeline = PipelineModel.load(documentPipeline)
dataset = loadedDocumentPipeline.transform(dataset)
So once I get the transformed dataset, I try to write it back to HDFS; there it throws an OOM error. To confirm this, I used toLocalIterator() and looped over the whole dataset; in that loop I found that it iterates up to record 5096, and after that it throws the OOM error.
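The counting-loop idea above, looping over records to pinpoint the first one that fails, can be sketched generically. This is a hedged illustration in plain Python, not tied to Spark: `process` stands in for whatever consumes each record yielded by `dataset.toLocalIterator()`, and `find_failing_index` is a hypothetical helper, not part of any library.

```python
def find_failing_index(records, process):
    """Apply `process` to each record in turn and return the index of
    the first record that raises an exception; -1 if all succeed.

    Note: this only works for recoverable exceptions. A JVM-level
    OutOfMemoryError will usually kill the whole process, so in the
    Spark case the index is inferred from the last record printed
    before the crash rather than caught here.
    """
    for i, record in enumerate(records):
        try:
            process(record)
        except Exception:
            return i
    return -1
```

With Spark this would look like `find_failing_index(dataset.toLocalIterator(), handle_row)`, narrowing the failure to a specific record instead of just a failed stage.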
Thanks, one more question regarding your setup. Where do you run this code? I see you assigned memory to executors, but you are manually creating a SparkSession with no info on whether this is a master/slave, local, or YARN cluster.
I would like to know the setup to be sure the memory is actually being set correctly in the config. Also, the title says explain_document_ml, but your example says explain_document_dl: do they both have the same issue? Have you tried just running the code locally on the same dataset in a Python console by using sparknlp.start()?
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
spark = sparknlp.start()
dataset = dataset.withColumnRenamed(sentimentCol, "text")
pipeline = PretrainedPipeline('explain_document_dl', lang='en')
dataset = pipeline.transform(dataset)
dataset.write.parquet(...)
I am interested to see what happens if you use this code locally with the same data.
1st: I run this code locally. 2nd: I tried running both explain_document_dl and _ml to see if either one runs, but I got the same issue with both. And I didn't try running the code in the console with sparknlp.start().
Great thanks. By locally you mean in Jupyter or an IDE?
Did you install it via pip install spark-nlp pyspark? (If it's local, then you don't need executors; it's all on the driver, and the driver memory is the main memory being used.)

I have the same problem while testing a production service. The pre-trained pipeline is embedded in a gRPC handler; at the 468th call, with texts of 500 to 11,000 characters, the service responds with:
An error occurred while calling o321.annotate.
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at scala.collection.immutable.List.$anonfun$flatMap$1(List.scala:345)
	at scala.collection.immutable.List.$anonfun$flatMap$1$adapted(List.scala:338)
	at scala.collection.immutable.List$$Lambda$369/912864190.apply(Unknown Source)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.immutable.List.flatMap(List.scala:338)
	at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
	at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.$anonfun$cartesianProduct$1(Utilities.scala:46)
	at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$$$Lambda$2913/616787763.apply(Unknown Source)
	at scala.collection.immutable.List.flatMap(List.scala:338)
	at com.johnsnowlabs.nlp.annotators.spell.util.Utilities$.cartesianProduct(Utilities.scala:46)
	... (the last four cartesianProduct/flatMap frames repeat recursively)
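The repeated cartesianProduct/flatMap frames in the trace suggest the spell-check annotator enumerates every combination of candidate lists; the output size is the product of the list sizes, so a single unusual token with many candidates per position can exhaust the heap. A minimal sketch of why that blows up (a hypothetical illustration, not Spark NLP's actual code):

```python
def cartesian_size(candidate_lists):
    """Number of combinations a cartesian product of the given lists
    would produce: the product of the list sizes, so the output grows
    multiplicatively with each additional list."""
    size = 1
    for candidates in candidate_lists:
        size *= len(candidates)
    return size

# Even modest per-position candidate counts explode quickly:
# 20 positions with 4 candidates each is 4 ** 20, over a trillion
# combinations, far beyond any reasonable heap.
```

This is consistent with a long or garbled token (rather than total row count) being the trigger for the OOM.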
Spark is started along with the Python app with:
import sparknlp
spark = sparknlp.start(
memory="8G",
real_time_output=True
)
inside a Docker container.
Versions used:
UPDATE: the same thing happens at the very same point in the request queue using:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(
"Spark NLP"
).master("local[4]"
).config("spark.driver.memory","8G"
).config("spark.driver.maxResultSize", "0"
).config("spark.kryoserializer.buffer.max", "2000M"
).config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.3"
).getOrCreate()
and setting JAVA_TOOL_OPTIONS="-Xmx4G -Dorg.bytedeco.javacpp.maxBytes=0"
Hi @Mec-iS
This error is very clear: your input dataset requires more memory than is being allocated to it. There is nothing to do unless:
PS: the number of rows is not an accurate unit for required resources. In NLP it is more about how many sentences, tokens, etc. Each row can contain a document with more than 1000 sentences. Exploding them might help, or simply knowing the total counts so you can allocate resources accordingly.
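As a rough illustration of sizing by sentences and tokens rather than rows, here is a hedged sketch using a naive punctuation/whitespace split (not Spark NLP's sentence detector or tokenizer; `workload_stats` is a hypothetical helper):

```python
def workload_stats(texts):
    """Approximate sentence and token counts for a batch of documents.
    Row count alone says little: one row may hold 1000+ sentences,
    and it is sentences/tokens that drive NLP pipeline memory use."""
    sentences = sum(t.count(".") + t.count("!") + t.count("?") for t in texts)
    tokens = sum(len(t.split()) for t in texts)
    return {"rows": len(texts), "sentences": sentences, "tokens": tokens}
```

Running something like this on a sample of the data before choosing driver/executor memory gives a much better estimate than the row count.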
Hi. Thanks for replying.
In general, AFAIK Spark should in any case be able to handle DataFrames larger than memory, because that is what it is designed for.
But that does not apply to this context at all, as I pass texts to the service for processing one by one; there is no DataFrame involved. The service (a PySpark app wrapped in a gRPC interface) receives a single piece of text at a time (at most a few hundred KB of text). The problem arises when requests reach about 500 consecutive calls, which implies, AFAIK, that memory in the cluster is somehow not cleaned up between runs of the pipeline and keeps accumulating until it eventually runs out. I don't know if this is related to models hoarding cache or anything similar, but there is evidently a problem either in how the stack (Spark driver, cluster, Java, PySpark, Spark NLP, ...) handles memory or in the settings required by Java or the Spark driver to prevent these crashes.
UPDATE: the problem was with the input text. Although the same input is processed correctly by another NLP library, with Spark NLP the service goes rogue and starts hoarding the host machine's resources until it crashes. The problem was, in particular, double-escaped strings.
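Given that double-escaped strings were the trigger, one defensive option is to normalize escapes before the text ever reaches the pipeline. A hedged sketch (`normalize_escapes` is a hypothetical pre-processing helper, not part of Spark NLP; the `unicode_escape` codec is a naive choice that can mangle non-Latin-1 text, so adjust to your data):

```python
import codecs

def normalize_escapes(text: str) -> str:
    """Collapse double-escaped sequences (e.g. the two characters
    backslash + 'n') into the characters they denote, so the NLP
    pipeline never sees literal escape codes as token material."""
    if "\\n" in text or "\\t" in text or '\\"' in text:
        try:
            # Naive normalization; illustration only.
            return codecs.decode(text, "unicode_escape")
        except UnicodeDecodeError:
            return text
    return text
```

Calling this in the gRPC handler before `pipeline.annotate(...)` would have kept the problematic inputs from reaching the spell checker.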
~I am sorry, but we need more details; we have no code or info to reproduce this. There seems to be some sort of licensed model involved, which we don't support here, as we have no knowledge of or access to them. As for Spark handling DataFrames larger than memory: that depends on the task(s). You can run a million rows through a complicated NLP pipeline on 8G of memory, but this doesn't seem to be the case here. I suggest opening a new issue if and only if everything you are using is open-source Spark NLP, not licensed (for that you can ask on Slack #healthcare), and not used by a third party in the middle. (How memory is handled by other libraries cannot be controlled by us.)~
It turns out one of the models in those pipelines had an issue with a memory leak due to some unknown text. (this should be fixed in Spark NLP 3.3.0)
I was trying to use the explain_document_ml pretrained model for text analytics. It was working fine on a smaller dataset, but when I tried using it on a large dataset, it started throwing an out-of-memory error.
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at scala.collection.immutable.HashSet$HashTrieSet.updated0(HashSet.scala:554)
	at scala.collection.immutable.HashSet$HashTrieSet.updated0(HashSet.scala:543)
	at scala.collection.immutable.HashSet$HashTrieSet.updated0(HashSet.scala:543)
	at scala.collection.immutable.HashSet$HashTrieSet.updated0(HashSet.scala:543)
	at scala.collection.immutable.HashSet.$plus(HashSet.scala:84)
	at scala.collection.immutable.HashSet.$plus(HashSet.scala:35)
	at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
	at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
	at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
	at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
	at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:49)
	at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.allWords$lzycompute(NorvigSweetingModel.scala:32)
	at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.allWords(NorvigSweetingModel.scala:31)
	at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.getSuggestion(NorvigSweetingModel.scala:93)
	at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.getBestSpellingSuggestion(NorvigSweetingModel.scala:74)
	at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.checkSpellWord(NorvigSweetingModel.scala:58)
	at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel$$anonfun$annotate$1.apply(NorvigSweetingModel.scala:44)
	at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel$$anonfun$annotate$1.apply(NorvigSweetingModel.scala:43)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel.annotate(NorvigSweetingModel.scala:43)
	at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:35)
	at com.johnsnowlabs.nlp.AnnotatorModel$$anonfun$dfAnnotate$1.apply(AnnotatorModel.scala:34)