Closed: LucaPifferettiPrivate closed this issue 11 months ago
Hi,
It's really hard to give targeted advice without detailed info, but I can share some points:
Try saving the result DataFrame, e.g. result.write.mode("overwrite").parquet("./..."), then reading it back (spark.read.parquet("./...")) before moving forward. This guarantees all the stages are executed only once. (I would try this first.)
I highly recommend watching this webinar for further information regarding hardware acceleration: https://www.johnsnowlabs.com/watch-webinar-speed-optimization-benchmarks-in-spark-nlp-3-making-the-most-of-modern-hardware/
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 5 days
Is there an existing issue for this?
Who can help?
No response
What are you working on?
I'm using the Spark NLP NER model XlmRoBertaForTokenClassification (https://sparknlp.org/2022/08/14/xlmroberta_ner_edwardjross_base_finetuned_panx_it_3_0.html) to find person names inside a text column of variable length.
Current Behavior
I'm applying this model to find person names inside a text column, and it takes 50 minutes to produce results even with a large configuration (14 executors, each with 14 cores and 14 GB of memory). During the preprocessing phase I split sentences longer than 200 characters.
Expected Behavior
I would like to improve performance and reduce the computation time, which seems excessive for a simple 5 GB DataFrame.
Steps To Reproduce
val columns: Seq[String] = df.columns.toSeq
val specialCharactersRegex: String = "[\"`'#%&,:;<>=@{}~\$\(\)\*\+\/\\\?\[\]\^\|è]" // regex to filter only special characters
val partitionNumber = 196 // repartition number for 14 executors with 14 cores
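To illustrate what the special-character regex above does in isolation, here is a small self-contained sketch. `cleanText` is a hypothetical helper (not from the original snippet), and the regex is rewritten with plain Java-regex escaping since it runs outside a Spark SQL expression:

```scala
// Same character class as the snippet above, escaped for a Scala string literal:
// backslash, [ and ] are escaped for the Java regex engine; the rest are literal
// inside a character class.
val specialCharactersRegex: String = "[\"`'#%&,:;<>=@{}~$()*+/\\\\?\\[\\]^|è]"

// Hypothetical helper: replace special characters with spaces, then collapse
// runs of whitespace so the NER tokenizer sees clean word boundaries.
def cleanText(s: String): String =
  s.replaceAll(specialCharactersRegex, " ").replaceAll("\\s+", " ").trim

println(cleanText("Mario; Rossi (CEO) @ Acme")) // prints "Mario Rossi CEO Acme"
```

In the actual pipeline the same pattern would be applied per row, e.g. with `regexp_replace` on the text column before repartitioning to `partitionNumber`.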
Spark NLP version and Apache Spark
spark-nlp: 4.2.8
scala: 2.12.17
spark: 3.3.1
Type of Spark Application
No response
Java Version
No response
Java Home Directory
No response
Setup and installation
No response
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
No response