JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

java.lang.NoSuchMethodError: breeze.storage.Zero$.FloatZero()Lbreeze/storage/Zero; #14376

Status: Open · SidWeng opened this issue 3 months ago

SidWeng commented 3 months ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

train a classifier with MPNetEmbeddings

Current Behavior

The following exception is thrown during pipeline.fit():

24/08/22 13:02:34.982 [Executor task launch worker for task 0.0 in stage 2.0 (TID 9)] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 2.0 (TID 9)
java.lang.NoSuchMethodError: breeze.storage.Zero$.FloatZero()Lbreeze/storage/Zero;
    at com.johnsnowlabs.ml.util.LinAlg$.$anonfun$avgPooling$1(LinAlg.scala:112)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at com.johnsnowlabs.ml.util.LinAlg$.avgPooling(LinAlg.scala:112)
    at com.johnsnowlabs.ml.ai.MPNet.getSentenceEmbeddingFromOnnx(MPNet.scala:192)
    at com.johnsnowlabs.ml.ai.MPNet.getSentenceEmbedding(MPNet.scala:74)
    at com.johnsnowlabs.ml.ai.MPNet.$anonfun$predict$1(MPNet.scala:237)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
    at com.johnsnowlabs.ml.ai.MPNet.predict(MPNet.scala:231)
    at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings.batchAnnotate(MPNetEmbeddings.scala:317)
    at com.johnsnowlabs.nlp.HasBatchedAnnotate.processBatchRows(HasBatchedAnnotate.scala:65)
    at com.johnsnowlabs.nlp.HasBatchedAnnotate.$anonfun$batchProcess$1(HasBatchedAnnotate.scala:53)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Expected Behavior

The pipeline should fit without throwing this exception.

Steps To Reproduce

val documentAssembler = new DocumentAssembler()
  .setInputCol("ref")
  .setOutputCol("document")

val sentenceEmbeddings = MPNetEmbeddings.pretrained("all_mpnet_base_v2", "en")
  .setInputCols(Array("document"))
  .setOutputCol("embeddings")

val docClassifier = new ClassifierDLApproach()
  .setInputCols("embeddings")
  .setOutputCol("category")
  .setLabelColumn("label")
  .setBatchSize(8)
  .setMaxEpochs(1)
  .setLr(5e-3f)
  .setDropout(0.5f)
  .setRandomSeed(44)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, sentenceEmbeddings, docClassifier))

val pipelineModel = pipeline.fit(data)

Spark NLP version and Apache Spark

Spark NLP: 5.4.1
Apache Spark: 3.3.0

Type of Spark Application

No response

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

Ubuntu 20.04

Link to your project (if available)

No response

Additional Information

No response

SidWeng commented 3 months ago

This turned out to be a dependency conflict with the Breeze library. After removing the old Breeze version, another exception occurs:

05:38:00.344 [main] ERROR org.apache.spark.broadcast.TorrentBroadcast - Store broadcast broadcast_0 fail, remove all pieces of the broadcast
java.lang.NoClassDefFoundError: breeze/storage/Zero$DoubleZero$
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.spark.util.Utils$.classForName(Utils.scala:218)
  at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537)
  at scala.collection.immutable.List.flatMap(List.scala:366)
  at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535)
  at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502)
  at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226)
  at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
  at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
  at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
  at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
  at org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:266)
  at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432)
  at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:319)
  at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
  at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
  at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
  at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:75)
  at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1529)
  at org.apache.spark.SparkContext.$anonfun$hadoopFile$1(SparkContext.scala:1145)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:806)
  at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1137)
  at org.apache.spark.SparkContext.$anonfun$textFile$1(SparkContext.scala:940)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:806)
  at org.apache.spark.SparkContext.textFile(SparkContext.scala:937)
  at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
  at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
  at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:31)
  at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
  at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:515)
  at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:507)
  at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:44)
  at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:41)
  at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.com$johnsnowlabs$nlp$embeddings$ReadablePretrainedMPNetModel$$super$pretrained(MPNetEmbeddings.scala:474)
  at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedMPNetModel.pretrained(MPNetEmbeddings.scala:401)
  at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedMPNetModel.pretrained$(MPNetEmbeddings.scala:400)
  at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.pretrained(MPNetEmbeddings.scala:474)
  at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.pretrained(MPNetEmbeddings.scala:474)
  at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:47)
  at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:47)
  at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.com$johnsnowlabs$nlp$embeddings$ReadablePretrainedMPNetModel$$super$pretrained(MPNetEmbeddings.scala:474)
  at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedMPNetModel.pretrained(MPNetEmbeddings.scala:398)
  at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedMPNetModel.pretrained$(MPNetEmbeddings.scala:397)
  at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.pretrained(MPNetEmbeddings.scala:474)
  ... 79 elided
Caused by: java.lang.ClassNotFoundException: breeze.storage.Zero$DoubleZero$
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  ... 128 more

I suspect this is related to Kryo, since I set KryoSerializer as the default serializer. Everything works fine after I unset KryoSerializer.
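
For anyone hitting the same thing, here is a sketch of how to check both suspects (paths and flags are illustrative, not taken from the thread). Spark's own jars directory typically ships a Breeze build, so a second breeze jar contributed by the application side can shadow it, and the serializer can be chosen per session:

```shell
# 1) Look for duplicate Breeze jars on the classpath. Spark ships its own
#    Breeze build under $SPARK_HOME/jars; any additional breeze-*.jar
#    (e.g. from an application uber-jar or an extra --jars entry) is a
#    candidate for the NoSuchMethodError / NoClassDefFoundError above.
ls "$SPARK_HOME/jars" | grep -i breeze

# 2) Start spark-shell with Spark's default serializer (JavaSerializer)
#    instead of Kryo, which matches the workaround that helped here.
"$SPARK_HOME"/bin/spark-shell \
  --conf spark.serializer=org.apache.spark.serializer.JavaSerializer
```

Note that disabling Kryo only sidesteps the class-loading path that surfaced the error; the underlying version conflict is still worth resolving.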

maziyarpanahi commented 3 months ago

Please share more details: where you are running this, which Spark distribution it is, what the environment looks like, and how you are installing Spark NLP and starting the SparkSession.

SidWeng commented 3 months ago

OS: Ubuntu 20.04
Spark: 3.3.0
Java: 1.8.0_412
Installation: put spark-nlp-assembly-5.4.1.jar under SPARK_HOME/jars
Start SparkSession: SPARK_HOME/bin/spark-shell --master spark://master-ip:7077

maziyarpanahi commented 2 months ago

Please use --jars PATH/spark-nlp-assembly-5.4.1.jar explicitly in your spark-shell command and try again. It seems there is a mismatch between Spark NLP and Apache Spark versions.
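
A sketch of the suggested invocation, combining the reporter's master URL with the explicit --jars flag (the jar path is a placeholder for wherever the assembly jar lives):

```shell
# Pass the Spark NLP assembly jar explicitly so the driver and executors
# resolve it ahead of any stale copies sitting in SPARK_HOME/jars.
"$SPARK_HOME"/bin/spark-shell \
  --master spark://master-ip:7077 \
  --jars /path/to/spark-nlp-assembly-5.4.1.jar
```

If the jar was previously copied into SPARK_HOME/jars, removing that copy first avoids loading two versions at once.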

If you can quickly run the following in your Ubuntu terminal, it would be a great way to test everything:

conda create -n sparknlp python=3.8 -y
conda activate sparknlp
pip install spark-nlp==5.4.2 pyspark==3.3.1

Then, in the same terminal, open a Python console:

$ python
import sparknlp
spark = sparknlp.start()

# rest of your code