JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Py4JError: com.johnsnowlabs.nlp.DocumentAssembler does not exist in the JVM #14385

Closed: omnoy closed this 2 months ago

omnoy commented 2 months ago

Is there an existing issue for this?

Who can help?

@maziyarpanahi

What are you working on?

I've been struggling to run a simple Spark NLP sentiment analysis demo, initializing through a SparkSession, since I plan to build an integration between Kafka and Spark NLP.

Current Behavior

Exception in thread "Thread-5" java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at py4j.reflection.CurrentThreadClassLoadingStrategy.classForName(CurrentThreadClassLoadingStrategy.java:40)
    at py4j.reflection.ReflectionUtil.classForName(ReflectionUtil.java:51)
    at py4j.reflection.TypeUtil.forName(TypeUtil.java:243)
    at py4j.commands.ReflectionCommand.getUnknownMember(ReflectionCommand.java:175)
    at py4j.commands.ReflectionCommand.execute(ReflectionCommand.java:87)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 10 more
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving

Expected Behavior

The demo should run without errors.

Steps To Reproduce

Run the following two code blocks in Jupyter notebook cells, using a conda kernel with Python 3.10.14; I also tried a regular Python 3.10.14 kernel with the same result.

# Initialize PySpark
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-10_2.13:3.5.1,org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.1,com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.2 pyspark-shell'

sc = pyspark.SparkContext(appName="cluster")
spark = SparkSession.builder \
    .appName("FinalProject") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .getOrCreate()
from sparknlp import DocumentAssembler
from sparknlp.base import *
from sparknlp.pretrained import *
from sparknlp.annotation import *
from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifier = SentimentDLModel().pretrained('sentimentdl_use_twitter') \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment")

nlp_pipeline = Pipeline(stages=[document_assembler, use, classifier])

l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = l_model.fullAnnotate(["im meeting up with one of my besties tonight! Cant wait!!  - GIRL TALK!!", "is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"])
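
For context, on a setup where the session starts correctly, fullAnnotate returns one dictionary per input text, keyed by the pipeline's output columns. A minimal sketch of reading back the predicted labels (the column name is taken from the pipeline above):

for result in annotations:
    # Each result maps output columns ("document", "sentence_embeddings",
    # "sentiment") to lists of Annotation objects with .result / .metadata.
    print([annotation.result for annotation in result["sentiment"]])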

Spark NLP version and Apache Spark

sparknlp.version() = 5.4.2, spark.version = 3.5.1

Type of Spark Application

Python Application

Java Version

openjdk version "1.8.0_422"; same results with Java 11 as well.

Java Home Directory

No response

Setup and installation

Spark NLP and Spark were installed with pip, both in the conda environment and in base Python. The Spark NLP JAR was downloaded and moved to /opt/spark/jars.
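
For what it's worth, a quick sanity check that the kernel is resolving the intended installs (a minimal sketch; both values should agree with the JAR copied into /opt/spark/jars):

import pyspark
import sparknlp

# Print the versions and on-disk locations the kernel actually resolves;
# a pip-installed spark-nlp that differs from the JAR under /opt/spark/jars
# is a common source of "does not exist in the JVM" errors.
print("pyspark:", pyspark.__version__, "from", pyspark.__file__)
print("spark-nlp:", sparknlp.version(), "from", sparknlp.__file__)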

Operating System and Version

5.15.153.1-microsoft-standard-WSL2, Ubuntu 22.04.2 LTS

Link to your project (if available)

No response

Additional Information

No response

maziyarpanahi commented 2 months ago

Hi,

Could you please add the Maven package or the Fat JAR (from our release notes) to the SparkSession explicitly? https://sparknlp.org/docs/en/install#start-spark-nlp-session-from-python

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.2") \
    .getOrCreate()

Here is an example: https://colab.research.google.com/drive/1lFozDD16iRQg5z3wQKUi0-7oUKfPOw6H?usp=sharing
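
The same install page also documents a one-line helper that builds an equivalent session:

import sparknlp

# sparknlp.start() pulls in the matching Maven package and returns a
# configured SparkSession, avoiding a hand-written spark.jars.packages.
spark = sparknlp.start()
print(sparknlp.version(), spark.version)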

omnoy commented 2 months ago

Added it, and the same error pops up. Running the example ipynb locally also hits the same issue:

Exception in thread "Thread-5" java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at py4j.reflection.CurrentThreadClassLoadingStrategy.classForName(CurrentThreadClassLoadingStrategy.java:40)
    at py4j.reflection.ReflectionUtil.classForName(ReflectionUtil.java:51)
    at py4j.reflection.TypeUtil.forName(TypeUtil.java:243)
    at py4j.commands.ReflectionCommand.getUnknownMember(ReflectionCommand.java:175)
    at py4j.commands.ReflectionCommand.execute(ReflectionCommand.java:87)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 10 more
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/omer/.local/lib/python3.10/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/omer/.local/lib/python3.10/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/omer/.local/lib/python3.10/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
Cell In[2], line 8
      5 from sparknlp.annotator import UniversalSentenceEncoder, SentimentDLModel
      6 from pyspark.ml import Pipeline
----> 8 document_assembler = DocumentAssembler() \
      9     .setInputCol("text") \
     10     .setOutputCol("document")
     12 use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \
     13     .setInputCols(["document"])\
     14     .setOutputCol("sentence_embeddings")
     16 classifier = SentimentDLModel().pretrained('sentimentdl_use_twitter')\
     17     .setInputCols(["sentence_embeddings"])\
     18     .setOutputCol("sentiment")

File ~/anaconda3/envs/projectEnv/lib/python3.10/site-packages/pyspark/__init__.py:139, in keyword_only.<locals>.wrapper(self, *args, **kwargs)
    137     raise TypeError("Method %s forces keyword arguments." % func.__name__)
    138 self._input_kwargs = kwargs
--> 139 return func(self, **kwargs)

File ~/.local/lib/python3.10/site-packages/sparknlp/base/document_assembler.py:96, in DocumentAssembler.__init__(self)
     94 @keyword_only
     95 def __init__(self):
---> 96     super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
     97     self._setDefault(outputCol="document", cleanupMode='disabled')

File ~/anaconda3/envs/projectEnv/lib/python3.10/site-packages/pyspark/__init__.py:139, in keyword_only.<locals>.wrapper(self, *args, **kwargs)
    137     raise TypeError("Method %s forces keyword arguments." % func.__name__)
    138 self._input_kwargs = kwargs
--> 139 return func(self, **kwargs)

File ~/.local/lib/python3.10/site-packages/sparknlp/internal/annotator_transformer.py:36, in AnnotatorTransformer.__init__(self, classname)
     34 self.setParams(**kwargs)
     35 self.__class__._java_class_name = classname
---> 36 self._java_obj = self._new_java_obj(classname, self.uid)

File ~/anaconda3/envs/projectEnv/lib/python3.10/site-packages/pyspark/ml/wrapper.py:84, in JavaWrapper._new_java_obj(java_class, *args)
     82 java_obj = _jvm()
     83 for name in java_class.split("."):
---> 84     java_obj = getattr(java_obj, name)
     85 java_args = [_py2java(sc, arg) for arg in args]
     86 return java_obj(*java_args)

File ~/.local/lib/python3.10/site-packages/py4j/java_gateway.py:1664, in JavaPackage.__getattr__(self, name)
   1661     return JavaClass(
   1662         answer[proto.CLASS_FQN_START:], self._gateway_client)
   1663 else:
-> 1664     raise Py4JError("{0} does not exist in the JVM".format(new_fqn))

Py4JError: com.johnsnowlabs.nlp.DocumentAssembler does not exist in the JVM

maziyarpanahi commented 2 months ago

As you can see, there is no issue with the library itself; running Apache Spark on Windows is just very tricky. I recommend first making sure everything works with PySpark alone. Your environment does not seem to be set up correctly for Spark.
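
One concrete thing to check: scala.collection.GenTraversableOnce exists in Scala 2.12 but was removed in Scala 2.13, so this trace usually indicates a Scala version mismatch; note that the PYSPARK_SUBMIT_ARGS above mixes _2.13 Kafka artifacts with the _2.12 spark-nlp artifact. A minimal sketch for inspecting what the running session was actually configured with:

# Print the artifacts the session was started with; every Scala suffix
# (_2.12 vs _2.13) must match the Scala build of the Spark distribution
# itself (check /opt/spark/jars for the scala-library-*.jar version).
print("Spark version:", spark.version)
conf = spark.sparkContext.getConf()
print("spark.jars.packages:", conf.get("spark.jars.packages", ""))
print("spark.jars:", conf.get("spark.jars", ""))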