JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0
3.87k stars 711 forks

Following up on #11555 #12245

Closed lewisshkang closed 2 years ago

lewisshkang commented 2 years ago

I'm trying to get Spark NLP to work in Google Cloud JupyterLab through Dataproc and having trouble loading the pipeline (specifically the DocumentAssembler). As you advised in #11555, I added the spark.jars.packages config, but the problem persists. Am I missing something?

Description

I am developing with Spark NLP on GCP and have installed it (spark-nlp==4.0.2).

Expected Behavior

Current Behavior

# Importing Spark NLP and PySpark packages
import sparknlp
from sparknlp.pretrained import PretrainedPipeline 
from sparknlp.base import *
from sparknlp.annotator import *

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.ml.linalg import *

# Initializing Spark session
spark = SparkSession.builder \
  .appName('Jupyter BigQuery Storage').config('gs://spark-lib/bigquery/spark-bigquery-latest.jar')\
  .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5,com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.15.1-beta")\
  .getOrCreate()

I also tried another jars config with the following, but still had no success.


import sparknlp
from sparknlp.pretrained import PretrainedPipeline 
from sparknlp.base import *
from sparknlp.annotator import *

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.ml.linalg import *

spark = SparkSession.builder \
  .appName('Jupyter BigQuery Storage').config('gs://spark-lib/bigquery/spark-bigquery-latest.jar')\
  .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2")\
  .getOrCreate()

spark.conf.set("spark.sql.repl.eagerEval.enabled",True)

documentAssembler = DocumentAssembler().\
                    setInputCol("title").\
                    setOutputCol("document").\
                    setCleanupMode("shrink")
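For reference, a minimal sketch of the alternative bootstrap via sparknlp.start(), which creates a SparkSession with the matching Spark NLP Maven package preconfigured (note this is only a sketch: on Dataproc a session is usually already running and will be reused instead):

import sparknlp

# sparknlp.start() builds (or reuses) a SparkSession with the Spark NLP Maven
# package set in spark.jars.packages; if a session already exists, as it usually
# does on Dataproc, that existing session is returned unchanged.
spark = sparknlp.start()
print(sparknlp.version())   # Spark NLP version
print(spark.version)        # Apache Spark version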

Possible Solution

Steps to Reproduce

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [4], in <cell line: 2>()
      1 # Defining document assembler
----> 2 documentAssembler = DocumentAssembler().\
      3                     setInputCol("title").\
      4                     setOutputCol("document").\
      5                     setCleanupMode("shrink")

File /usr/lib/spark/python/pyspark/__init__.py:114, in keyword_only.<locals>.wrapper(self, *args, **kwargs)
    112     raise TypeError("Method %s forces keyword arguments." % func.__name__)
    113 self._input_kwargs = kwargs
--> 114 return func(self, **kwargs)

File /opt/conda/miniconda3/lib/python3.8/site-packages/sparknlp/base/document_assembler.py:92, in DocumentAssembler.__init__(self)
     90 @keyword_only
     91 def __init__(self):
---> 92     super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
     93     self._setDefault(outputCol="document", cleanupMode='disabled')

File /usr/lib/spark/python/pyspark/__init__.py:114, in keyword_only.<locals>.wrapper(self, *args, **kwargs)
    112     raise TypeError("Method %s forces keyword arguments." % func.__name__)
    113 self._input_kwargs = kwargs
--> 114 return func(self, **kwargs)

File /opt/conda/miniconda3/lib/python3.8/site-packages/sparknlp/internal/annotator_transformer.py:33, in AnnotatorTransformer.__init__(self, classname)
     31 self.setParams(**kwargs)
     32 self.__class__._java_class_name = classname
---> 33 self._java_obj = self._new_java_obj(classname, self.uid)

File /usr/lib/spark/python/pyspark/ml/wrapper.py:66, in JavaWrapper._new_java_obj(java_class, *args)
     64     java_obj = getattr(java_obj, name)
     65 java_args = [_py2java(sc, arg) for arg in args]
---> 66 return java_obj(*java_args)

TypeError: 'JavaPackage' object is not callable

Context

Your Environment

maziyarpanahi commented 2 years ago

I am unfamiliar with Jupyter on GCP, but if you create a new notebook and just type spark and something comes back, your SparkSession is already up and running. (Whatever you do with SparkSession.builder, as you can see, ends in getOrCreate(), meaning your configs are ignored and the existing SparkSession is returned.) Just in case.

That being said, the correct Maven package is the following:

  .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2")\

And in case the machines you are using are not compatible with .ivy2, let's try this one (after kernel restart):

  .config("spark.jars", "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar")\

If it didn't like the https URL, you can also download the jar and put it on your GCP file system. Here is my question: what is the Apache Spark version (and Scala version) in your Jupyter notebook? (You might be loading the package, but it's just not compatible.)
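For example, a quick way to check this from the notebook (a sketch; the _jvm accessor is an internal PySpark handle):

import pyspark

print(spark.version)          # Apache Spark version of the running session
print(pyspark.__version__)    # PySpark package version

# Scala version the Spark build was compiled against (internal py4j handle)
print(spark.sparkContext._jvm.scala.util.Properties.versionString())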

lewisshkang commented 2 years ago

Thank you for the suggestions! My Apache Spark version is "3.1.3" from print(SparkContext.version).

I tried both configs (the correct Maven package and the other one you suggested) and they do not seem to solve the issue. I guess I will try downloading the jar and putting it on my GCP file system, as you suggested?

maziyarpanahi commented 2 years ago

If you can leave a link to this specific service on GCP, I can look it up to see what the best way is to add third-party (external) PyPI and Maven packages to the Apache Spark cluster. (It seems to me the Spark configs are set somewhere else, such as the number of executors, the memory of each, etc., and that is also the place to include anything that needs to be installed via pip or spark.jars.packages.)

lewisshkang commented 2 years ago

Thank you so much for the help. I spent lots of time reading your other related answers (#232, #1220).

I dug deeper into your suggestion of trying .config("spark.jars", "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar").

It took a very long time to run in JupyterLab on GCP, but I could just manually download the jar from https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar. Do you know where I could place the downloaded file to solve the 'JavaPackage' object is not callable issue?

For example, your answer in #232 was to find the path where "sparknlp.jar" is located, with the example .config("spark.jars", "/Users/maziyar/anaconda3/envs/spark/lib/python3.6/site-packages/sparknlp/lib/sparknlp.jar").

In my sparknlp installation there is no sparknlp.jar. The path I see is "C:\Users\xxx\anaconda3\Lib\site-packages\sparknlp", and in the sparknlp folder there is no "lib" folder, only the following folders: {annotator, base, common, internal, logging, pretrained, training, __pycache__}.

Do you see any clue on troubleshooting my problem?

Thank you!

maziyarpanahi commented 2 years ago

I think the first question is: did spark.jars pointing to the https FAT Jar solve the issue (even if it took a long time to download it)?

You can download that file from the link I gave you. https://github.com/JohnSnowLabs/spark-nlp/issues/232 is very old; we now list all the FAT Jars in each release notes, so the names are there and the jars are hosted on S3. (You can download them and keep them anywhere your GCP has access to; downloading from GCP should be pretty quick since the jars are on S3 and both have good bandwidth.)
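A minimal sketch of that approach, assuming the downloaded FAT Jar has been uploaded to a hypothetical bucket path, and that no SparkSession is already running (otherwise getOrCreate() would ignore these configs):

from pyspark.sql import SparkSession

# gs://your-bucket/jars/... is a placeholder; point it at wherever the FAT Jar was uploaded.
spark = SparkSession.builder \
    .appName("spark-nlp-from-gcs") \
    .config("spark.jars", "gs://your-bucket/jars/spark-nlp-assembly-4.0.2.jar") \
    .getOrCreate()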

If you still see 'JavaPackage' object is not callable, then the JAR/Maven package must be loaded into the SparkSession differently. I would suggest providing:

These have to be set before Jupyter launches a Spark job, I assume, so you can just provide those, and I will try to reproduce it on our GCP. (We use GCP this way without any issue: https://github.com/JohnSnowLabs/spark-nlp#gcp-dataproc)

lewisshkang commented 2 years ago

Thank you for helping me again. Sadly, spark.jars pointing to the https FAT Jar does not solve the issue.

I checked the full Spark configuration by running

configurations = spark.sparkContext.getConf().getAll()
for item in configurations:
    print(item)

And got the following:

('spark.eventLog.enabled', 'true')
('spark.dynamicAllocation.minExecutors', '1')
('spark.sql.warehouse.dir', 'file:/spark-warehouse')
('spark.yarn.am.memory', '640m')
('spark.app.id', 'application_1660898042306_0002')
('spark.executor.cores', '4')
('spark.eventLog.dir', 'gs://dataproc-temp-europe-west2-739705788437-xbdbgl60/a9bb2ef5-fe18-4593-b271-66e65ff9e951/spark-job-history')
('spark.executor.memory', '12022m')
('spark.executor.instances', '2')
('spark.sql.autoBroadcastJoinThreshold', '90m')
('spark.serializer.objectStreamReset', '100')
('spark.yarn.unmanagedAM.enabled', 'true')
('spark.submit.deployMode', 'client')
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS', 'ap-main-m')
('spark.extraListeners', 'com.google.cloud.spark.performance.DataprocMetricsListener')
('spark.ui.filters', 'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter')
('spark.sql.cbo.joinReorder.enabled', 'true')
('spark.driver.maxResultSize', '1920m')
('spark.shuffle.service.enabled', 'true')
('spark.metrics.namespace', 'app_name:${spark.app.name}.app_id:${spark.app.id}')
('spark.scheduler.mode', 'FAIR')
('spark.sql.adaptive.enabled', 'true')
('spark.yarn.jars', 'local:/usr/lib/spark/jars/*')
('spark.scheduler.minRegisteredResourcesRatio', '0.0')
('spark.executor.id', 'driver')
('spark.hadoop.hive.execution.engine', 'mr')
('spark.app.name', 'PySparkShell')
('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')
('spark.dynamicAllocation.maxExecutors', '10000')
('spark.master', 'yarn')
('spark.ui.port', '0')
('spark.sql.catalogImplementation', 'hive')
('spark.rpc.message.maxSize', '512')
('spark.rdd.compress', 'True')
('spark.ui.proxyBase', '/proxy/application_1660898042306_0002')
('spark.submit.pyFiles', '')
('spark.driver.memory', '3840m')
('spark.dynamicAllocation.enabled', 'true')
('spark.executorEnv.PYTHONPATH', '/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip:/usr/lib/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip')
('spark.yarn.isPython', 'true')
('spark.executorEnv.OPENBLAS_NUM_THREADS', '1')
('spark.driver.port', '40615')
('spark.app.startTime', '1660898358292')
('spark.ui.showConsoleProgress', 'true')
('spark.sql.cbo.enabled', 'true')

I checked the cluster config on GCP and got the following:

Region: europe-west2
Zone: europe-west2-c
Autoscaling: Off
Dataproc Metastore: None
Scheduled deletion: Off
Master node: Standard (1 master, N workers)
  Machine type: n1-standard-4
  Number of GPUs: 0
  Primary disk type:  pd-standard
  Primary disk size:  1000GB
  Local SSDs: 0
Worker nodes:  4
  Machine type: n1-standard-8
  Number of GPUs: 0
  Primary disk type: pd-standard
  Primary disk size: 1000GB
  Local SSDs: 0
Secondary worker nodes: 0
Secure Boot: Disabled
VTPM: Disabled
Integrity Monitoring: Disabled
Network: default
Network tags: None
Internal IP only: No
Image version : 2.0.45-debian10
Optional components: JUPYTER
Metadata: **PIP_PACKAGES**
_google-cloud-storage spark-nlp==3.4.0_
Advanced security: Disabled
Encryption type: Google-managed key

I also tried to follow the link you provided, https://github.com/JohnSnowLabs/spark-nlp#gcp-dataproc. As I already have a cluster, step 1 is not applicable. May I ask how I should modify the following gcloud command to install Spark NLP on the existing cluster (I guess that is what you meant)?

gcloud dataproc clusters create ${CLUSTER_NAME} \
  --region=${REGION} \
  --zone=${ZONE} \
  --image-version=2.0 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-2 \
  --master-boot-disk-size=128GB \
  --worker-boot-disk-size=128GB \
  --num-workers=2 \
  --bucket=${BUCKET_NAME} \
  --optional-components=JUPYTER \
  --enable-component-gateway \
  --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
  --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2

Finally, I think this issue is universal, not just in my case. Have you tried to reproduce it on your GCP recently? I am asking because we did not have this "'JavaPackage' object is not callable" issue a few months ago, so I suspect that recent changes from John Snow Labs are clashing with some other packages. Maybe if you use GCP right now, you will see the same error as I do.

Thank you!

danilojsl commented 2 years ago

Hi @lewisshkang, the issue is not universal. I have a GCP cluster with spark-nlp running without any issue, either with a FAT jar or with Maven coordinates. Can you check your cluster properties? It should have something like this:

Using a FAT Jar

properties:
spark:spark.jars: gs://my-bucket/jars/sparknlp.jar

Using Maven coordinates

properties:
spark:spark.jars.packages: com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2

Based on your configuration your cluster does not have any spark.jars or spark.jars.packages properties, thus the error.

I'm not sure how to modify that on the fly. As far as I know, a GCP cluster creates a Spark session with all the properties when the cluster starts, and you cannot modify the Spark session afterwards. So I think it is very likely you will need to ask your administrator to add one of the properties detailed above and restart the cluster.
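A quick way to confirm this from the notebook (a diagnostic sketch only; these properties normally come from the cluster configuration on Dataproc):

# Print the jar-related settings of the running session; on a correctly
# configured cluster at least one of these should mention Spark NLP.
conf = spark.sparkContext.getConf()
for key in ("spark.jars", "spark.jars.packages"):
    print(key, "=", conf.get(key, "<not set>"))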

lewisshkang commented 2 years ago

Thank you for helping me. Indeed, I checked the cluster properties, and I don't see any FAT Jar or Maven package:

spark-env:SPARK_DAEMON_MEMORY: 3840m
spark:spark.driver.maxResultSize: 1920m
spark:spark.driver.memory: 3840m
spark:spark.executor.cores: 4
spark:spark.executor.instances: 2
spark:spark.executor.memory: 12022m
spark:spark.executorEnv.OPENBLAS_NUM_THREADS: 1
spark:spark.extraListeners: com.google.cloud.spark.performance.DataprocMetricsListener
spark:spark.scheduler.mode: FAIR
spark:spark.sql.cbo.enabled: true
spark:spark.ui.port: 0
spark:spark.yarn.am.memory: 640m

May I ask how I can add the FAT Jar or Maven properties to the cluster? That way, I can ask for the temporary administrator status or pass this thread to the administrator (owner) with instructions.

Thank you always!

danilojsl commented 2 years ago

Each cluster has a YAML configuration file with all the required setup. You could ask your administrator to add the FAT Jar or Maven configuration under properties in that YAML file, just as I showed in the message above.

lewisshkang commented 2 years ago

Thank you! I just received the temporary status as administrator (owner) to troubleshoot the issue.

Sorry to ask you another question, but I am stuck on adding the FAT Jar (or Maven package) to the properties. I tried several tweaks, but none of them work. What should I type in the "Submit a job" part of Dataproc? I pasted what you told me (spark:spark.jars: gs://my-bucket/jars/sparknlp.jar) and also just (gs://my-bucket/jars/sparknlp.jar), but I get the errors "Illegal character in opaque part at index 17: spark:spark.jars: gs://my-bucket/jars/sparknlp.jar" and "Error accessing gs://my-bucket/jars/sparknlp.jar".

spark:spark.jars: gs://my-bucket/jars/sparknlp.jar

[image]

danilojsl commented 2 years ago

@lewisshkang, gs://my-bucket/jars/sparknlp.jar is a dummy path for a GCP bucket. So, you have to do the following:

FAT Jar:

  1. Download the spark-nlp FAT Jar
  2. Upload the FAT Jar to a GCP bucket that your account and your cluster have access to
  3. Set spark:spark.jars as Key1 and the bucket location of the FAT Jar (from step 2) as Value1 in the Properties section (check the image)

Maven:

  1. Set spark:spark.jars.packages as Key1 and com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2 as Value1 in the Properties section (check the image)

lewisshkang commented 2 years ago

Thank you again. Since @maziyarpanahi mentioned that downloading the FAT Jar manually is not ideal, as it will not update automatically, I guess creating a new cluster is the better solution, right? I tried creating a new cluster that connects to the same bucket, and it works well.