JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0
3.84k stars 711 forks

Using Fat Jars behind company's firewall not viable. #2744

Closed Octavian-act closed 3 years ago

Octavian-act commented 3 years ago

Description

I have started this conversation:

https://spark-nlp.slack.com/archives/CA118BWRM/p1617225602087300

and based on the response, I tried the fat JARs on my work laptop. Using the fat JARs, it did move past the session-start step, but it failed at sentence detection, and there are big differences between spark-nlp 2.7.x and 3.0.x, as detailed below:

1.1. On Spark NLP version 2.7.5 I get a timeout when the company's VPN is enabled (on my work macOS laptop):

```
spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-2.7.5.jar")\
    .getOrCreate()

spark
```

Apache Spark version: 2.4.4, Spark NLP version: 2.7.5

```
sentence_detector_dl download started this may take some time.
```

```
Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>
      1 sentencerDL = SentenceDetectorDLModel\
----> 2     .pretrained("sentence_detector_dl", "en") \
      3     .setInputCols(["document"]) \
      4     .setOutputCol("sentences")

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
   3095     def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
   3096         from sparknlp.pretrained import ResourceDownloader
-> 3097         return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     31         print(name + " download started this may take some time.")
---> 32         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
     33         if file_size == "-1":
     34             print("Can not find the model to download please check the name!")
...
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
```
```
: com.amazonawsShadedAmazonClientException: Unable to execute HTTP request: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
	at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:454)
	at com.amazonawsShadedhttp.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonawsShadedservices.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
	at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
	...
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.httpShadedconn.ConnectTimeoutException: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
	at org.apache.httpShadedconn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:551)
	at org.apache.httpShadedimpl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
	...
	at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	... 21 more
```

1.2. However, once I disable the company's VPN, the above call to SentenceDetectorDLModel works!

2.1. Using Spark NLP version 3.0.1 I get a NullPointerException back:

```
spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
    .getOrCreate()

spark
```

Apache Spark version: 3.1.1, Spark NLP version: 3.0.1

```
sentence_detector_dl download started this may take some time.
```
```
Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>
      1 sentencerDL = SentenceDetectorDLModel\
----> 2     .pretrained("sentence_detector_dl", "en") \
      3     .setInputCols(["document"]) \
      4     .setOutputCol("sentences")
...
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NullPointerException
	at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsernameEnvironment(ClientConfiguration.java:874)
	at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsername(ClientConfiguration.java:902)
	at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.getProxyUsername(HttpClientSettings.java:90)
	at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.isAuthenticatedProxy(HttpClientSettings.java:182)
	at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.addProxyConfig(ApacheHttpClientFactory.java:96)
	at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:75)
	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:617)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client$lzycompute(S3ResourceDownloader.scala:45)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client(S3ResourceDownloader.scala:36)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
	...
	at java.lang.Thread.run(Thread.java:748)
```

2.2. If I disable the company's VPN, I get the same NullPointerException as above in 2.1.

## Expected Behavior

I would like to use your code behind the company's firewall, and more importantly from AWS SageMaker. I test it first on my work laptop, so I would like to have it working there as well.

## Current Behavior

Not working. I got a temporary Healthcare license, which expires in a couple of days, and so far I have not been able to run any of your code behind the company's firewall.
So, setting up the spark-nlp session using the fat JARs, it fails the moment I use a pretrained model such as:

```
sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")
```

## Possible Solution

I like the idea of using fat JARs, but I need them to be functional.

## Steps to Reproduce

Tested on my work macOS Catalina (latest version) using the installation instructions at https://nlp.johnsnowlabs.com/docs/en/install#python, for both:

```
$ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
$ pip install spark-nlp==3.0.1 pyspark==3.1.1
$ pip install jupyter
$ jupyter notebook
```

and

```
$ java -version
$ conda create -n spark-nlp python=3.7 -y
$ conda activate spark-nlp
$ pip install spark-nlp==2.7.5 pyspark==2.4.4
$ pip install jupyter
$ jupyter notebook
```

I pretty much follow the code from https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb#scrollTo=KvNuyGXpD7Nt but using the fat JARs instead:

```
spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
    .getOrCreate()
```

and the moment I hit this code:

```
sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")
```

I get the errors above (a NullPointerException for spark-nlp 3.0.x, and a timeout for spark-nlp 2.7.x).

## Your Environment

* Spark NLP version `sparknlp.version()`: 3.0.1
* Apache Spark version `spark.version`: 3.1.1
* Java version `java -version`: openjdk version "1.8.0_282", OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00), OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
* Conda: latest release
* Operating System and version: macOS Catalina, latest release
maziyarpanahi commented 3 years ago

Hi,

As mentioned in the conversation, this is not an issue we can ever solve. If you don't have access to the Internet, whether you are behind a proxy or a firewall, you cannot use the Maven coordinates, since that requires downloading dependencies like any other Maven package; so you use the fat JAR instead. As I mentioned, you also need access to S3 if you are using pretrained(), because it needs to download the models/pipelines. If you don't have that access, you need to download the models, extract them, and use .load() instead.

We have many users in Healthcare who must work in air-gapped environments with zero access to the Internet. They go with the fat JAR and offline .load().

Either choose the fat JAR and manually download/load the models, or please use Google Colab/Kaggle for testing. You cannot download via pretrained() if you don't have any access, and we cannot do anything about that. (Either go fully offline, or ask your admin to whitelist some endpoints for you.)
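For reference, the offline route can be sketched with nothing but the Python standard library. The URL and paths below are placeholders — the real download link is on each model's page at https://nlp.johnsnowlabs.com/models:

```python
import os
import zipfile
import urllib.request


def fetch_and_extract(url, dest_dir):
    """Download a model archive and unpack it so it can be used with offline .load()."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    # Run the download somewhere that can reach the model link (or copy the zip over manually).
    urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
    return dest_dir


# Hypothetical usage -- the real URL comes from the model's page:
# model_dir = fetch_and_extract("https://.../sentence_detector_dl_en.zip", "/tmp/sentence_detector_dl")
# sentencerDL = SentenceDetectorDLModel.load(model_dir)   # instead of .pretrained()
```

After extraction, `.load()` takes the local directory, so nothing reaches out to S3 at runtime.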

```
: com.amazonawsShadedAmazonClientException: Unable to execute HTTP request: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
```

Octavian-act commented 3 years ago

I understand, but please read carefully: you do have a bug. The latest release, 3.0.x, does not work; I get a NullPointerException regardless of whether my VPN is enabled or disabled (see 2.2). With 2.7.5 it works: it does download the pretrained model once I disable the VPN, but that is not the case for your latest release. (I pasted 3.0.1 above because that was the latest trial, but I tried 3.0.0 as well.)

maziyarpanahi commented 3 years ago

Fair enough. But locally and in Colab I cannot reproduce this, and I may not have read the whole post correctly since it's not formatted correctly:

Here is an example of what you are doing: https://colab.research.google.com/drive/1Jd6prxgvZdVORLTn0u-VfPTS3Ex_rxWt?usp=sharing

As you can see, it works just fine; nothing has changed from 2.7.x to 3.x regarding pretrained models and our fat JARs. The only thing I can find in there is that you keep referring to the following together:

```
pip install spark-nlp==2.7.5 pyspark==2.4.4
```

and

```
.config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")
```

If you are using spark-nlp-assembly-3.0.1.jar you must install pyspark 3.x, and since the JAR is 3.0.1 you also have to install spark-nlp==3.0.1. Maybe it's a copy/paste mistake, but make sure you have a clean environment, with pyspark properly upgraded to 3.x and spark-nlp properly upgraded to 3.0.1, before using spark-nlp-assembly-3.0.1.jar.
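As a rough illustration of that rule (the helper below and its name are just a sketch, not part of Spark NLP):

```python
import re


def check_versions(jar_path, installed_sparknlp, installed_pyspark):
    """Flag mismatches between the fat JAR, the spark-nlp package, and pyspark."""
    m = re.search(r"spark-nlp-assembly-(\d+\.\d+\.\d+)\.jar$", jar_path)
    jar_version = m.group(1) if m else None
    problems = []
    # The Python package must match the JAR version exactly.
    if jar_version != installed_sparknlp:
        problems.append("JAR is %s but the spark-nlp package is %s" % (jar_version, installed_sparknlp))
    # A 3.x fat JAR requires pyspark 3.x.
    if jar_version and jar_version.startswith("3.") and not installed_pyspark.startswith("3."):
        problems.append("a 3.x fat JAR needs pyspark 3.x, found %s" % installed_pyspark)
    return problems


# check_versions("/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar", "3.0.1", "3.1.1")  -> []
# check_versions("/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar", "2.7.5", "2.4.4")  -> two problems
```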

Octavian-act commented 3 years ago

I have two separate clean environments for the 2.7.5 and 3.0.1 versions, so there is no mixture there. Following exactly your example above (https://colab.research.google.com/drive/1Jd6prxgvZdVORLTn0u-VfPTS3Ex_rxWt?usp=sharing), I still get the NullPointerException on my macOS:

```
! java -version
! echo $JAVA_HOME
! which java
! SPARKHOME="/Users/filotio/Downloads/spark-3.1.1-bin-hadoop2.7"
! export SPARK_HOME=$SPARKHOME
```

getting:

```
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00)
OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
/Library/Java/JavaVirtualMachines/openjdk-8.jdk/Contents/Home
/usr/local/opt/openjdk@8/bin/java
```

```
import sparknlp
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
    .getOrCreate()

print("Apache Spark version:", spark.version)
print("Spark NLP version", sparknlp.version())
spark
```

getting:

```
Apache Spark version: 3.1.1
Spark NLP version 3.0.1

SparkSession - in-memory
SparkContext
Spark UI
Version: v3.1.1
Master: local[*]
AppName: Spark NLP
```

and now when:

```
from sparknlp.base import *
from sparknlp.annotator import *

sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")
```

getting:

```
sentence_detector_dl download started this may take some time.
```


```
Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>
      4
      5 sentencerDL = SentenceDetectorDLModel\
----> 6     .pretrained("sentence_detector_dl", "en") \
      7     .setInputCols(["document"]) \
      8     .setOutputCol("sentences")
...
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NullPointerException
	at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsernameEnvironment(ClientConfiguration.java:874)
	at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsername(ClientConfiguration.java:902)
	at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.getProxyUsername(HttpClientSettings.java:90)
	at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.isAuthenticatedProxy(HttpClientSettings.java:182)
	...
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client$lzycompute(S3ResourceDownloader.scala:45)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
	...
	at java.lang.Thread.run(Thread.java:748)
```

So, macOS is not a friend of your product :-)
maziyarpanahi commented 3 years ago

I only shared Colab because it's the only environment that is the same for anyone in the entire world. Also, it's the only thing I can share to show you there is no issue with the fat JAR and pretrained models. My primary laptop is a MacBook Pro, and I have been using macOS for the last 10 years for all of my projects. (Also, macOS is Unix-based, so technically what works on Linux works on macOS; Windows is not a friendly environment when it comes to Spark.)

Anyway, I couldn't reproduce this issue on Ubuntu 16, 18, or 20, macOS Big Sur, or Windows 10! Please let me know if I missed anything. If I can find a firewall or a proxy I will try that, but I could only suggest:

These are obviously not a final solution, but they help zero in on the actual issue (regardless of who should fix it) so I can reproduce it. A fat JAR plus .pretrained() has simply never been an issue in any environment.

maziyarpanahi commented 3 years ago

One more thing: in 2.7.5 the aws-java-sdk was 1.7.4, which was really old. In 3.0.1 we are using 1.11.603, which might have some bugs when used behind a proxy.

If you are interested, I can make a Fat JAR with a newer aws-java-sdk and let you test it. Would you be interested?

Octavian-act commented 3 years ago

Ohh nice, yes, I am interested; if you can build that, let me know. As you can imagine, my work macOS has extra protections compared to a "normal" one, and since the DevOps team was slow in enabling Spark NLP on AWS, I am trying to test locally and help them :-)

maziyarpanahi commented 3 years ago

Very nice. Since you have an environment that can reproduce this error, I'll send you a fat JAR to test on Slack. I think the issue is this: https://github.com/aws/aws-sdk-java/issues/2070 (apparently it happened only in 1.11.603). It was also reported here a while back: https://github.com/JohnSnowLabs/spark-nlp/issues/1174

Octavian-act commented 3 years ago

Yes, that most likely is the issue; we do set those proxy environment variables. I do not have any issues on my private macOS, which I just tried (it was harder to install Java 8) :-)
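If it helps, one thing I can try in the meantime is clearing the proxy variables just for the notebook process before building the session, and restoring them afterwards — a guess based on the linked SDK issue, not a confirmed fix:

```python
import os

# Common proxy variables; adjust for whatever the company image actually sets.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY", "no_proxy", "NO_PROXY")


def clear_proxy_env():
    """Remove proxy variables from this process; returns what was removed."""
    return {v: os.environ.pop(v) for v in tuple(PROXY_VARS) if v in os.environ}


def restore_proxy_env(saved):
    """Put the removed proxy variables back."""
    os.environ.update(saved)


# saved = clear_proxy_env()     # before SparkSession.builder...getOrCreate()
# restore_proxy_env(saved)      # afterwards, if the proxy is needed again
```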

Octavian-act commented 3 years ago

@maziyarpanahi asking here (Slack is unavailable behind the proxy :-)): regarding https://nlp.johnsnowlabs.com/docs/en/models, how can I download the models for offline use, and from where? Many thanks

maziyarpanahi commented 3 years ago

Oh, I sent you the fat JAR via Slack. I'll put it on S3 and share the link. All of our models/pipelines are here: https://nlp.johnsnowlabs.com/models

They all have examples and a download link inside.

Octavian-act commented 3 years ago

Tried on AWS: not working, and I cannot change the Spark NLP version:

1. Installation:

```
! sudo yum install java-1.8.0-openjdk
```

```
import os
os.environ["JAVA_HOME"] = "/home/ec2-user/anaconda3/envs/JupyterSystemEnv"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

! java -version
! echo $JAVA_HOME
! which java

! SPARKHOME="/home/ec2-user/SageMaker/spark-3.1.1-bin-hadoop2.7"
! export SPARK_HOME=$SPARKHOME

! pip install --upgrade pyspark==3.1.1 spark-nlp==3.0.1 findspark
! pip install --upgrade spark-nlp==3.0.1
! pip install --upgrade findspark
```

response:

```
Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper, versionlock
amzn-main    | 2.1 kB  00:00
amzn-updates | 3.8 kB  00:00
Package 1:java-1.8.0-openjdk-1.8.0.282.b08-1.61.amzn1.x86_64 already installed and latest version
Nothing to do
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (Zulu 8.48.0.53-CA-linux64) (build 1.8.0_265-b11)
OpenJDK 64-Bit Server VM (Zulu 8.48.0.53-CA-linux64) (build 25.265-b11, mixed mode)
/home/ec2-user/anaconda3/envs/JupyterSystemEnv
/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin/java
Requirement already satisfied: pyspark==3.1.1 in /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages (3.1.1)
Requirement already satisfied: py4j==0.10.9 in /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages (from pyspark==3.1.1) (0.10.9)
Requirement already satisfied: spark-nlp==3.0.1 in /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages (3.0.1)
Requirement already satisfied: findspark in /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages (1.4.2)
```

And now for:

    from pyspark.sql import SparkSession
    import sparknlp

    ! ls -ltr /home/ec2-user/SageMaker/spark-nlp-assembly-aws-fix.jar
    print("Spark NLP version", sparknlp.version())

    spark = SparkSession.builder\
        .appName("Spark NLP")\
        .master("local[*]")\
        .config("spark.driver.memory","16G")\
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.kryoserializer.buffer.max", "2000M")\
        .config("spark.jars", "/home/ec2-user/SageMaker/spark-nlp-assembly-aws-fix.jar")\
        .getOrCreate()

    print("Apache Spark version:", spark.version)
    spark

I am getting Spark NLP version 2.7.4 and cannot figure out why:

    -rw-rw-r-- 1 ec2-user ec2-user 453870407 Apr 9 22:19 /home/ec2-user/SageMaker/spark-nlp-assembly-aws-fix.jar
    Spark NLP version 2.7.4
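One plausible reading (my interpretation, not confirmed in the thread): `sparknlp.version()` reports the version of the Python wheel installed in the notebook kernel's environment, not the version of the JAR passed via `spark.jars`, so an old 2.7.4 wheel in the active kernel would print exactly this even with a 3.0.x fat JAR. A small hypothetical helper for sanity-checking that the two line up:

```python
def versions_match(wheel_version: str, jar_version: str) -> bool:
    """Return True when the Python wheel and the assembly JAR agree on
    major.minor; a mismatch (e.g. wheel 2.7.4 with a 3.0.x JAR) is a
    common source of confusing failures. Helper name is hypothetical."""
    wheel = tuple(int(p) for p in wheel_version.split(".")[:2])
    jar = tuple(int(p) for p in jar_version.split(".")[:2])
    return wheel == jar

print(versions_match("2.7.4", "3.0.1"))  # the situation in this thread -> False
```

If this returns False, reinstalling the `spark-nlp` wheel into the environment the kernel actually uses would be the first thing to try.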


    Exception                                 Traceback (most recent call last)
    in
         11     .config("spark.driver.maxResultSize", "0") \
         12     .config("spark.kryoserializer.buffer.max", "2000M")\
    ---> 13     .config("spark.jars", "/home/ec2-user/SageMaker/spark-nlp-assembly-aws-fix.jar")\
         14     .getOrCreate()
         15 # .config("spark.jars", "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.0.1.jar")\

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/sql/session.py in getOrCreate(self)
        171         for key, value in self._options.items():
        172             sparkConf.set(key, value)
    --> 173         sc = SparkContext.getOrCreate(sparkConf)
        174         # This SparkContext may be an existing one.
        175         for key, value in self._options.items():

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/context.py in getOrCreate(cls, conf)
        361         with SparkContext._lock:
        362             if SparkContext._active_spark_context is None:
    --> 363                 SparkContext(conf=conf or SparkConf())
        364             return SparkContext._active_spark_context
        365

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
        127                 " note this option will be removed in Spark 3.0")
        128
    --> 129         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
        130         try:
        131             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
        310         with SparkContext._lock:
        311             if not SparkContext._gateway:
    --> 312                 SparkContext._gateway = gateway or launch_gateway(conf)
        313                 SparkContext._jvm = SparkContext._gateway.jvm
        314

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/java_gateway.py in launch_gateway(conf)
         44     :return: a JVM gateway
         45     """
    ---> 46     return _launch_gateway(conf)
         47
         48

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/java_gateway.py in _launch_gateway(conf, insecure)
        106
        107     if not os.path.isfile(conn_info_file):
    --> 108         raise Exception("Java gateway process exited before sending its port number")
        109
        110     with open(conn_info_file, "rb") as info:

    Exception: Java gateway process exited before sending its port number
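The "Java gateway process exited before sending its port number" error generally means PySpark could not launch a working JVM at all; a frequent culprit is a `JAVA_HOME` that does not actually contain a Java installation. A quick heuristic check one could run first (the helper is mine, not part of PySpark):

```python
import os

def java_home_looks_valid(java_home: str) -> bool:
    """Heuristic: a usable JAVA_HOME must contain bin/java. In this
    thread JAVA_HOME is pointed at a conda env root, which only works
    if that env really ships a JDK under bin/."""
    return os.path.isfile(os.path.join(java_home, "bin", "java"))
```

If this returns False for the configured path, the gateway process will die before PySpark can read its port.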
Octavian-act commented 3 years ago

For my work macos:

When calling:

    from sparknlp.base import *
    from sparknlp.annotator import *

    sentencerDL = SentenceDetectorDLModel.load("/Users/filotio/Documents/BMS/Vendors/JSL/Models/sentence_detector_dl_en_2.7.0_2.4_1609611052663")\
        .setInputCols(["document"])\
        .setOutputCol("sentences")

I am getting back:

    ERROR:root:Exception while sending command.
    Traceback (most recent call last):
      File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1207, in send_command
        raise Py4JNetworkError("Answer from Java side is empty")
    py4j.protocol.Py4JNetworkError: Answer from Java side is empty

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1033, in send_command
        response = connection.send_command(command)
      File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1212, in send_command
        "Error while receiving", e, proto.ERROR_ON_RECEIVE)
    py4j.protocol.Py4JNetworkError: Error while receiving

    ---------------------------------------------------------------------------
    Py4JError                                 Traceback (most recent call last)
    in
    ----> 9 sentencerDL = SentenceDetectorDLModel.load("/Users/filotio/Documents/BMS/Vendors/JSL/Models/sentence_detector_dl_en_2.7.0_2.4_1609611052663")\
         10     .setInputCols(["document"])\
         11     .setOutputCol("sentences")

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/ml/util.py in load(cls, path)
        330     def load(cls, path):
        331         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
    --> 332         return cls.read().load(path)
        333
        334

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/ml/util.py in load(self, path)
        280         if not isinstance(path, str):
        281             raise TypeError("path should be a string, got type %s" % type(path))
    --> 282         java_obj = self._jread.load(path)
        283         if not hasattr(self._clazz, "_from_java"):
        284             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
       1303         answer = self.gateway_client.send_command(command)
       1304         return_value = get_return_value(
    -> 1305             answer, self.gateway_client, self.target_id, self.name)
       1306
       1307         for temp_arg in temp_args:

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
        109     def deco(*a, **kw):
        110         try:
    --> 111             return f(*a, **kw)
        112         except py4j.protocol.Py4JJavaError as e:
        113             converted = convert_exception(e.java_exception)

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        334             raise Py4JError(
        335                 "An error occurred while calling {0}{1}{2}".
    --> 336                 format(target_id, ".", name))
        337     else:
        338         type = answer[1]

    Py4JError: An error occurred while calling o78.load
maziyarpanahi commented 3 years ago

Hi, let's focus on one issue at a time. The first one is AWS, the second one is `.load` (we can get to these later).

Could you please try the custom fat JAR to see if the initially reported error has been solved, using `pretrained()` like it used to work in 2.7.5?
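Worth noting for the firewall theme of this issue (my suggestion, not something the thread confirms): the fat JAR only bundles the library itself, so `pretrained()` still downloads models from S3 at run time, and behind a corporate proxy the driver JVM may need the standard Java proxy properties. A sketch of building them, with a placeholder host:

```python
def proxy_java_options(host: str, port: int) -> str:
    """Assemble the standard JVM proxy properties (http/https) as a
    single extraJavaOptions string; host and port are placeholders."""
    return (
        f"-Dhttp.proxyHost={host} -Dhttp.proxyPort={port} "
        f"-Dhttps.proxyHost={host} -Dhttps.proxyPort={port}"
    )

# Would be passed when building the session, e.g.:
#   .config("spark.driver.extraJavaOptions", proxy_java_options("proxy.example.com", 8080))
print(proxy_java_options("proxy.example.com", 8080))
```

Whether the corporate proxy also requires authentication or TLS interception certificates is a separate question that this sketch does not cover.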

Octavian-act commented 3 years ago

When using the Conda base env (Python 3.8.5) instead of the sparknlp env (Python 3.7.10), it fails in:

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-aws-fix.jar")\
    .getOrCreate()

Using the sparknlp conda env, which has:

openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00)
OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
Spark NLP version 3.0.1
Apache Spark version: 3.1.1

When executing:

sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

I get back:

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ \ ]

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1207, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1033, in send_command
    response = connection.send_command(command)
  File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1212, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

[OK!]

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-3-0ddcc01e0819> in <module>
      4 
      5 sentencerDL = SentenceDetectorDLModel\
----> 6     .pretrained("sentence_detector_dl", "en") \
      7     .setInputCols(["document"]) \
      8     .setOutputCol("sentences")

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
   3107     def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
   3108         from sparknlp.pretrained import ResourceDownloader
-> 3109         return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)
   3110 
   3111 

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     39             t1.start()
     40             try:
---> 41                 j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
     42             finally:
     43                 stop_threads = True

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, reader, name, language, remote_loc, validator)
    174 class _DownloadModel(ExtendedJavaWrapper):
    175     def __init__(self, reader, name, language, remote_loc, validator):
--> 176         super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained."+validator+".downloadModel", reader, name, language, remote_loc)
    177 
    178 

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
    127         super(ExtendedJavaWrapper, self).__init__(java_obj)
    128         self.sc = SparkContext._active_spark_context
--> 129         self._java_obj = self.new_java_obj(java_obj, *args)
    130         self.java_obj = self._java_obj
    131 

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
    137 
    138     def new_java_obj(self, java_class, *args):
--> 139         return self._new_java_obj(java_class, *args)
    140 
    141     def new_java_array(self, pylist, java_class):

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     64             java_obj = getattr(java_obj, name)
     65         java_args = [_py2java(sc, arg) for arg in args]
---> 66         return java_obj(*java_args)
     67 
     68     @staticmethod

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    334             raise Py4JError(
    335                 "An error occurred while calling {0}{1}{2}".
--> 336                 format(target_id, ".", name))
    337     else:
    338         type = answer[1]

Py4JError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel

And from the terminal I see:

ERROR:asyncio:Exception in callback <TaskWakeupMethWrapper object at 0x7f8007a48510>(<Future finis...b06"\r\n\r\n'>)
handle: <Handle <TaskWakeupMethWrapper object at 0x7f8007a48510>(<Future finis...b06"\r\n\r\n'>)>
Traceback (most recent call last):
  File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: Cannot enter into task <Task pending coro=<HTTP1ServerConnection._server_request_loop() running at /Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/tornado/http1connection.py:823> wait_for=<Future finished result=b'GET /kernel...ab06"\r\n\r\n'> cb=[IOLoop.add_future.<locals>.<lambda>() at /Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/tornado/ioloop.py:688]> while another task <Task pending coro=<MultiKernelManager._async_start_kernel() running at /Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py:214>> is being executed.
ERROR:asyncio:Exception in callback <TaskWakeupMethWrapper object at 0x7f8007a2d250>(<Future finis...b06"\r\n\r\n'>)
handle: <Handle <TaskWakeupMethWrapper object at 0x7f8007a2d250>(<Future finis...b06"\r\n\r\n'>)>
Traceback (most recent call last):
  File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)

And I get a core dump, attached here: hs_err_pid2416.log

maziyarpanahi commented 3 years ago

I'll do some tests in Python 3.8; the error shows 3.7. Does it have the same issue in 3.7?

Also, the Java error is a generic Py4J-layer error. The execution gets stuck when launching the Py4J gateway.

Do you have Java 1.8 installed and available on the system path? (`java -version`)
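Since the setup in this thread targets Java 8, a small parser for the `java -version` banner can make that check scriptable (the helper and its handling of the Java 9+ version scheme are my own sketch):

```python
import re

def java_major_version(banner_line: str) -> int:
    """Extract the major version from a `java -version` banner line,
    e.g. 'openjdk version "1.8.0_282"' -> 8 and '... "11.0.2"' -> 11."""
    m = re.search(r'"(\d+)\.(\d+)', banner_line)
    if m is None:
        raise ValueError(f"unrecognized banner: {banner_line!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    return minor if major == 1 else major  # legacy "1.x" scheme means Java x

print(java_major_version('openjdk version "1.8.0_282"'))  # -> 8
```

In a notebook this could be fed the first line of `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`.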

Octavian-act commented 3 years ago

That was for reference and good to know: on Python 3.8.5 it was not passing the Spark session builder step. I will do some more tests on Monday (VPN on/off; it is always on, but good to know). Yes, Java is 8; `java -version` in the notebook displays:


openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00)
OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
maziyarpanahi commented 3 years ago

Is there any setup in which you can use the fat JAR and load any pretrained model? Because if this fails, I will have to ask someone to contact you over a shared screen; it is not supposed to be this long and complicated. Even in highly restricted environments this takes up to 15 minutes to set up, so your system needs a closer look.

Octavian-act commented 3 years ago

Great idea @maziyarpanahi: have a dev on your side look at my environment. Please use my work email, Octavian.Filoti@bms.com, and send the invite via Teams, Zoom, or Webex. I am available all this afternoon, with a tiny gap: 14:30-14:45 EST.

maziyarpanahi commented 3 years ago

Sure @Octavian-act, I asked someone to contact you regarding your setup.

Octavian-act commented 3 years ago

@maziyarpanahi no one has contacted me yet :-)

Octavian-act commented 3 years ago

@maziyarpanahi I still hope to be able to use JSL software. Please have someone contact me to test/debug further. Thank you.