Closed Octavian-act closed 3 years ago
Hi,
As mentioned in the conversation, this is not an issue we can solve on our side. If you don't have access to the Internet, whether you are behind a proxy or a firewall, you cannot use Maven coordinates, since those require downloading dependencies like any other Maven package. So you use the Fat JAR. As I mentioned, you also need access to S3 if you are using pretrained(), because it needs to download models/pipelines. If you don't have that access, you need to download the models, extract them, and use .load() instead.
We have many users in Healthcare who must work in air-gapped environments with zero access to the Internet. So they go with the Fat JAR and offline .load().
Either choose the Fat JAR and manually download/load models, or please use Google Colab/Kaggle for testing. You cannot download via pretrained() if you don't have any access, and we cannot do anything about that. (Either go fully offline, or ask your admin to whitelist some endpoints for you.)
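The offline workflow described above (download a model archive manually, extract it, then use .load()) can be sketched like this. The helper name and paths are illustrative, not part of the Spark NLP API; the actual .load() call needs a live Spark session, so it is shown in a comment:

```python
import os
import zipfile

def extract_offline_model(archive_path: str, target_dir: str) -> str:
    """Unpack a manually downloaded model archive so the extracted
    folder can be passed to an annotator's .load() instead of
    .pretrained(). Paths and model names here are illustrative."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)
    # With a running Spark session, the offline load would then be:
    #   from sparknlp.annotator import SentenceDetectorDLModel
    #   model = SentenceDetectorDLModel.load(target_dir)
    return target_dir
```

This keeps the download step entirely outside the cluster, which is why it works in air-gapped environments.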
: com.amazonawsShadedAmazonClientException: Unable to execute HTTP request: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
I understand, but please read carefully: you do have a bug. The latest release, 3.0.x, does not work; I get a NullPointerException regardless of whether my VPN is enabled or disabled (see 2.2), whereas 2.7.5 works: it does download the pretrained model once I disable the VPN. This is not the case for your latest release (I pasted 3.0.1 above because that was the latest trial, but I tried 3.0.0 as well).
Fair enough. But locally and in Colab I cannot reproduce this, and I might not have read the whole post correctly since it's not formatted correctly:
Here is an example of what you are doing: https://colab.research.google.com/drive/1Jd6prxgvZdVORLTn0u-VfPTS3Ex_rxWt?usp=sharing
As you can see, it works just fine; nothing has changed from 2.7.x to 3.x regarding pretrained models and our Fat JARs. The only thing I can find in there is that you keep referencing the following together:
```
pip install spark-nlp==2.7.5 pyspark==2.4.4
```
and
```python
.config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")
```
If you are using spark-nlp-assembly-3.0.1.jar, you must install pyspark 3.x, and you also have to install spark-nlp==3.0.1. Maybe it's a copy/paste mistake, but make sure you have a clean environment, properly upgrading pyspark to 3.x and spark-nlp to 3.0.1 before using spark-nlp-assembly-3.0.1.jar.
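The mismatch warned about above (a 2.7.5 pip install paired with a 3.0.1 assembly JAR) can be caught with a trivial check before starting the session. This is an illustrative sketch; the helper name is made up and the version pattern is assumed from the JAR naming shown in the thread:

```python
import re

def jar_matches_package(jar_name: str, spark_nlp_version: str) -> bool:
    """Return True when the version embedded in the Fat JAR file name
    (e.g. spark-nlp-assembly-3.0.1.jar) equals the installed
    spark-nlp Python package version (sparknlp.version())."""
    m = re.search(r"spark-nlp-assembly-(\d+\.\d+\.\d+)\.jar$", jar_name)
    return bool(m) and m.group(1) == spark_nlp_version
```

For example, `jar_matches_package("spark-nlp-assembly-3.0.1.jar", "2.7.5")` returns False, which is exactly the bad combination quoted above.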
I have two separate clean environments for the 2.7.5 and 3.0.1 versions, so there is no mixture. Following exactly your example above: https://colab.research.google.com/drive/1Jd6prxgvZdVORLTn0u-VfPTS3Ex_rxWt?usp=sharing I still get a NullPointerException on my macOS:
```
! java -version
! echo $JAVA_HOME
! which java
! SPARKHOME="/Users/filotio/Downloads/spark-3.1.1-bin-hadoop2.7"
! export SPARK_HOME=$SPARKHOME
```
getting:
```
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00)
OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
/Library/Java/JavaVirtualMachines/openjdk-8.jdk/Contents/Home
/usr/local/opt/openjdk@8/bin/java
```
```python
import sparknlp
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory", "16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
    .getOrCreate()

print("Apache Spark version:", spark.version)
print("Spark NLP version", sparknlp.version())
spark
```
getting:
```
Apache Spark version: 3.1.1
Spark NLP version 3.0.1
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.1.1    Master: local[*]    AppName: Spark NLP
```
and now when:
```python
from sparknlp.base import *
from sparknlp.annotator import *

sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")
```
getting:
```
sentence_detector_dl download started this may take some time.
Py4JJavaError Traceback (most recent call last)
```
I only shared Colab because it's the only environment that is the same for everyone in the entire world. Also, it's the only thing I can share to show you there is no issue with the Fat JAR and pretrained models. My primary laptop is a MacBook Pro and I have been using macOS for the last 10 years for all of my projects. (Also, macOS is Unix-based, so technically what works on Linux works on macOS; Windows is not a friendly environment when it comes to Spark.)
Anyway, I couldn't reproduce this issue on Ubuntu 16, 18, or 20, macOS Big Sur, or Windows 10! Please let me know if I missed anything. If I can find a firewall or a proxy I will try that, but I can only suggest:
These are obviously not a final solution, but they help zero in on the actual issue (regardless of who should fix it) so I can reproduce it. Just a Fat JAR and .pretrained() has never been an issue in any environment.
One more thing: in 2.7.5, the aws-java-sdk was 1.7.4, which was really old. In 3.0.1 we are using 1.11.603, which might have some bugs when used behind a proxy.
If you are interested, I can make a Fat JAR with a newer aws-java-sdk and let you test it. Would you be interested?
Ohh, nice, yes I am interested; if you can build that, let me know. As you can imagine, my work macOS has extra protections compared to a "normal" one, and since the devops team was slow in enabling Spark NLP on AWS, I am trying to test locally and help them :-)
Very nice. Since you have an environment that can reproduce this error, I'll send you a Fat JAR to test on Slack. I think the issue is this: https://github.com/aws/aws-sdk-java/issues/2070 (apparently it happened only in 1.11.603). It was also reported here a while back: https://github.com/JohnSnowLabs/spark-nlp/issues/1174
Yes, that most likely seems to be the issue. We do set those proxy env vars. I do not have any issues on my private macOS; I just tried (it was harder to install Java 8) :-)
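Since the thread suspects the newer aws-java-sdk behaves badly behind a proxy, one detail worth spelling out: shell proxy environment variables are not automatically visible as JVM system properties inside Spark. The sketch below translates them; the helper name is made up, and passing the result via spark.driver.extraJavaOptions is an assumption about a possible workaround, not something the thread confirms fixes this bug:

```python
import os

def proxy_java_options(env=os.environ) -> str:
    """Translate an https_proxy value like 'http://proxy.example.com:8080'
    into -Dhttps.proxyHost/-Dhttps.proxyPort JVM flags, which could be
    handed to Spark via:
      .config("spark.driver.extraJavaOptions", proxy_java_options())
    (illustrative only; whether the shaded AWS SDK honors these
    properties in this setup is an assumption)."""
    proxy = env.get("https_proxy") or env.get("HTTPS_PROXY")
    if not proxy:
        return ""
    host_port = proxy.split("://", 1)[-1].rstrip("/")
    host, _, port = host_port.partition(":")
    opts = [f"-Dhttps.proxyHost={host}"]
    if port:
        opts.append(f"-Dhttps.proxyPort={port}")
    return " ".join(opts)
```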
@maziyarpanahi asking here (Slack is unavailable behind the proxy :-)): https://nlp.johnsnowlabs.com/docs/en/models How can I download the models, and from where (for offline use)? Many thanks
Oh, I sent you the Fat JAR via Slack. I'll put it on S3 and share the link. All of our models/pipelines are here: https://nlp.johnsnowlabs.com/models
They all have examples and a download link inside.
Tried it on AWS; it's not working, but I cannot figure out how to change the Spark NLP version:
```python
import os
os.environ["JAVA_HOME"] = "/home/ec2-user/anaconda3/envs/JupyterSystemEnv"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
```
```
! java -version
! echo $JAVA_HOME
! which java
! SPARKHOME="/home/ec2-user/SageMaker/spark-3.1.1-bin-hadoop2.7"
! export SPARK_HOME=$SPARKHOME
! pip install --upgrade pyspark==3.1.1 spark-nlp==3.0.1 findspark
! pip install --upgrade spark-nlp==3.0.1
! pip install --upgrade findspark
```
response:
```
Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper,
              : versionlock
amzn-main    | 2.1 kB  00:00
amzn-updates | 3.8 kB  00:00
Package 1:java-1.8.0-openjdk-1.8.0.282.b08-1.61.amzn1.x86_64 already installed and latest version
Nothing to do
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (Zulu 8.48.0.53-CA-linux64) (build 1.8.0_265-b11)
OpenJDK 64-Bit Server VM (Zulu 8.48.0.53-CA-linux64) (build 25.265-b11, mixed mode)
/home/ec2-user/anaconda3/envs/JupyterSystemEnv
/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin/java
Requirement already satisfied: pyspark==3.1.1 in /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages (3.1.1)
Requirement already satisfied: py4j==0.10.9 in /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages (from pyspark==3.1.1) (0.10.9)
Requirement already satisfied: spark-nlp==3.0.1 in /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages (3.0.1)
Requirement already satisfied: findspark in /home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages (1.4.2)
```
and now for:
```python
from pyspark.sql import SparkSession
import sparknlp

! ls -ltr /home/ec2-user/SageMaker/spark-nlp-assembly-aws-fix.jar
print("Spark NLP version", sparknlp.version())

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory", "16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/home/ec2-user/SageMaker/spark-nlp-assembly-aws-fix.jar")\
    .getOrCreate()

print("Apache Spark version:", spark.version)
spark
```
I am getting Spark NLP version 2.7.4 and cannot figure out why:
```
-rw-rw-r-- 1 ec2-user ec2-user 453870407 Apr 9 22:19 /home/ec2-user/SageMaker/spark-nlp-assembly-aws-fix.jar
Spark NLP version 2.7.4
Exception Traceback (most recent call last)
```
For my work macOS:
when calling:
```python
from sparknlp.base import *
from sparknlp.annotator import *

sentencerDL = SentenceDetectorDLModel.load("/Users/filotio/Documents/BMS/Vendors/JSL/Models/sentence_detector_dl_en_2.7.0_2.4_1609611052663")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")
```
getting back:
```
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1207, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1033, in send_command
    response = connection.send_command(command)
  File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1212, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
```
Py4JError Traceback (most recent call last)
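Because Py4J only surfaces the generic "Answer from Java side is empty" crash, a quick sanity check on the model folder before calling .load() can rule out a bad or partially extracted path. This is a heuristic sketch that assumes the Spark ML on-disk save layout (a `metadata` subdirectory); the helper is not a Spark NLP API:

```python
import os

def looks_like_saved_model(path: str) -> bool:
    """Heuristic: a model saved in Spark ML's format keeps a
    'metadata' subdirectory; if it is missing, .load() is probably
    being pointed at the wrong (or incompletely extracted) folder."""
    return os.path.isdir(os.path.join(path, "metadata"))
```

A False result here points at a path or extraction problem rather than a library bug, which helps separate the .load() issue from the AWS one.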
Hi, let's focus on one issue at a time. The first one is AWS, the second one is .load(). (We can get to these later.)
Could you please try the custom Fat JAR to see if the initial reported error has been solved by using pretrained() like it used to work in 2.7.5?
When using the Conda base env (Python 3.8.5) instead of the sparknlp env (Python 3.7.10), it fails in:
```python
spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory", "16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-aws-fix.jar")\
    .getOrCreate()
```
Using sparknlp conda env, which has:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00)
OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
Spark NLP version 3.0.1
Apache Spark version: 3.1.1
When executing:
```python
sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")
```
I get back:
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[ \ ]
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1207, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1033, in send_command
response = connection.send_command(command)
File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py", line 1212, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
[OK!]
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
<ipython-input-3-0ddcc01e0819> in <module>
4
5 sentencerDL = SentenceDetectorDLModel\
----> 6 .pretrained("sentence_detector_dl", "en") \
7 .setInputCols(["document"]) \
8 .setOutputCol("sentences")
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
3107 def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
3108 from sparknlp.pretrained import ResourceDownloader
-> 3109 return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)
3110
3111
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
39 t1.start()
40 try:
---> 41 j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
42 finally:
43 stop_threads = True
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, reader, name, language, remote_loc, validator)
174 class _DownloadModel(ExtendedJavaWrapper):
175 def __init__(self, reader, name, language, remote_loc, validator):
--> 176 super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained."+validator+".downloadModel", reader, name, language, remote_loc)
177
178
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
127 super(ExtendedJavaWrapper, self).__init__(java_obj)
128 self.sc = SparkContext._active_spark_context
--> 129 self._java_obj = self.new_java_obj(java_obj, *args)
130 self.java_obj = self._java_obj
131
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
137
138 def new_java_obj(self, java_class, *args):
--> 139 return self._new_java_obj(java_class, *args)
140
141 def new_java_array(self, pylist, java_class):
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
64 java_obj = getattr(java_obj, name)
65 java_args = [_py2java(sc, arg) for arg in args]
---> 66 return java_obj(*java_args)
67
68 @staticmethod
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
334 raise Py4JError(
335 "An error occurred while calling {0}{1}{2}".
--> 336 format(target_id, ".", name))
337 else:
338 type = answer[1]
Py4JError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel
and from the terminal I see:
ERROR:asyncio:Exception in callback <TaskWakeupMethWrapper object at 0x7f8007a48510>(<Future finis...b06"\r\n\r\n'>)
handle: <Handle <TaskWakeupMethWrapper object at 0x7f8007a48510>(<Future finis...b06"\r\n\r\n'>)>
Traceback (most recent call last):
File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
RuntimeError: Cannot enter into task <Task pending coro=<HTTP1ServerConnection._server_request_loop() running at /Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/tornado/http1connection.py:823> wait_for=<Future finished result=b'GET /kernel...ab06"\r\n\r\n'> cb=[IOLoop.add_future.<locals>.<lambda>() at /Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/tornado/ioloop.py:688]> while another task <Task pending coro=<MultiKernelManager._async_start_kernel() running at /Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py:214>> is being executed.
ERROR:asyncio:Exception in callback <TaskWakeupMethWrapper object at 0x7f8007a2d250>(<Future finis...b06"\r\n\r\n'>)
handle: <Handle <TaskWakeupMethWrapper object at 0x7f8007a2d250>(<Future finis...b06"\r\n\r\n'>)>
Traceback (most recent call last):
File "/Users/filotio/opt/anaconda3/envs/sparknlp/lib/python3.7/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
And I get a core dump, attached here: hs_err_pid2416.log
I'll do some tests in Python 3.8; the error shows 3.7, though. Does it have the same issue in 3.7?
Also, the Java error is a generic Py4J-layer error: the execution gets stuck when launching the Py4J gateway.
Do you have Java 1.8 installed and available on system path? (java -version)
That was for reference and good to know: on Python 3.8.5 it was not getting past the Spark session builder step. I will do some more tests on Monday (VPN on/off; it is always on, but good to know). Yes, Java is 8; `java -version` in the notebook displays:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00)
OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
Is there any setup in which you can use the Fat JAR and any pretrained model? Because if this fails, then I have to ask someone to contact you with a shared screen; it's not supposed to be this long and complicated. Even in highly restricted environments this takes up to 15 minutes to set up; your system needs a closer look.
Great idea @maziyarpanahi: have a dev on your side look at my environment. Please use my work email, Octavian.Filoti@bms.com, and use Teams, Zoom, or Webex for the invite. I am available all this afternoon, with a tiny gap: 14:30-14:45 EST.
Sure @Octavian-act I asked someone to contact you regarding your setup.
@maziyarpanahi no one has contacted me yet :-)
@maziyarpanahi I still hope to be able to use JSL software. Please have someone contact me to test/debug further. Thank you.
Description
I have started this conversation:
https://spark-nlp.slack.com/archives/CA118BWRM/p1617225602087300
and based on the response, I tried Fat JARs on my work laptop. Using the Fat JARs, it moved past the session-starting step, but it fell short in sentence detection, and there are big differences between spark-nlp 2.7.x and 3.0.x, as detailed below:
1.1. On Spark NLP version 2.7.5: I got a timeout when the company's VPN is enabled (on my work macOS laptop):
```
Apache Spark version: 2.4.4
Spark NLP version 2.7.5
sentence_detector_dl download started this may take some time.
Py4JJavaError Traceback (most recent call last)
```