JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

unable to load spark-nlp pretrained models in a spark session in Jupiter notebooks #9262

Closed tomerenshtein closed 2 years ago

tomerenshtein commented 2 years ago

Unable to load pretrained models with the .pretrained() method.

Description

When trying to load a pretrained model, as in:

bert = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False)

the load process appears to be running (the cursor spinner shows [ / ]), but it can run for hours and nothing happens.
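One way to see where it stalls (a minimal sketch, assuming an active SparkSession named spark) is to raise the log level before calling .pretrained():

# Raise Spark's log verbosity before the download so stalls show up in the
# console; "INFO" is just an example level.
spark.sparkContext.setLogLevel("INFO")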

Not sure if this is related, but I get these messages when starting the Spark session (when performing other Spark tasks, like just manipulating DataFrames, it did not make a difference):

22/06/10 21:25:32 WARN Utils: Your hostname, tomershtein-Vostro-5490 resolves to a loopback address: 127.0.1.1; using 192.168.0.24 instead (on interface wlp0s20f3)
22/06/10 21:25:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/tomershtein/projects/ldgn-websites/venv/lib/python3.8/site-packages/pyspark/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Ivy Default Cache set to: /home/tomershtein/.ivy2/cache
The jars for the packages stored in: /home/tomershtein/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-09520e50-0003-4e81-9af7-662705d49c6d;1.0
    confs: [default]
:: loading settings :: url = jar:file:/home/tomershtein/projects/ldgn-websites/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    found com.johnsnowlabs.nlp#spark-nlp_2.12;3.4.4 in central
    found com.typesafe#config;1.4.2 in central
    found org.rocksdb#rocksdbjni;6.5.3 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.603 in central
    found com.github.universal-automata#liblevenshtein;3.0.0 in central
    found com.google.code.findbugs#annotations;3.0.1 in central
    found net.jcip#jcip-annotations;1.0 in central
    found com.google.code.findbugs#jsr305;3.0.1 in central
    found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
    found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
    found com.google.code.gson#gson;2.3 in central
    found it.unimi.dsi#fastutil;7.0.12 in central
    found org.projectlombok#lombok;1.16.8 in central
    found org.slf4j#slf4j-api;1.7.21 in central
    found com.navigamez#greex;1.0 in central
    found dk.brics.automaton#automaton;1.11-8 in central
    found org.json4s#json4s-ext_2.12;3.5.3 in central
    found joda-time#joda-time;2.9.5 in central
    found org.joda#joda-convert;1.8.1 in central
    found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 in central
:: resolution report :: resolve 426ms :: artifacts dl 10ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.603 from central in [default]
    com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
    com.google.code.findbugs#annotations;3.0.1 from central in [default]
    com.google.code.findbugs#jsr305;3.0.1 from central in [default]
    com.google.code.gson#gson;2.3 from central in [default]
    com.google.protobuf#protobuf-java;3.0.0-beta-3 from central in [default]
    com.google.protobuf#protobuf-java-util;3.0.0-beta-3 from central in [default]
    com.johnsnowlabs.nlp#spark-nlp_2.12;3.4.4 from central in [default]
    com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 from central in [default]
    com.navigamez#greex;1.0 from central in [default]
    com.typesafe#config;1.4.2 from central in [default]
    dk.brics.automaton#automaton;1.11-8 from central in [default]
    it.unimi.dsi#fastutil;7.0.12 from central in [default]
    joda-time#joda-time;2.9.5 from central in [default]
    net.jcip#jcip-annotations;1.0 from central in [default]
    org.joda#joda-convert;1.8.1 from central in [default]
    org.json4s#json4s-ext_2.12;3.5.3 from central in [default]
    org.projectlombok#lombok;1.16.8 from central in [default]
    org.rocksdb#rocksdbjni;6.5.3 from central in [default]
    org.slf4j#slf4j-api;1.7.21 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   20  |   0   |   0   |   0   ||   20  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-09520e50-0003-4e81-9af7-662705d49c6d
    confs: [default]
    0 artifacts copied, 20 already retrieved (0kB/13ms)
22/06/10 21:25:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
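The loopback warning at the top can be silenced per the log's own hint; a minimal sketch, assuming you want to pin the bind address, set before building the session:

import os

# Pin the address Spark binds to, as the WARN message suggests; 127.0.0.1 is
# an example value only.
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"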

Expected Behavior

The model should load, I guess :)

Current Behavior

The load can run for hours and nothing happens. When interrupted, it shows the following:

small_bert_L2_768 download started this may take some time.
Approximate size to download 139.6 MB
[ / ]small_bert_L2_768 download started this may take some time.
Approximate size to download 139.6 MB
[OK!]
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 bert = BertEmbeddings.pretrained().setInputCols(["sentence","token"])\
      2 .setOutputCol("bert")\
      3 .setCaseSensitive(False)

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/sparknlp/annotator.py:6631, in BertEmbeddings.pretrained(name, lang, remote_loc)
   6613 """Downloads and loads a pretrained model.
   6614 
   6615 Parameters
   (...)
   6628     The restored model
   6629 """
   6630 from sparknlp.pretrained import ResourceDownloader
-> 6631 return ResourceDownloader.downloadModel(BertEmbeddings, name, lang, remote_loc)

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/sparknlp/pretrained.py:59, in ResourceDownloader.downloadModel(reader, name, language, remote_loc, j_dwn)
     57 t1.start()
     58 try:
---> 59     j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
     60 except Py4JJavaError as e:
     61     sys.stdout.write("\n" + str(e))

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/sparknlp/internal.py:213, in _DownloadModel.__init__(self, reader, name, language, remote_loc, validator)
    212 def __init__(self, reader, name, language, remote_loc, validator):
--> 213     super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel", reader,
    214                                          name, language, remote_loc)

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/sparknlp/internal.py:165, in ExtendedJavaWrapper.__init__(self, java_obj, *args)
    163 super(ExtendedJavaWrapper, self).__init__(java_obj)
    164 self.sc = SparkContext._active_spark_context
--> 165 self._java_obj = self.new_java_obj(java_obj, *args)
    166 self.java_obj = self._java_obj

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/sparknlp/internal.py:175, in ExtendedJavaWrapper.new_java_obj(self, java_class, *args)
    174 def new_java_obj(self, java_class, *args):
--> 175     return self._new_java_obj(java_class, *args)

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/pyspark/ml/wrapper.py:66, in JavaWrapper._new_java_obj(java_class, *args)
     64     java_obj = getattr(java_obj, name)
     65 java_args = [_py2java(sc, arg) for arg in args]
---> 66 return java_obj(*java_args)

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/py4j/java_gateway.py:1303, in JavaMember.__call__(self, *args)
   1296 args_command, temp_args = self._build_args(*args)
   1298 command = proto.CALL_COMMAND_NAME +\
   1299     self.command_header +\
   1300     args_command +\
   1301     proto.END_COMMAND_PART
-> 1303 answer = self.gateway_client.send_command(command)
   1304 return_value = get_return_value(
   1305     answer, self.gateway_client, self.target_id, self.name)
   1307 for temp_arg in temp_args:

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/py4j/java_gateway.py:1033, in GatewayClient.send_command(self, command, retry, binary)
   1031 connection = self._get_connection()
   1032 try:
-> 1033     response = connection.send_command(command)
   1034     if binary:
   1035         return response, self._create_connection_guard(connection)

File ~/projects/ldgn-websites/venv/lib/python3.8/site-packages/py4j/java_gateway.py:1200, in GatewayConnection.send_command(self, command)
   1196     raise Py4JNetworkError(
   1197         "Error while sending", e, proto.ERROR_ON_SEND)
   1199 try:
-> 1200     answer = smart_decode(self.stream.readline()[:-1])
   1201     logger.debug("Answer received: {0}".format(answer))
   1202     if answer.startswith(proto.RETURN_MESSAGE):

File /usr/lib/python3.8/socket.py:669, in SocketIO.readinto(self, b)
    667 while True:
    668     try:
--> 669         return self._sock.recv_into(b)
    670     except timeout:
    671         self._timeout_occurred = True

KeyboardInterrupt: 

Possible Solution

Steps to Reproduce

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/home/tomershtein/projects/ldgn-websites/venv/lib/python3.8/site-packages/pyspark"

# Point findspark at the venv's pyspark before importing it
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
import sparknlp

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4") \
    .getOrCreate()

spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

bert = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False)
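For comparison, the same session can be created with Spark NLP's own helper; a minimal sketch, assuming spark-nlp 3.4.4 is installed via pip:

import sparknlp

# start() builds a local[*] SparkSession with the matching spark-nlp package
# and recommended memory/serializer settings preconfigured.
spark = sparknlp.start()
print(sparknlp.version(), spark.version)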

Context

Working on a PoC for an NER model; currently totally stuck.

Your Environment


* Setup and installation (Pypi, Conda, Maven, etc.): pip
* Operating System and version: Ubuntu 20.04
* Link to your project (if any):

maziyarpanahi commented 2 years ago

There is likely an issue with your network/firewall, storage/permissions, or some other system-related configuration.

As you can see in Colab, which uses Java 11 and Ubuntu 20.04 (and in many other places), it takes less than a minute to download and load the default BERT model: https://colab.research.google.com/drive/1coAmaqk4A1YvQ6Md38jjXDehHnPyJCID?usp=sharing

I would first check and clean the ~/cache_pretrained directory, and make sure you look carefully through all the logs to find the actual problem.
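For the cache cleanup, a minimal sketch (assuming the default location, ~/cache_pretrained):

import os
import shutil

# Remove any partially downloaded models so the next .pretrained() call
# starts from a clean cache; ~/cache_pretrained is Spark NLP's default.
cache_dir = os.path.expanduser("~/cache_pretrained")
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)

If downloads keep stalling, models can also be fetched manually from https://sparknlp.org/models and loaded offline with BertEmbeddings.load() pointed at the unzipped model folder, instead of .pretrained().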