JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0
3.87k stars 712 forks

Export from HuggingFace BERT for Sequence Classification didn't work in AWS SageMaker #6436

Closed xegulon closed 3 years ago

xegulon commented 3 years ago

Hi, I followed the official notebook for converting a HuggingFace model to Spark NLP format (BERT for Sequence Classification), but I get a Java error.

Description

Everything goes well, until I reach this cell of the official notebook:

import sparknlp
from sparknlp.annotator import *

spark = sparknlp.start(gpu=True)
sequenceClassifier = BertForSequenceClassification.loadSavedModel(
    '{}/saved_model/1'.format(MODEL_NAME),
    spark
) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("class") \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(128)

Expected Behavior

The cell should run and load the saved model without crashing, as in the official notebook.

Current Behavior

I get the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-16aa03d20195> in <module>
      2 from sparknlp.annotator import *
      3 
----> 4 spark = sparknlp.start(gpu=True)
      5 sequenceClassifier = BertForSequenceClassification.loadSavedModel(
      6      '{}/saved_model/1'.format(MODEL_NAME),

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sparknlp/__init__.py in start(gpu, spark23, spark24, memory, cache_folder, log_folder, cluster_tmp_dir, real_time_output, output_level)
    256             return SparkRealTimeOutput()
    257     else:
--> 258         spark_session = start_without_realtime_output()
    259         return spark_session
    260 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sparknlp/__init__.py in start_without_realtime_output()
    164             builder.config("spark.jsl.settings.storage.cluster_tmp_dir", cluster_tmp_dir)
    165 
--> 166         return builder.getOrCreate()
    167 
    168     def start_with_realtime_output():

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pyspark/sql/session.py in getOrCreate(self)
    226                             sparkConf.set(key, value)
    227                         # This SparkContext may be an existing one.
--> 228                         sc = SparkContext.getOrCreate(sparkConf)
    229                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                     # by all sessions.

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pyspark/context.py in getOrCreate(cls, conf)
    390         with SparkContext._lock:
    391             if SparkContext._active_spark_context is None:
--> 392                 SparkContext(conf=conf or SparkConf())
    393             return SparkContext._active_spark_context
    394 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                 " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    337         with SparkContext._lock:
    338             if not SparkContext._gateway:
--> 339                 SparkContext._gateway = gateway or launch_gateway(conf)
    340                 SparkContext._jvm = SparkContext._gateway.jvm
    341 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise RuntimeError("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

RuntimeError: Java gateway process exited before sending its port number

In fact, I think the problem comes from the instruction: spark = sparknlp.start(gpu=True).
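For context, PySpark raises this RuntimeError when it cannot launch the JVM behind the Spark gateway, which usually means no usable java executable is visible to the notebook kernel. A hedged pre-flight check (my own sketch, not part of Spark NLP) that can be run in a notebook cell before calling sparknlp.start() might look like:

```shell
# Pre-flight check before sparknlp.start(): PySpark's launch_gateway needs
# a `java` executable on PATH; if none is found, the gateway dies with
# "Java gateway process exited before sending its port number".
if command -v java >/dev/null 2>&1; then
    java -version
else
    echo "No java on PATH -- install a JDK (Java 8 for Spark 3.x) first"
fi
```

If this prints the "No java on PATH" line, the failure is in the environment, not in the model export itself.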

Possible Solution

I don't know.

Steps to Reproduce

Read above.

Context

Trying to reproduce the script with my own model.

Your Environment

I installed Spark and Spark NLP with a plain pip install.
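A plain pip install brings in PySpark and Spark NLP but no JDK, which Spark requires. As a hedged sketch of the kind of setup the Colab/Kaggle scripts perform, adapted for a SageMaker conda environment (the conda-forge channel and the openjdk=8 pin are my assumptions, not taken from an official script):

```shell
# Hypothetical SageMaker setup sketch: install a JDK into the active
# conda env (Spark 3.x expects Java 8/11), then the Python packages.
conda install -y -c conda-forge openjdk=8   # assumption: conda-forge build
pip install --upgrade pyspark spark-nlp
```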

maziyarpanahi commented 3 years ago
xegulon commented 3 years ago

Thanks a lot. I think it would be great to add a SageMaker setup script alongside the Kaggle and Colab ones.

:+1:

maziyarpanahi commented 3 years ago

@xegulon That makes total sense! I'll add this to the list so we make a nice step-by-step setup in the repo and on the Website.

xegulon commented 3 years ago

I could do a PR for that; it would be good training for me ;)

maziyarpanahi commented 3 years ago

That'd be great! Thank you @xegulon. Step-by-step instructions coming from users are always helpful, since they have first-hand experience.

xegulon commented 3 years ago

I'll do it then!

xegulon commented 3 years ago

It's here: https://github.com/JohnSnowLabs/spark-nlp/pull/6449

maziyarpanahi commented 3 years ago

Many thanks! I'll ask to also have a short DNS path, like the ones we have for Colab and Kaggle, in the next release.

xegulon commented 3 years ago

Yeah, that'd be convenient! Should I modify anything in the code I've made?

maziyarpanahi commented 3 years ago

No, they will create a redirect DNS subdomain; no need to change anything in the script, it looks sufficient already. Many thanks again @xegulon