JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Cannot download Deep Learning models from SparkNLP model hub #14378

Open olivierr42 opened 2 weeks ago

olivierr42 commented 2 weeks ago

Is there an existing issue for this?

Who can help?

@maziyarpanahi I saw that you answered similar requests in the past. Thank you in advance.

What are you working on?

I am working with an in-house dataset. This is not an official example. I am trying to use this model specifically: https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/embeddings/xlm_roberta_embeddings/index.html

I get the same error when trying to load the SentenceDetectorDL model (mentioned on the Hub page for this model).

Current Behavior

When I try to instantiate my pipeline:

  document_assembler = DocumentAssembler().setInputCol(input_col).setOutputCol("document")

  sentencer = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

  embeddings = (
      XlmRoBertaSentenceEmbeddings.pretrained("multilingual_e5_base", "xx")
      .setInputCols(["sentence"])
      .setOutputCol(output_col)
  )

  pipeline = Pipeline().setStages([document_assembler, sentencer, embeddings])

I get the following error:

answer = 'xro63'
gateway_client = <py4j.clientserver.JavaClient object at 0x13f3dd710>
target_id = 'z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader'
name = 'downloadModel'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.

        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.

        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
>                   raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
                        format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
E                   : java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path

Expected Behavior

I know support for M1 is experimental, but I would expect it not to crash, especially since I am able to run Word2Vec models without issue.

Steps To Reproduce

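  from pyspark.ml import Pipeline
  from sparknlp.base import DocumentAssembler
  from sparknlp.annotator import SentenceDetector, XlmRoBertaSentenceEmbeddings

  # input_col and output_col are column-name strings defined elsewhere
  document_assembler = DocumentAssembler().setInputCol(input_col).setOutputCol("document")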

  sentencer = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

  embeddings = (
      XlmRoBertaSentenceEmbeddings.pretrained("multilingual_e5_base", "xx")
      .setInputCols(["sentence"])
      .setOutputCol(output_col)
  )

  pipeline = Pipeline().setStages([document_assembler, sentencer, embeddings])

Spark NLP version and Apache Spark

sparknlp = '5.3.3'
pyspark = '3.5.1'

Type of Spark Application

Python Application

Java Version

java version "1.8.0_411"

Java Home Directory

/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home

Setup and installation

poetry add sparknlp=5.3.3

Operating System and Version

Mac M1, macOS Sonoma 14.5

Link to your project (if available)

No response

Additional Information

I do not have any issues with Word2Vec models. I also tried Spark NLP 5.4.1, to no avail.

maziyarpanahi commented 2 weeks ago

Hi @olivierr42

Support for Apple Silicon is experimental at this point. This is true for all the DL-based models/annotators. Word2Vec is written purely as a machine learning algorithm (no deep learning backend), so it works independently of the operating system.
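
For readers hitting the same `no jnitensorflow in java.library.path` error on an M1/M2 machine, here is a minimal sketch of starting the session with the Apple Silicon build selected. It assumes that `sparknlp.start()` accepts an `apple_silicon` flag, as it does in the 5.x releases to my knowledge; the helper names `is_apple_silicon` and `start_session` are hypothetical. Nothing in this thread confirms that this resolves the issue.

```python
import platform


def is_apple_silicon(system=None, machine=None) -> bool:
    """Return True when the interpreter reports macOS on an arm64 CPU.

    The arguments exist only to make the check testable; by default the
    values come from the running interpreter. A Python process running
    under Rosetta reports x86_64, so this check would return False there.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


def start_session():
    """Start Spark NLP, requesting the Apple Silicon build when applicable.

    Assumption: sparknlp.start() has an `apple_silicon` keyword argument.
    The import is done lazily so the platform check runs without Spark.
    """
    import sparknlp

    return sparknlp.start(apple_silicon=is_apple_silicon())
```

Calling `start_session()` before building the pipeline would replace a plain `sparknlp.start()`; if the session is created some other way, the matching Apple Silicon artifact would need to be selected there instead.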

olivierr42 commented 2 weeks ago

It seems like the issue is with downloading the model. There seems to be a way to load models from local storage, but I cannot get it to work (it tries to find an assets subfolder within the model folder, which does not exist if I download from the provided URL).

Do you have any tips to make it work locally?

maziyarpanahi commented 2 weeks ago

What is the error when downloading models? You can always test it quickly in Google Colab to be sure whether it's the model or your environment.

Spark NLP works 100% offline. You can follow these instructions, which show how to download any model, extract it, and use .load() instead of .pretrained(): https://sparknlp.org/docs/en/install#offline

PS: Your Spark application must have access to that local path
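
The offline route described above can be sketched roughly as follows. This is an illustration under assumptions, not the documented procedure: the helper names (`extract_model`, `looks_like_spark_nlp_model`, `load_offline_embeddings`) and the default `sentence_embeddings` column name are hypothetical, and the layout check assumes an extracted Spark NLP model contains a `metadata` folder, as saved Spark ML stages do. The `assets` folder mentioned earlier in the thread may point to `loadSavedModel` having been used, which expects a raw TensorFlow export rather than a Hub archive, but that is a guess.

```python
import zipfile
from pathlib import Path


def extract_model(zip_path, target_dir) -> Path:
    """Unpack a model archive downloaded from the Models Hub into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def looks_like_spark_nlp_model(model_dir) -> bool:
    """Heuristic: saved Spark ML stages keep their parameters under metadata/."""
    return (Path(model_dir) / "metadata").is_dir()


def load_offline_embeddings(model_dir, output_col="sentence_embeddings"):
    """Load the extracted model with .load() instead of .pretrained().

    Requires a running Spark NLP session, and model_dir must be readable
    by the Spark application. Imported lazily so the helpers above work
    without Spark installed.
    """
    from sparknlp.annotator import XlmRoBertaSentenceEmbeddings

    return (
        XlmRoBertaSentenceEmbeddings.load(str(model_dir))
        .setInputCols(["sentence"])
        .setOutputCol(output_col)
    )
```

Loading from disk only removes the download step. If the native TensorFlow library cannot be loaded on the machine, `.load()` may well fail in the same way as `.pretrained()`.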