JohnSnowLabs / johnsnowlabs

Gateway into the John Snow Labs Ecosystem
https://nlp.johnsnowlabs.com
Apache License 2.0

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: #1011

Open · behnazeslami opened 7 months ago

behnazeslami commented 7 months ago

Hi, on my CentOS Linux machine I installed:

1. `! pip install --upgrade -q pyspark==3.4.1 spark-nlp==5.2.2`
2. `! pip install --upgrade spark-nlp-jsl==5.2.1 --user --extra-index-url https://pypi.johnsnowlabs.com/[secret_code]`

I checked the Java version with `java -version`:

```
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
```

In `~/.bashrc`, `JAVA_HOME` is set as:

```
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.332.b09-1.el7_9.x86_64
```
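Note that `java -version` reports JDK 11 while `JAVA_HOME` points at a JDK 8 install, so it is worth confirming which JVM PySpark actually launches (Spark prefers `$JAVA_HOME/bin/java` when `JAVA_HOME` is set). A quick sanity-check sketch, not from the original report:

```python
# Sanity-check sketch: compare the JVM on PATH with the one JAVA_HOME points
# to; Spark's launcher uses $JAVA_HOME/bin/java when JAVA_HOME is set.
import os
import subprocess

java_home = os.environ.get("JAVA_HOME")
print("JAVA_HOME =", java_home)
subprocess.run(["java", "-version"])  # the JVM on PATH
if java_home:
    # the JVM Spark will actually launch
    subprocess.run([os.path.join(java_home, "bin", "java"), "-version"])
```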

I am trying to run the following program:

```python
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import col
from pyspark.sql.functions import explode

from sparknlp.pretrained import PretrainedPipeline

import gc

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np

# %%

params = {"spark.driver.memory": "50G",
          "spark.kryoserializer.buffer.max": "2000M",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.driver.maxResultSize": "16G"}

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params, gpu=True)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

print(spark)
print("\n========================================================================")

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

chunk2doc = Chunk2Doc() \
    .setInputCols("ner_chunk") \
    .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_aux_concepts", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("snomed_code")\
    .setDistanceFunction("COSINE")\
    .setCaseSensitive(False)\
    .setUseAuxLabel(True)\
    .setNeighbours(10)

resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_umls_findings", "en", "clinical/models") \
    .setInputCols(["ner_chunk", "sbert_embeddings"]) \
    .setOutputCol("resolution")\
    .setDistanceFunction("EUCLIDEAN")

nlpPipeline = Pipeline(stages=[document, sentenceDetector, token, embeddings,
                               clinical_ner, ner_converter, clinical_assertion,
                               chunk2doc, sbert_embedder, snomed_resolver, resolver])

data = spark.createDataFrame([[""]]).toDF("text")

assertion_model = nlpPipeline.fit(data)
```
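Once `fit()` succeeds, the plan is to apply the model to clinical text, along these lines (a hypothetical usage sketch; the example sentence is invented):

```python
# Hypothetical usage once fit() succeeds; the sample sentence is made up.
sample = spark.createDataFrame(
    [["The patient denies chest pain but reports shortness of breath."]]
).toDF("text")

result = assertion_model.transform(sample)
result.select("ner_chunk.result", "assertion.result").show(truncate=False)
```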

However, I get the following error:

```
---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |   75  |   0   |   0   |   3   ||   72  |   0   |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-b59223ac-26d8-44de-a4c3-d05a558c3faf
	confs: [default]
	0 artifacts copied, 72 already retrieved (0kB/31ms)
24/03/06 20:34:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark NLP Version : 5.2.2
Spark NLP_JSL Version : 5.2.1
<pyspark.sql.session.SparkSession object at 0x7f5597073190>

========================================================================
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[ | ]biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[ / ]Download done! Loading the resource.
An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:208)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:287)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1441)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1435)
	at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1476)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1476)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
	at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:31)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:513)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:505)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:705)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
	... 40 more
[OK!]
Traceback (most recent call last):
  File "/data/beslami/sample_loaded_models.py", line 75, in <module>
    embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/annotator/embeddings/bert_embeddings.py", line 206, in pretrained
    return ResourceDownloader.downloadModel(BertEmbeddings, name, lang, remote_loc)
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/pretrained/resource_downloader.py", line 99, in downloadModel
    raise e
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/pretrained/resource_downloader.py", line 96, in downloadModel
    j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/__init__.py", line 352, in __init__
    super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel", reader,
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/extended_java_wrapper.py", line 27, in __init__
    self._java_obj = self.new_java_obj(java_obj, *args)
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/extended_java_wrapper.py", line 37, in new_java_obj
    return self._new_java_obj(java_class, *args)
  File "/home/beslami/.local/lib/python3.9/site-packages/pyspark/ml/wrapper.py", line 86, in _new_java_obj
    return java_obj(*java_args)
  File "/home/beslami/.local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/home/beslami/.local/lib/python3.9/site-packages/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
  File "/home/beslami/.local/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:208)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:287)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1441)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1435)
	at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1476)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1476)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
	at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:31)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:513)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:505)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:705)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
	... 40 more
```
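The traceback shows the failure happens inside the very first `pretrained()` call, before any pipeline logic runs, so a minimal sketch that reproduces just that step (assuming the same environment, versions, and license key as the full script) would be:

```python
# Minimal reproduction sketch: only the public biobert embeddings download,
# which is the first step that fails in the full script above.
# license_keys is assumed to be loaded from the license JSON, as in that script.
import sparknlp_jsl
from sparknlp.annotator import BertEmbeddings

spark = sparknlp_jsl.start(license_keys['SECRET'])
embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
```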

maziyarpanahi commented 7 months ago

I transferred this issue because it involves licensed annotators, so we cannot reproduce it with the open-source library alone. (I suspect the home directory does not have the right permissions to download and extract the models, or it is not reachable; checking the /home/beslami/cache_pretrained/ path and its permissions might help.)
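In concrete terms, that check could look like the following sketch (the cache path and model folder name are taken from the stack trace above):

```python
# Sketch of the suggested diagnosis: verify the default cache folder exists
# and is writable, and clear any partially extracted model so the next
# pretrained() call re-downloads it from scratch.
import os
import shutil

cache = os.path.expanduser("~/cache_pretrained")
model_dir = os.path.join(cache, "biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996")

print("cache exists:  ", os.path.isdir(cache))
print("cache writable:", os.access(cache, os.W_OK))

# The error points at a missing metadata/ subfolder, which usually means an
# interrupted extraction; removing the folder forces a clean re-download.
if os.path.isdir(model_dir) and not os.path.isdir(os.path.join(model_dir, "metadata")):
    shutil.rmtree(model_dir)
```

If the home directory itself is the problem, the cache can also be moved to a writable location by adding `"spark.jsl.settings.pretrained.cache_folder": "/some/writable/path"` (the path here is a placeholder) to the `params` dict passed to `sparknlp_jsl.start`.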