**behnazeslami** opened this issue 7 months ago (status: Open)
I transferred this issue because it uses licensed annotators, so we cannot reproduce it with the open-source library. (I suspect the home directory does not have the right permissions to download/extract the models, or is not reachable; checking the `/home/beslami/cache_pretrained/` path and its permissions might help.)
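To act on that suggestion, here is a minimal sketch of the permission check (the `~/cache_pretrained` path is taken from the error in this issue; the helper name is mine, adjust the path if your cache lives elsewhere):

```python
import os

def check_cache_dir(cache_dir):
    """Report whether the pretrained-model cache directory is usable.

    Spark NLP needs the directory to exist and be readable, writable,
    and traversable in order to download and extract models into it.
    """
    return {
        "exists": os.path.isdir(cache_dir),
        "readable": os.access(cache_dir, os.R_OK),
        "writable": os.access(cache_dir, os.W_OK),
        "traversable": os.access(cache_dir, os.X_OK),
    }

print(check_cache_dir(os.path.expanduser("~/cache_pretrained")))
```

If any of these come back `False`, fixing ownership/permissions on the directory (or pointing the cache somewhere writable) would be the first thing to try.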
Hi, on my CentOS Linux machine I installed:

```shell
pip install --upgrade -q pyspark==3.4.1 spark-nlp==5.2.2
pip install --upgrade spark-nlp-jsl==5.2.1 --user --extra-index-url https://pypi.johnsnowlabs.com/[secret_code]
```

I checked the Java version with `java -version`:

```
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
```

In `~/.bashrc`, `JAVA_HOME` is set to:

```
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.332.b09-1.el7_9.x86_64
```
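Note that `java -version` on the PATH reports JDK 11 while `JAVA_HOME` points at a 1.8 install; PySpark launches its JVM from `JAVA_HOME` when that variable is set, so it is worth confirming which JVM Spark actually gets. A small sketch to compare the two (the helper names are mine):

```python
import os
import re
import subprocess

def parse_java_version(version_output):
    """Extract the quoted version string from `java -version` output."""
    m = re.search(r'version "([^"]+)"', version_output)
    return m.group(1) if m else None

def jvm_version(java_bin):
    # `java -version` writes to stderr, not stdout
    try:
        out = subprocess.run([java_bin, "-version"], capture_output=True, text=True)
    except OSError:
        return None
    return parse_java_version(out.stderr)

java_home = os.environ.get("JAVA_HOME")
print("java on PATH  :", jvm_version("java"))
print("JAVA_HOME java:",
      jvm_version(os.path.join(java_home, "bin", "java")) if java_home else None)
```

If the two versions differ, pointing `JAVA_HOME` at the same JDK 11 install that is on the PATH removes one variable from the debugging.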
I am trying to run the following program:

```python
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import col
from pyspark.sql.functions import explode

from sparknlp.pretrained import PretrainedPipeline

import gc
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)
import string
import numpy as np

params = {"spark.driver.memory": "50G",
          "spark.kryoserializer.buffer.max": "2000M",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.driver.maxResultSize": "16G"}

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params, gpu=True)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())
print(spark)
print("\n========================================================================")

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

chunk2doc = Chunk2Doc() \
    .setInputCols("ner_chunk") \
    .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_aux_concepts", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("snomed_code")\
    .setDistanceFunction("COSINE")\
    .setCaseSensitive(False)\
    .setUseAuxLabel(True)\
    .setNeighbours(10)

resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_umls_findings", "en", "clinical/models") \
    .setInputCols(["ner_chunk", "sbert_embeddings"]) \
    .setOutputCol("resolution")\
    .setDistanceFunction("EUCLIDEAN")

nlpPipeline = Pipeline(stages=[document, sentenceDetector, token, embeddings,
                               clinical_ner, ner_converter, clinical_assertion,
                               chunk2doc, sbert_embedder, snomed_resolver, resolver])

data = spark.createDataFrame([[""]]).toDF("text")
assertion_model = nlpPipeline.fit(data)
```
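If the default cache location turns out not to be writable, one workaround sketch is to point the pretrained cache at a directory the job can write to. Spark NLP exposes this as the `spark.jsl.settings.pretrained.cache_folder` setting (verify the key against the version you run); `/data/beslami/cache_pretrained` below is a hypothetical writable path:

```python
params = {
    "spark.driver.memory": "50G",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.driver.maxResultSize": "16G",
    # hypothetical writable location; any directory the Spark driver can
    # create and delete files in will do
    "spark.jsl.settings.pretrained.cache_folder": "/data/beslami/cache_pretrained",
}
spark = sparknlp_jsl.start(license_keys['SECRET'], params=params, gpu=True)
```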
However, I get the following error:
```
:: retrieving :: org.apache.spark#spark-submit-parent-b59223ac-26d8-44de-a4c3-d05a558c3faf
	confs: [default]
	0 artifacts copied, 72 already retrieved (0kB/31ms)
24/03/06 20:34:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark NLP Version : 5.2.2
Spark NLP_JSL Version : 5.2.1
<pyspark.sql.session.SparkSession object at 0x7f5597073190>

========================================================================
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
Download done! Loading the resource.
An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
[OK!]
Traceback (most recent call last):
  File "/data/beslami/sample_loaded_models.py", line 75, in <module>
    embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/annotator/embeddings/bert_embeddings.py", line 206, in pretrained
    return ResourceDownloader.downloadModel(BertEmbeddings, name, lang, remote_loc)
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/pretrained/resource_downloader.py", line 99, in downloadModel
    raise e
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/pretrained/resource_downloader.py", line 96, in downloadModel
    j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/__init__.py", line 352, in __init__
    super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel", reader,
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/extended_java_wrapper.py", line 27, in __init__
    self._java_obj = self.new_java_obj(java_obj, *args)
  File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/extended_java_wrapper.py", line 37, in new_java_obj
    return self._new_java_obj(java_class, *args)
  File "/home/beslami/.local/lib/python3.9/site-packages/pyspark/ml/wrapper.py", line 86, in _new_java_obj
    return java_obj(*java_args)
  File "/home/beslami/.local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/home/beslami/.local/lib/python3.9/site-packages/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
  File "/home/beslami/.local/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:208)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:287)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1441)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1435)
	at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1476)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1476)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
	at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:31)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:513)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:505)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:705)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
	... 40 more
```
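The trace shows the model folder exists in the cache but its `metadata` subfolder does not. One common cause is a previous download that was interrupted mid-extraction (or a directory the process cannot read), after which the downloader tries to load the half-extracted copy instead of re-fetching it. A hedged workaround sketch (the path is copied from the error above; the helper name is mine): delete the incomplete folder and call `.pretrained(...)` again.

```python
import os
import shutil

def remove_incomplete_model(model_dir):
    """Delete a cached model folder that lacks its metadata/ subdir.

    Such a folder can be left behind when a download or extraction is
    interrupted, and Spark NLP then fails to load it instead of
    re-downloading. Returns True if the folder was removed.
    """
    if os.path.isdir(model_dir) and not os.path.isdir(os.path.join(model_dir, "metadata")):
        shutil.rmtree(model_dir)
        return True
    return False

# Path taken from the error message above:
broken = os.path.expanduser(
    "~/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996")
print("removed:", remove_incomplete_model(broken))
```

After removing the folder, re-running the pipeline should trigger a fresh download; if it fails the same way again, that points back at permissions or disk space on the cache directory.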