Closed — kgoderis closed this issue 2 years ago
Hi,
Could you please elaborate on step 2 (git clone .... - build/deploy a jar somewhere)?
/classifier-dl does not exist on the target machine, and the user running the JVM is not root
This is not really a root directory; it's the root of the jar. Whether it's spark.jars.packages
or spark.jars,
it will look inside our own jar, not outside. I think once I understand step 2 (how you are actually using Spark NLP, and whether or not it's a dependency of another project) we can figure this out.
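In other words, a leading `/` in a path like `/classifier-dl` is resolved against the classpath (i.e. inside the jar), not against the filesystem root. A minimal sketch illustrating the difference (illustrative only, not Spark NLP's actual loading code):

```java
import java.net.URL;

public class ResourceRootDemo {
    public static void main(String[] args) {
        // A leading "/" in Class.getResource means "root of the classpath/jar",
        // not the filesystem root directory.
        URL knownResource = String.class.getResource("/java/lang/String.class");
        System.out.println(knownResource != null); // class files are always resolvable

        // If no jar on the classpath contains a /classifier-dl entry, this is null,
        // even if a /classifier-dl directory existed on the filesystem.
        URL modelRoot = ResourceRootDemo.class.getResource("/classifier-dl");
        System.out.println(modelRoot == null);
    }
}
```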
It is a Spring Boot application.
The pipeline is built using
```java
@Override
protected PipelineStage getDocumentAssembler() {
    return (DocumentAssembler) new DocumentAssembler().setInputCol("text").setOutputCol("document");
}

@Override
protected PipelineStage getTokenizer() {
    return (Tokenizer) ((Tokenizer) new Tokenizer().setInputCols(new String[] { "document" }))
            .setOutputCol("token");
}

@Override
protected PipelineStage getEncoder() {
    return (BertEmbeddings) ((BertEmbeddings) BertEmbeddings.pretrained("bert_base_uncased", "en")
            .setInputCols(new String[] { "document", "token" })).setOutputCol("embeddings");
}

@Override
protected PipelineStage getEmbedder() {
    return ((SentenceEmbeddings) ((SentenceEmbeddings) new SentenceEmbeddings()
            .setInputCols(new String[] { "document", "embeddings" })).setOutputCol("sentence_embeddings"))
            .setPoolingStrategy("AVERAGE");
}

@Override
protected PipelineStage getClassifier() {
    boolean enableOutputLogs = logger.isDebugEnabled();
    return ((ClassifierDLApproach) ((ClassifierDLApproach) new ClassifierDLApproach()
            .setInputCols(new String[] { "sentence_embeddings" })).setOutputCol("category"))
            .setLabelColumn("label")
            .setMaxEpochs(epochs).setLr(learningRate).setBatchSize(batchSize)
            .setEnableOutputLogs(enableOutputLogs).setValidationSplit(0.3f)
            .setVerbose(com.johnsnowlabs.nlp.annotators.ner.Verbose.All());
}
```
Spark is started as
```java
try {
    Builder builder = SparkSession.builder().appName("Spark NLP").master("local[*]")
            .config("spark.driver.memory",
                    Utils.humanReadableByteCountSI(Runtime.getRuntime().maxMemory(), true))
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .config("spark.kryoserializer.buffer.max", "2000M")
            .config("spark.driver.maxResultSize", "0")
            .config("spark.sql.broadcastTimeout", "36000");
    if (useGPU) {
        builder.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.1");
    } else {
        builder.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.1");
    }
    spark = builder.getOrCreate();
    logger.info("Spark {} started", spark.version());
} catch (Exception e) {
    Utils.logStacktrace(logger, e);
}
```
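`Utils.humanReadableByteCountSI` is a project helper whose source is not shown here; a plausible sketch of such a method (a hypothetical implementation, assuming it follows the common SI byte-count formatter) would be:

```java
import java.util.Locale;

public class ByteCount {
    // Hypothetical helper: formats a byte count using SI (powers of 1000)
    // or binary (powers of 1024) unit prefixes.
    static String humanReadableByteCountSI(long bytes, boolean si) {
        int unit = si ? 1000 : 1024;
        if (bytes < unit) {
            return bytes + " B";
        }
        // Exponent of the largest unit that fits, e.g. 3 -> "G" for SI
        int exp = (int) (Math.log(bytes) / Math.log(unit));
        String prefix = (si ? "kMGTPE" : "KMGTPE").charAt(exp - 1) + (si ? "" : "i");
        return String.format(Locale.ROOT, "%.1f %sB", bytes / Math.pow(unit, exp), prefix);
    }

    public static void main(String[] args) {
        System.out.println(humanReadableByteCountSI(2_880_000_000L, true)); // "2.9 GB"
    }
}
```

Note that Spark's spark.driver.memory normally expects strings such as "2g" or "512m", so whatever the real helper returns would need to match that syntax; this is only a sketch of its likely behavior.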
and finally, pom.xml contains
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>${spark-nlp.version}</version>
</dependency>
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>${spark-nlp.version}</version>
</dependency>
```
and
```xml
<properties>
    <spark.version>3.1.2</spark.version>
    <spark-nlp.version>3.2.1</spark-nlp.version>
</properties>
```
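Incidentally, declaring both spark-nlp_2.12 and spark-nlp-gpu_2.12 as dependencies puts both jars on the classpath regardless of the spark.jars.packages setting, which makes it hard to know which artifact is actually resolved. One way to keep only one of them at a time (a sketch using standard Maven profiles, not something from this project's actual POM) would be:

```xml
<profiles>
    <!-- CPU build is active unless another profile is selected with -P -->
    <profile>
        <id>cpu</id>
        <activation>
            <activeByDefault>true</activeByDefault>
        </activation>
        <dependencies>
            <dependency>
                <groupId>com.johnsnowlabs.nlp</groupId>
                <artifactId>spark-nlp_2.12</artifactId>
                <version>${spark-nlp.version}</version>
            </dependency>
        </dependencies>
    </profile>
    <!-- GPU build: mvn package -Pgpu -->
    <profile>
        <id>gpu</id>
        <dependencies>
            <dependency>
                <groupId>com.johnsnowlabs.nlp</groupId>
                <artifactId>spark-nlp-gpu_2.12</artifactId>
                <version>${spark-nlp.version}</version>
            </dependency>
        </dependencies>
    </profile>
</profiles>
```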
Thanks @kgoderis. I have personally tested that annotator in almost all situations, including on GCP, even when Spark NLP is a dependency of another SBT project. However, I am not sure I've ever tested it in Java like this; let me assign this to someone to test and reproduce it locally first, and I'll get back to you.
@maziyarpanahi FYI, it works locally on my dev machine. It also works when running inside a Docker image (but on GCP those run on an optimised Debian distribution, not an Ubuntu based one, which is what I tested)
Thanks @kgoderis. We will try to reproduce it in the same environment, first on local Debian and then on GCP. (If you can share how you configure/start your GCP instance, so we can be sure we have the same environment, that would be great.)
@maziyarpanahi I have tossed away the test instance, but it is plain vanilla Ubuntu 20.04, nothing fancy, no non-standard configuration. The Docker-based test was done in the same way via the GCP Control Panel, except for adding the container image, but there I changed the underlying image to the "stable" one from the list shown on the Control Panel, i.e. I avoided the "dev" image release.
Thanks @kgoderis. So the only setup that has this issue is GCP on Debian stable. We will try to see why, and whether this is about Debian, GCP, or something else.
Hi @kgoderis,
I created a small Spring Boot app that trains a ClassifierDL model to replicate the error. I tested it on Ubuntu 20 and Debian 11, and it is working. I also containerized the app with Docker and tested it under Ubuntu 20, Debian 11, and GCP General-Purpose and Compute-Optimised machines (Debian buster), and it works, as you can see in the screenshot below.
I'm not sure how to configure the underlying image to "stable" in the GCP Control Panel; I only found options for buster, bullseye, and stretch. Could you please elaborate on how to configure it as stable?
This issue is stale because it has been open 120 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Description
java.util.NoSuchElementException is thrown when doing TensorflowWrapper$.readZippedSavedModel as part of ClassifierDLApproach.loadSavedModel.
StackTrace:
This seems to happen in https://github.com/JohnSnowLabs/spark-nlp/blob/340fe8068fae9a83130871f31633109f5fda8e70/src/main/scala/com/johnsnowlabs/ml/tensorflow/TensorflowWrapper.scala#L510, which is called from https://github.com/JohnSnowLabs/spark-nlp/blob/340fe8068fae9a83130871f31633109f5fda8e70/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/ClassifierDLApproach.scala#L426, using /classifier-dl as the root directory in the call to readZippedSavedModel.
/classifier-dl does not exist on the target machine, and the user running the JVM is not root
Expected Behavior
The model should be loaded
Current Behavior
Exception is thrown
Possible Solution
Steps to Reproduce
Your Environment
VM settings:
Max. Heap Size (Estimated): 2.88G
Using VM: OpenJDK 64-Bit Server VM
spark.jars.packages : com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.1
Linux test 5.11.0-1020-gcp #22~20.04.1-Ubuntu SMP Tue Sep 21 10:54:26 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux