JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

TensorflowWrapper.scala fails to load ClassifierDLApproach #6303

Closed: kgoderis closed this issue 2 years ago

kgoderis commented 3 years ago

Description

A java.util.NoSuchElementException is thrown from TensorflowWrapper$.readZippedSavedModel when it is called as part of ClassifierDLApproach.loadSavedModel.

Stack trace:

    at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
    at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
    at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
    at scala.collection.IterableLike.head(IterableLike.scala:109)
    at scala.collection.IterableLike.head$(IterableLike.scala:108)
    at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:49)
    at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
    at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
    at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:49)
    at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.readZippedSavedModel(TensorflowWrapper.scala:506)
    at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach.loadSavedModel(ClassifierDLApproach.scala:410)
    at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach.train(ClassifierDLApproach.scala:346)
    at com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach.train(ClassifierDLApproach.scala:98)
    at com.johnsnowlabs.nlp.AnnotatorApproach._fit(AnnotatorApproach.scala:69)
    at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:75)
    at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151)
    at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
    at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
    at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
    at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147)
    at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
    at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
    at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
    at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133)
    at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)

This seems to happen in https://github.com/JohnSnowLabs/spark-nlp/blob/340fe8068fae9a83130871f31633109f5fda8e70/src/main/scala/com/johnsnowlabs/ml/tensorflow/TensorflowWrapper.scala#L510, which is called from https://github.com/JohnSnowLabs/spark-nlp/blob/340fe8068fae9a83130871f31633109f5fda8e70/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/ClassifierDLApproach.scala#L426 with /classifier-dl as the root folder passed to readZippedSavedModel.

/classifier-dl does not exist on the target machine, and the user running the JVM is not root.
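
Judging by the stack trace, readZippedSavedModel collects candidate resources into a buffer and takes its head, so an empty lookup turns into this exception. A rough sketch of my assumption (the class below is mine, not actual Spark NLP code):

    import java.io.InputStream;

    // Sketch of my assumption: if the classloader cannot see /classifier-dl
    // (for example from inside a repackaged Spring Boot fat jar), the wrapper's
    // list of matches comes back empty and taking its head throws
    // java.util.NoSuchElementException.
    public class ClasspathCheck {
        public static void main(String[] args) {
            InputStream in = ClasspathCheck.class
                    .getResourceAsStream("/classifier-dl"); // null when not visible
            System.out.println(in == null
                    ? "/classifier-dl is NOT visible on the classpath"
                    : "/classifier-dl resolved from the classpath");
        }
    }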

Expected Behavior

The model should be loaded.

Current Behavior

A java.util.NoSuchElementException is thrown instead.

Possible Solution

Steps to Reproduce

  1. Create a "blank" Google Compute cloud instance with Ubuntu 20.04 focal distro
  2. apt-get install -y --no-install-recommends git openjdk-8-jdk maven
  3. git clone .... - build and deploy a jar that calls model = pipeline.fit(dataset), where pipeline = new Pipeline().setStages(new PipelineStage[] { getDocumentAssembler(), getTokenizer(), getEncoder(), getEmbedder(), getClassifier() }) and getClassifier() returns a new ClassifierDLApproach()
  4. java -jar /home/some_user/some_target/some.jar

Your Environment

VM settings: Max. Heap Size (Estimated): 2.88G; Using VM: OpenJDK 64-Bit Server VM

spark.jars.packages : com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.1

Linux test 5.11.0-1020-gcp #22~20.04.1-Ubuntu SMP Tue Sep 21 10:54:26 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

maziyarpanahi commented 3 years ago

Hi,

Could you please elaborate on step 3 (git clone .... - build/deploy a jar)?

/classifier-dl does not exist on the target machine, and the user running the JVM is not root

This is not really a root directory; it is the root of the jar. Whether it comes in via spark.jars.packages or spark.jars, it will look inside our own jar, not outside. I think once I understand step 3, i.e. how you are actually using Spark NLP and whether or not it is a dependency of another project, we can figure this out.
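
If you want to double-check that the pretrained graphs really sit at the root of our jar, you can list the jar entries directly. A quick diagnostic sketch (the jar path is illustrative; point it at the resolved spark-nlp artifact):

    import java.util.Enumeration;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;

    // Diagnostic sketch: print every classifier-dl entry bundled in the jar.
    public class JarEntryCheck {
        public static void main(String[] args) throws Exception {
            try (JarFile jar = new JarFile("/path/to/spark-nlp_2.12-3.2.1.jar")) {
                Enumeration<JarEntry> entries = jar.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName();
                    if (name.startsWith("classifier-dl")) {
                        System.out.println(name);
                    }
                }
            }
        }
    }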

kgoderis commented 3 years ago

It is a Spring Boot application.

The pipeline is built using

    @Override
    protected PipelineStage getDocumentAssembler() {
        return (DocumentAssembler) new DocumentAssembler().setInputCol("text").setOutputCol("document");
    }

    @Override
    protected PipelineStage getTokenizer() {
        return (Tokenizer) ((Tokenizer) new Tokenizer().setInputCols(new String[] { "document" }))
                .setOutputCol("token");
    }

    @Override
    protected PipelineStage getEncoder() {
        return (BertEmbeddings) ((BertEmbeddings) BertEmbeddings.pretrained("bert_base_uncased", "en")
                .setInputCols(new String[] { "document", "token" })).setOutputCol("embeddings");
    }

    @Override
    protected PipelineStage getEmbedder() {
        return ((SentenceEmbeddings) ((SentenceEmbeddings) new SentenceEmbeddings()
                .setInputCols(new String[] { "document", "embeddings" })).setOutputCol("sentence_embeddings"))
                        .setPoolingStrategy("AVERAGE");
    }

    @Override
    protected PipelineStage getClassifier() {
        boolean enableOutputLogs = false;
        if (logger.isDebugEnabled()) {
            enableOutputLogs = true;
        }

        return ((ClassifierDLApproach) ((ClassifierDLApproach) new ClassifierDLApproach()
                .setInputCols(new String[] { "sentence_embeddings" })).setOutputCol("category")).setLabelColumn("label")
                        .setMaxEpochs(epochs).setLr(learningRate).setBatchSize(batchSize)
                        .setEnableOutputLogs(enableOutputLogs).setValidationSplit((float) 0.3)
                        .setVerbose(com.johnsnowlabs.nlp.annotators.ner.Verbose.All());
    }

Spark is started as

        try {
            Builder builder = SparkSession.builder().appName("Spark NLP").master("local[*]")
                    .config("spark.driver.memory",
                            Utils.humanReadableByteCountSI(Runtime.getRuntime().maxMemory(), true))
                    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                    .config("spark.kryoserializer.buffer.max", "2000M").config("spark.driver.maxResultSize", "0")
                    .config("spark.sql.broadcastTimeout", "36000");

            if (useGPU) {
                builder.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.1");
            } else {
                builder.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.1");
            }

            spark = builder.getOrCreate();
            logger.info("Spark {} started", spark.version());

        } catch (Exception e) {
            Utils.logStacktrace(logger, e);
        }
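
One diagnostic that might help narrow this down: log which physical jar ClassifierDLApproach is actually loaded from, to see whether it comes from the Spring Boot fat jar (a nested BOOT-INF/lib entry) or from the artifact fetched via spark.jars.packages. A sketch, reusing the logger from above:

        // Sketch: report the code source of ClassifierDLApproach. A URL of the
        // form "jar:file:...!/BOOT-INF/lib/..." would implicate the Spring Boot
        // nested-jar classloader rather than a flat jar on the classpath.
        // (getCodeSource() can return null for some classloaders.)
        java.net.URL location = com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach.class
                .getProtectionDomain().getCodeSource().getLocation();
        logger.info("spark-nlp loaded from {}", location);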

and finally, pom.xml contains

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.johnsnowlabs.nlp</groupId>
            <artifactId>spark-nlp_2.12</artifactId>
            <version>${spark-nlp.version}</version>
        </dependency>
        <dependency>
            <groupId>com.johnsnowlabs.nlp</groupId>
            <artifactId>spark-nlp-gpu_2.12</artifactId>
            <version>${spark-nlp.version}</version>
        </dependency>

and

    <properties>
        <spark.version>3.1.2</spark.version>
        <spark-nlp.version>3.2.1</spark-nlp.version>
    </properties>
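
Side note on the dependencies above: spark-nlp_2.12 and spark-nlp-gpu_2.12 contain the same classes (the GPU artifact differs in the TensorFlow binaries it bundles), so having both on the compile classpath leaves it to classpath ordering which one actually wins. One way to keep only one of them at a time is a Maven profile; a sketch with illustrative profile ids:

    <profiles>
        <profile>
            <id>gpu</id>
            <dependencies>
                <dependency>
                    <groupId>com.johnsnowlabs.nlp</groupId>
                    <artifactId>spark-nlp-gpu_2.12</artifactId>
                    <version>${spark-nlp.version}</version>
                </dependency>
            </dependencies>
        </profile>
        <profile>
            <id>cpu</id>
            <activation><activeByDefault>true</activeByDefault></activation>
            <dependencies>
                <dependency>
                    <groupId>com.johnsnowlabs.nlp</groupId>
                    <artifactId>spark-nlp_2.12</artifactId>
                    <version>${spark-nlp.version}</version>
                </dependency>
            </dependencies>
        </profile>
    </profiles>
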
maziyarpanahi commented 3 years ago

Thanks @kgoderis. I have personally tested that annotator in almost all situations, including on GCP, even with Spark NLP as a dependency of another SBT project. However, I am not sure I have ever tested it from Java like this; let me assign this to someone to test and reproduce it locally first, and I will get back to you.

kgoderis commented 3 years ago

@maziyarpanahi FYI, it works locally on my dev machine. It also works when running inside a Docker image (but on GCP those run on an optimised Debian distribution, not an Ubuntu-based one, which is what I tested).

maziyarpanahi commented 3 years ago

Thanks @kgoderis. We will try to reproduce this in the same environment, first on local Debian and then on GCP (if you could share how you configure/start your GCP instance so we can be sure we have the same environment, that would be great).

kgoderis commented 3 years ago

@maziyarpanahi I have thrown away the test instance, but it was plain vanilla Ubuntu 20.04, with nothing fancy and no non-standard configuration. The Docker-based test was set up the same way via the GCP Control Panel, apart from adding the container image; there I changed the underlying image to the "stable" one from the list shown in the Control Panel, i.e. I avoided the "dev" image release.

maziyarpanahi commented 3 years ago

Thanks @kgoderis. So the only setup that shows this issue is GCP on Debian stable. We will try to see why, and whether this is about Debian, GCP, or something else.

danilojsl commented 3 years ago

Hi @kgoderis,

I created a small Spring Boot app that trains a ClassifierDL model to replicate the error. I tested it on Ubuntu 20 and Debian 11, and it works. I also containerized the app with Docker and tested it on Ubuntu 20, Debian 11, and GCP General-Purpose and Compute-Optimised machines (Debian buster), and it works, as you can see in the screenshot below.

[screenshot: GCP test run]

I am not sure how to configure the underlying image as "stable". In the GCP Control Panel, I only found options for buster, bullseye, and stretch. Could you please elaborate on how to configure it as stable? [screenshot: boot disk]

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 120 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.