deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

NumberFormatException: Cannot parse null string when loading inside of Docker #3023

Closed DenisNovac closed 3 months ago

DenisNovac commented 7 months ago

Description

I am trying to package a pre-trained PyTorch model inside a Docker container together with an app written in Scala (model loader code). The app works fine when run normally, but in Docker I get the following error. It looks like DJL can find the model but cannot read it. The same container runs fine when I use only DJL-provided models (such as MXNet's vgg16).

If I pass a wrong model path, loading fails with "No model with the specified URI or the matching Input/Output type is found.", so DJL does see the model when the name is correct.

Expected Behavior

Model is loaded successfully

Error Message

recognizer1  | java.lang.NumberFormatException: Cannot parse null string
recognizer1  |  at java.base/java.lang.Integer.parseInt(Integer.java:630)
recognizer1  |  at java.base/java.lang.Integer.parseInt(Integer.java:786)
recognizer1  |  at ai.djl.mxnet.zoo.nlp.embedding.GloveWordEmbeddingBlockFactory.newBlock(GloveWordEmbeddingBlockFactory.java:42)
recognizer1  |  at ai.djl.repository.zoo.BaseModelLoader.createModel(BaseModelLoader.java:202)
recognizer1  |  at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:159)
recognizer1  |  at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:172)

How to Reproduce?

I can upload the image, but that seems pretty useless. The error does not happen when running on my local machine.

Steps to reproduce


  1. Pack the pre-trained custom model together with the loading app into a Docker container.
  2. Try to run it.

Environment Info

OS: Fedora 39
CPU: Core i7-7700HQ x86-64
JDK: tried both 11 and 17
Docker: 25.0.3

Here are my DJL dependencies:

// version "0.26.0"
val djl = Seq(
  "ai.djl"         % "api"               % Versions.djl,
  // MXNet is used in object detection for the embedded vgg16
  "ai.djl.mxnet"   % "mxnet-model-zoo"   % Versions.djl,
  "ai.djl.mxnet"   % "mxnet-engine"      % Versions.djl,
  // PyTorch for NSFW detection
  "ai.djl.pytorch" % "pytorch-engine"    % Versions.djl,
  "ai.djl.pytorch" % "pytorch-model-zoo" % Versions.djl
)

And Dockerfile:

FROM eclipse-temurin:17.0.6_10-jre-jammy

WORKDIR /opt/app

COPY ./target/scala-2.13/image-hosting-processing-recognizer-assembly-0.1.0-SNAPSHOT.jar ./
COPY synset.txt ./
COPY nsfw_model.pt ./

ENTRYPOINT ["java", "-cp", "image-hosting-processing-recognizer-assembly-0.1.0-SNAPSHOT.jar", "com.github.baklanovsoft.imagehosting.recognizer.Main"]

DenisNovac commented 7 months ago

Oh, I guess I found the solution. I made these changes to the Dockerfile:

FROM eclipse-temurin:17-jre-jammy

WORKDIR /opt/app

COPY ./target/scala-2.13/image-hosting-processing-recognizer-assembly-0.1.0-SNAPSHOT.jar ./app.jar
RUN mkdir /opt/app/nsfw

ENTRYPOINT ["java", "-cp", "app.jar", "com.github.baklanovsoft.imagehosting.recognizer.Main"]

And I started mounting the model in compose:

volumes:
  - recognizer1-djl-cache:/root/.djl.ai
  - "./recognizer/synset.txt:/opt/app/nsfw/synset.txt"
  - "./recognizer/nsfw_model.pt:/opt/app/nsfw/nsfw_model.pt"

But I guess the main reason is the subfolder: previously I was putting the model in the same folder as the .jar file. I am not sure if this is a bug, but feel free to close this ticket if it is expected behavior.
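
For anyone hitting the same thing, here is a minimal sketch of a startup check that could run before the model is loaded. It only confirms that the dedicated model folder contains the files mounted above, so a bad mount fails with a clear message instead of a parse error deep inside the loader. `ModelDirCheck` and `missingFiles` are hypothetical names of my own and not part of DJL; the file names and the /opt/app/nsfw path come from the compose snippet, and the check uses only java.nio.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of DJL: checks that the model directory
// holds the expected files before it is handed to a model loader.
object ModelDirCheck {
  // File names taken from the compose volume mounts in this thread.
  val expected: Seq[String] = Seq("nsfw_model.pt", "synset.txt")

  /** Returns the expected file names that are missing from `dir`.
    * If `dir` is not a directory at all, every name is reported as missing.
    */
  def missingFiles(dir: Path, names: Seq[String] = expected): Seq[String] =
    if (!Files.isDirectory(dir)) names
    else names.filterNot(n => Files.isRegularFile(dir.resolve(n)))
}
```

At startup this could be called as `ModelDirCheck.missingFiles(java.nio.file.Paths.get("/opt/app/nsfw"))`, aborting with the returned names if the result is non-empty. It does not explain the NumberFormatException itself; it only makes a misplaced or unmounted model easier to spot.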