deepjavalibrary / djl-demo

Demo applications showcasing DJL
https://demo.djl.ai
Apache License 2.0

Caused by: java.lang.UnsatisfiedLinkError: #183

Closed (CensorKo closed this issue 2 years ago)

CensorKo commented 2 years ago

@zachgk @frankfliu @lanking520 @stu1130 @roywei

We are deploying a YOLOv5 TorchScript model on an AWS Inferentia instance, but DJL cannot load libneuron_op.so on startup.

libneuron_op.so exists on the instance, and the PYTORCH_EXTRA_LIBRARY_PATH environment variable is set.

Caused by: java.lang.UnsatisfiedLinkError: /home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/lib/libneuron_op.so: /home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/lib/libneuron_op.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSs
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)
    at java.lang.Runtime.load0(Runtime.java:810)
    at java.lang.System.load(System.java:1088)
    at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:72)
    ... 44 more

CensorKo commented 2 years ago

Another question: it seems that libneuron_op.so was removed in torch-neuron==1.8.1 because neuron-rtd was removed. So how do we use DJL with aws-neuron-dkms on Neuron Runtime 2.x (libnrt.so)? Do you have any samples for it?

Confusing...

frankfliu commented 2 years ago

@CensorKao a few things you need to check:

  1. The example currently only works with DJL 0.12.0 and torch-neuron 1.8.1.
  2. You have to use the PyTorch precxx11 version: https://github.com/deepjavalibrary/djl-demo/blob/master/aws/inferentia/build.gradle#L21
  3. You have to install Neuron SDK <= 1.15 and use the old Neuron runtime.

We are working on 0.14.0 to make DJL work with the 1.16.0 Neuron SDK. If you want, you can try our 0.14.0-SNAPSHOT version. Documentation is still WIP.
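The precxx11 requirement in item 2 can be read off the missing symbol itself. In the Itanium C++ name mangling used by GCC, `Ss` is the abbreviation for the old `std::string`, whereas a library built against the newer libstdc++ ABI spells the type out with a `__cxx11` namespace. The sketch below is a rough heuristic only, not a demangler; the function name `guess_string_abi` is hypothetical, and a plain substring match can misfire on symbols that happen to contain `Ss` elsewhere.

```python
def guess_string_abi(mangled_symbol):
    """Heuristically guess which libstdc++ std::string ABI a symbol refers to.

    Returns "cxx11", "pre-cxx11", or "unknown". This is a substring check,
    not a real demangler, so treat the answer as a hint only.
    """
    if "__cxx11" in mangled_symbol:
        # New ABI: std::__cxx11::basic_string<...> is spelled out in full.
        return "cxx11"
    if "Ss" in mangled_symbol:
        # Old ABI: "Ss" is the mangling abbreviation for std::string.
        return "pre-cxx11"
    return "unknown"


if __name__ == "__main__":
    # The symbol reported in the UnsatisfiedLinkError above.
    missing = "_ZN5torch3jit17parseSchemaOrNameERKSs"
    print(missing, "->", guess_string_abi(missing))
```

For the symbol in the stack trace this reports the old ABI, which suggests libneuron_op.so expects a libtorch built without the cxx11 ABI, consistent with the precxx11 setting in build.gradle.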

CensorKo commented 2 years ago

Thanks, but how do I check the Neuron SDK version? I have gone through the documentation and can only find the neuron-rtd version: 1.5.0.0. Should I assume that neuron-rtd 1.5.0.0 corresponds to Neuron SDK 1.15? https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-runtime/v1/nrt_start.html
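One way to narrow this down, assuming the SDK release has to be inferred from the versions of its individual components, is to list the installed Python packages and compare them against the Neuron release notes. The sketch below needs Python 3.8+ for `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the default package list (`torch`, `torch-neuron`, `neuron-cc`) is an assumption about what is installed in the environment. It does not cover system packages such as neuron-rtd.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torch-neuron", "neuron-cc")):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the same conda environment that provides libneuron_op.so so that the reported versions match the library DJL is loading.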

CensorKo commented 2 years ago

@frankfliu

Our model was trained with YOLOv5 and has already been exported to torchscript.pt files. Now we want to run it on an Inferentia instance. How do we trace our YOLOv5 torchscript.pt model using trace.py, or do we need to trace it at all before running it?

https://github.com/deepjavalibrary/djl-demo/blob/ce41d826890b768aa5d86ebec80efa46571ff12d/aws/inferentia/trace.py

frankfliu commented 2 years ago

@CensorKao I just created a demo for a Hugging Face model: https://github.com/deepjavalibrary/djl-demo/pull/184

frankfliu commented 2 years ago

You have to trace it with neuron-cc; regular TorchScript won't work on Inferentia.
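A minimal sketch of that tracing step, assuming torch-neuron is installed and the original PyTorch module (not only the exported TorchScript file) is available. `torch.neuron.trace` is the torch-neuron entry point that compiles through neuron-cc; the function name `compile_for_inferentia`, the output file name, and the 1x3x640x640 input shape are hypothetical and should be adapted to the actual model. The `trace_fn` parameter is there only so the function can be exercised on a machine without the Neuron packages.

```python
def compile_for_inferentia(model, example_input, output_path, trace_fn=None):
    """Trace `model` for Inferentia and save the compiled TorchScript file.

    If `trace_fn` is None, torch.neuron.trace is used, which requires the
    torch-neuron package (and its neuron-cc compiler) to be installed.
    """
    if trace_fn is None:
        import torch
        import torch_neuron  # noqa: F401  (registers the torch.neuron namespace)

        trace_fn = torch.neuron.trace

    # Compile the model for Neuron using a representative input tensor.
    model_neuron = trace_fn(model, example_inputs=[example_input])
    # The saved file is what DJL would load instead of the plain TorchScript.
    model_neuron.save(output_path)
    return model_neuron


# Example usage (assumes torch-neuron is installed; shape is an assumption):
#   import torch
#   model = ...  # your YOLOv5 torch.nn.Module, in eval() mode
#   example = torch.zeros(1, 3, 640, 640)
#   compile_for_inferentia(model, example, "yolov5_neuron.pt")
```

The input shape used for tracing should match what the Java side sends at inference time.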