deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Onnxruntime-gpu 1.8.0 killed the process on cpu device #3366

Open zaobao opened 1 month ago

zaobao commented 1 month ago

Environment Info

Container: Docker with NO GPU
OS: AlmaLinux
CUDA installed: 12.2
Cudnn installed: 8.9.0
djl version: 0.29.0
onnxruntime_gpu version: 1.8.0

Error Message

[root@r100048367-91051506-l5wvj powerop]# cat /tmp/hs_err_pid1062.log | more
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f6be8b25d12, pid=1062, tid=0x00007f6ddfdff640
#
# JRE version: OpenJDK Runtime Environment (8.0_302-b08) (build 1.8.0_302-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.302-b08 mixed mode linux-amd64 )
# Problematic frame:
# C  [libonnxruntime_providers_cuda.so+0x1a4d12]
#
# Core dump written. Default location: //core or core.1062
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f6ef6394000):  JavaThread "igniteThread" daemon [_thread_in_native, id=1579, stack(0x00007f6ddfdc0000,0x00007f6ddfe00000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000

Registers:
RAX=0x00007f6c07e18828, RBX=0x00007f6ddfdfc570, RCX=0x0000000000000006, RDX=0x0000000000000000
RSP=0x00007f6ddfdfc550, RBP=0x00007f6ddfdfc650, RSI=0x0000000000000000, RDI=0x00007f6ddfdfc570
R8 =0x00007f6ddd6256a0, R9 =0x00007f6ddd618db8, R10=0x0000000000000000, R11=0x00007f6ddd625700
R12=0x00007f6d5c686a80, R13=0x00007f6ddfdfc570, R14=0x00007f6c05eccc78, R15=0x0000000000000000
RIP=0x00007f6be8b25d12, EFLAGS=0x0000000000010246, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007f6ddfdfc550)
0x00007f6ddfdfc550:   00007f6ddfdfc570 a662eca985aa6800
0x00007f6ddfdfc560:   00007f6ddfdfc590 00007f6be8ae3708
0x00007f6ddfdfc570:   000000770000007c 0000005d0000006e
0x00007f6ddfdfc580:   0000000000000000 0000000001180470
0x00007f6ddfdfc590:   00007f6ddfdfc5a0 0000000000000000
0x00007f6ddfdfc5a0:   00007f6d5ee78b00 00007f79a8ffa838
0x00007f6ddfdfc5b0:   0000000000000000 00007f79a8eb00fe
0x00007f6ddfdfc5c0:   0000000000000000 0000000000000000
0x00007f6ddfdfc5d0:   0000000000000020 00007f79a8ffa838
0x00007f6ddfdfc5e0:   00007f6d5ca51370 00007f6c07e19cd9
0x00007f6ddfdfc5f0:   00007f79a8ffbee8 00007f79a8e577a2
0x00007f6ddfdfc600:   0000000000000040 00007f6ddd4beda0
0x00007f6ddfdfc610:   00007f6c05eccc80 a662eca985aa6800
0x00007f6ddfdfc620:   00007f6ddd4beda0 00007f6ddfdfc650
0x00007f6ddfdfc630:   00007ffc745823a8 00007ffc74582560
0x00007f6ddfdfc640:   00007f6c05eccc78 00007f6be8a1d762
0x00007f6ddfdfc650:   0000000000011c30 0000000000000470
0x00007f6ddfdfc660:   000004a0000011c1 0000000000000002
0x00007f6ddfdfc670:   0000000000000011 000000000000008e
0x00007f6ddfdfc680:   000000790000007c 000000e90000007f
0x00007f6ddfdfc690:   00007f6d5ca3edb0 ffffffffffffffb8
0x00007f6ddfdfc6a0:   0000000000011c00 00007f6dc8000020
0x00007f6ddfdfc6b0:   00007ffc74582560 00007f6bbfe70470
0x00007f6ddfdfc6c0:   00007f6bc59ae680 a662eca985aa6800
0x00007f6ddfdfc6d0:   00007f6bc59ae680 00007f6c05ecc318
0x00007f6ddfdfc6e0:   0000000000000036 00007ffc745823a8
0x00007f6ddfdfc6f0:   00007ffc74582560 00007f6c05eccc78
0x00007f6ddfdfc700:   0000000000000000 00007f79a95cb1ee
0x00007f6ddfdfc710:   fffffffffffffff8 0000000000000036
0x00007f6ddfdfc720:   00007ffc745823a8 00007ffc74582560
0x00007f6ddfdfc730:   00007f6d5c6de6c0 00007f79a95cb2dc
0x00007f6ddfdfc740:   00007ffc745823a8 00007f6ddfdfca40

Instructions: (pc=0x00007f6be8b25d12)
0x00007f6be8b25cf2:   89 fb 48 83 ec 10 64 48 8b 04 25 28 00 00 00 48
0x00007f6be8b25d02:   89 44 24 08 31 c0 48 8d 05 19 2b 2f 1f 48 8b 30
0x00007f6be8b25d12:   48 8b 06 ff 50 30 48 8b 54 24 08 64 48 33 14 25
0x00007f6be8b25d22:   28 00 00 00 75 09 48 83 c4 10 48 89 d8 5b c3 e8

Register to memory mapping:

RAX=0x00007f6c07e18828: <offset 0x1f497828> in /opt/tomcat/temp/onnxruntime-java757573562719520016/libonnxruntime_providers_cuda.so at 0x00007f6be8981000
RBX=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
RCX=0x0000000000000006 is an unknown value
RDX=0x0000000000000000 is an unknown value
RSP=0x00007f6ddfdfc550 is pointing into the stack for thread: 0x00007f6ef6394000
RBP=0x00007f6ddfdfc650 is pointing into the stack for thread: 0x00007f6ef6394000
RSI=0x0000000000000000 is an unknown value
RDI=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
R8 =0x00007f6ddd6256a0: <offset 0x2256a0> in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R9 =0x00007f6ddd618db8: _ZTINSt6locale5facetE+0 in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R10=0x0000000000000000 is an unknown value
R11=0x00007f6ddd625700: <offset 0x225700> in /lib64/libstdc++.so.6 at 0x00007f6ddd400000
R12=0x00007f6d5c686a80 is an unknown value
R13=0x00007f6ddfdfc570 is pointing into the stack for thread: 0x00007f6ef6394000
R14=0x00007f6c05eccc78: <offset 0x1d54bc78> in /opt/tomcat/temp/onnxruntime-java757573562719520016/libonnxruntime_providers_cuda.so at 0x00007f6be8981000
R15=0x0000000000000000 is an unknown value

Stack: [0x00007f6ddfdc0000,0x00007f6ddfe00000],  sp=0x00007f6ddfdfc550,  free space=241k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libonnxruntime_providers_cuda.so+0x1a4d12]
C  0x0000000000000470

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  ai.onnxruntime.OrtSession$SessionOptions.addCUDA(JJI)V+0
j  ai.onnxruntime.OrtSession$SessionOptions.addCUDA(I)V+19
j  ai.onnxruntime.OrtSession$SessionOptions.addCUDA()V+2
j  ai.djl.onnxruntime.engine.OrtEngine.hasCapability(Ljava/lang/String;)Z+29
j  ai.djl.engine.Engine.defaultDevice()Lai/djl/Device;+10
j  ai.djl.ndarray.BaseNDManager.defaultDevice()Lai/djl/Device;+4
j  ai.djl.ndarray.BaseNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;)V+39
j  ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;)V+3
j  ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;Lai/djl/onnxruntime/engine/OrtNDManager$1;)V+4
j  ai.djl.onnxruntime.engine.OrtNDManager$SystemManager.<init>()V+15
j  ai.djl.onnxruntime.engine.OrtNDManager.<clinit>()V+4
v  ~StubRoutines::call_stub
j  ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(Lai/djl/Device;)Lai/djl/ndarray/NDManager;+0
j  ai.djl.onnxruntime.engine.OrtEngine.newModel(Ljava/lang/String;Lai/djl/Device;)Lai/djl/Model;+7
j  ai.djl.Model.newInstance(Ljava/lang/String;Lai/djl/Device;Ljava/lang/String;)Lai/djl/Model;+23
j  ai.djl.repository.zoo.BaseModelLoader.createModel(Ljava/nio/file/Path;Ljava/lang/String;Lai/djl/Device;Lai/djl/nn/Block;Ljava/util/Map;Ljava/lang/String;)Lai/djl/Model;+4
j  ai.djl.repository.zoo.BaseModelLoader.loadModel(Lai/djl/repository/zoo/Criteria;)Lai/djl/repository/zoo/ZooModel;+506
j  ai.djl.repository.zoo.Criteria.loadModel()Lai/djl/repository/zoo/ZooModel;+524

What have you tried to solve it?

I changed defaultDevice() in ai.djl.engine.Engine so that it checks the GPU count before calling hasCapability, and the crash no longer reproduces:

    public Device defaultDevice() {
        if (defaultDevice == null) {
            if (CudaUtils.getGpuCount() > 0 && hasCapability(StandardCapabilities.CUDA)) { // check gpu-count first
                defaultDevice = Device.gpu();
            } else {
                defaultDevice = Device.cpu();
            }
        }
        return defaultDevice;
    }
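
In the first trace, the crash happens inside OrtSession$SessionOptions.addCUDA, which is reached through OrtEngine.hasCapability. The change above relies on `&&` short-circuiting, so that the native probe is skipped when no GPU is reported. A minimal, self-contained sketch of that ordering follows. It does not use DJL: `DeviceGuard`, `chooseDevice`, and the two suppliers are hypothetical stand-ins for CudaUtils.getGpuCount() and hasCapability(StandardCapabilities.CUDA).

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

/** Hypothetical stand-in for the device selection logic; this is not DJL code. */
final class DeviceGuard {

    private DeviceGuard() {}

    /**
     * Returns "gpu" only when at least one GPU is reported and the probe succeeds.
     * Because && short-circuits, the probe is never invoked when the count is 0.
     */
    static String chooseDevice(IntSupplier gpuCount, BooleanSupplier cudaProbe) {
        if (gpuCount.getAsInt() > 0 && cudaProbe.getAsBoolean()) {
            return "gpu";
        }
        return "cpu";
    }

    public static void main(String[] args) {
        // Simulates a probe that must not run on a machine without a GPU.
        BooleanSupplier unsafeProbe = () -> {
            throw new IllegalStateException("probe must not run without a GPU");
        };
        System.out.println(chooseDevice(() -> 0, unsafeProbe)); // prints "cpu"
        System.out.println(chooseDevice(() -> 1, () -> true));  // prints "gpu"
    }
}
```

With the checks in the opposite order, the first call would invoke the probe and throw, which mirrors the unguarded path in the trace.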

frankfliu commented 1 month ago

Why are you using the onnxruntime_gpu dependency on a machine without a GPU?

Justubborn commented 2 weeks ago

I have the same problem with onnxruntime 1.18.0.

Container: Docker with NO GPU
OS: openEuler
djl version: 0.29.0
onnxruntime_gpu version: 1.18.0

#
#  SIGSEGV (0xb) at pc=0x00007fb6285e1e3b, pid=885, tid=917
#
# JRE version: Java(TM) SE Runtime Environment (17.0.12+8) (build 17.0.12+8-LTS-286)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.12+8-LTS-286, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libstdc++.so.6+0xd3e3b]
Time: Fri Aug 30 03:13:24 2024 UTC elapsed time: 28.554432 seconds (0d 0h 0m 28s)

---------------  T H R E A D  ---------------

Current thread (0x00007fb590081400):  JavaThread "XNIO-1 task-2" [_thread_in_native, id=917, stack(0x00007fb63823a000,0x00007fb63833a000)]

Stack: [0x00007fb63823a000,0x00007fb63833a000],  sp=0x00007fb6383332d8,  free space=996k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libstdc++.so.6+0xd3e3b]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  jdk.internal.loader.NativeLibraries.load(Ljdk/internal/loader/NativeLibraries$NativeLibraryImpl;Ljava/lang/String;ZZZ)Z+0 java.base@17.0.12
j  jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open()Z+61 java.base@17.0.12
j  jdk.internal.loader.NativeLibraries.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)Ljdk/internal/loader/NativeLibrary;+256 java.base@17.0.12
j  jdk.internal.loader.NativeLibraries.loadLibrary(Ljava/lang/Class;Ljava/io/File;)Ljdk/internal/loader/NativeLibrary;+51 java.base@17.0.12
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/io/File;)Ljdk/internal/loader/NativeLibrary;+31 java.base@17.0.12
j  java.lang.Runtime.load0(Ljava/lang/Class;Ljava/lang/String;)V+61 java.base@17.0.12
j  java.lang.System.load(Ljava/lang/String;)V+7 java.base@17.0.12
j  ai.djl.pytorch.jni.LibUtils.loadNativeLibrary(Ljava/lang/String;)V+39
j  ai.djl.pytorch.jni.LibUtils.loadLibTorch(Lai/djl/pytorch/jni/LibUtils$LibTorch;)V+548
j  ai.djl.pytorch.jni.LibUtils.loadLibrary()V+28
j  ai.djl.pytorch.engine.PtEngine.newInstance()Lai/djl/engine/Engine;+0
j  ai.djl.pytorch.engine.PtEngineProvider.getEngine()Lai/djl/engine/Engine;+17
j  ai.djl.engine.Engine.getEngine(Ljava/lang/String;)Lai/djl/engine/Engine;+45
j  ai.djl.engine.Engine.getInstance()Lai/djl/engine/Engine;+43
j  ai.djl.onnxruntime.engine.OrtEngine.getAlternativeEngine()Lai/djl/engine/Engine;+15
j  ai.djl.ndarray.BaseNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;)V+85
j  ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;)V+3
j  ai.djl.onnxruntime.engine.OrtNDManager.<init>(Lai/djl/ndarray/NDManager;Lai/djl/Device;Lai/onnxruntime/OrtEnvironment;Lai/djl/onnxruntime/engine/OrtNDManager$1;)V+4
j  ai.djl.onnxruntime.engine.OrtNDManager$SystemManager.<init>()V+15
j  ai.djl.onnxruntime.engine.OrtNDManager.<clinit>()V+4
v  ~StubRoutines::call_stub
j  ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(Lai/djl/Device;)Lai/djl/ndarray/NDManager;+0
j  ai.djl.onnxruntime.engine.OrtEngine.newModel(Ljava/lang/String;Lai/djl/Device;)Lai/djl/Model;+7
j  ai.djl.Model.newInstance(Ljava/lang/String;Lai/djl/Device;Ljava/lang/String;)Lai/djl/Model;+23
j  ai.djl.repository.zoo.BaseModelLoader.createModel(Ljava/nio/file/Path;Ljava/lang/String;Lai/djl/Device;Lai/djl/nn/Block;Ljava/util/Map;Ljava/lang/String;)Lai/djl/Model;+4
j  ai.djl.repository.zoo.BaseModelLoader.loadModel(Lai/djl/repository/zoo/Criteria;)Lai/djl/repository/zoo/ZooModel;+506
j  ai.djl.repository.zoo.Criteria.loadModel()Lai/djl/repository/zoo/ZooModel;+524
j  ai.djl.repository.zoo.ModelZoo.loadModel(Lai/djl/repository/zoo/Criteria;)Lai/djl/repository/zoo/ZooModel;+1
j  org.aoju.bus.ocr.toolkit.OcrV4Kit.runOcr(Ljava/io/InputStream;)Lorg/aoju/bus/ocr/entity/OcrResult;+50
j  cn.econta.tangor.service.OcrService.sync([B)Lorg/aoju/bus/ocr/entity/OcrResult;+10
j  cn.econta.tangor.spring.OcrController.jsonPpWorld(Ljava/lang/String;)Ljava/lang/Object;+15
v  ~StubRoutines::call_stub

frankfliu commented 2 weeks ago

ONNX Runtime ships two separate jar files, a CPU one and a _gpu one. I don't think you can mismatch them.

Justubborn commented 2 weeks ago

Using only the ONNX CPU jar together with PyTorch also causes the Java crash.