bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

Pytorch Tensor set_data() api crash #1258

Closed: lzmchina closed this issue 10 months ago

lzmchina commented 1 year ago

In my last issue (https://github.com/bytedeco/javacpp-presets/issues/1255), I wanted to update a model's parameters using the tensor's set_data() API.

But it always crashes:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc7f147457c, pid=2564679, tid=2565061
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.13+8 (11.0.13+8) (build 11.0.13+8)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (11.0.13+8, mixed mode, tiered, g1 gc, linux-amd64)
# Problematic frame:
# C  [libtorch_cpu.so+0x19cf57c]  c10::DispatchKeySet c10::DispatchKeyExtractor::getDispatchKeySetUnboxed<at::Tensor const&, at::Tensor const&>(at::Tensor const& const&, at::Tensor const& const&) const+0xc
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h" (or dumping to /home/lzm/dl4flink/java/core.2564679)
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  S U M M A R Y ------------

Command Line: -Xmx32G nju.lzm.DataStreamJobTest

Host: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 20 cores, 125G, CentOS Linux release 7.9.2009 (Core)
Time: Fri Nov  4 16:43:13 2022 CST elapsed time: 13.115509 seconds (0d 0h 0m 13s)

---------------  T H R E A D  ---------------

Current thread (0x00007fca2821f000):  JavaThread "Forward: vgg19/vgg19_1 (1/1)#0" [_thread_in_native, id=2565061, stack(0x00007fc8c47c8000,0x00007fc8c48c9000)]

Stack: [0x00007fc8c47c8000,0x00007fc8c48c9000],  sp=0x00007fc8c48c6de0,  free space=1019k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libtorch_cpu.so+0x19cf57c]  c10::DispatchKeySet c10::DispatchKeyExtractor::getDispatchKeySetUnboxed<at::Tensor const&, at::Tensor const&>(at::Tensor const& const&, at::Tensor const& const&) const+0xc

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.bytedeco.pytorch.TensorBase.set_data(Lorg/bytedeco/pytorch/TensorBase;)V+0

I'm really confused!

saudet commented 1 year ago

IIRC, Tensor.set_data() will not copy the data, but only create a reference to it, so you need to make sure that the data from the other Tensor doesn't get deallocated.
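
To illustrate the point above, here is a minimal, hypothetical sketch (not the reporter's actual code) of the failure mode: set_data() only re-points the target tensor at the source tensor's data, so if the source has already been deallocated when set_data() is called, or gets deallocated afterwards while its data is still referenced, native code dereferences a dangling pointer and the JVM dies with SIGSEGV. The helper methods and the pinning field are made-up names for illustration.

```java
import org.bytedeco.pytorch.Tensor;

public class SetDataPitfall {
    // Hypothetical: keep a strong reference to the source tensor for as long
    // as the parameter that now references its data is in use, so JavaCPP's
    // deallocator does not free it prematurely.
    private Tensor pinnedSource;

    void unsafeUpdate(Tensor param, Tensor newValue) {
        // set_data() does NOT copy: param now references newValue's data.
        param.set_data(newValue);
        // If newValue is deallocated after this method returns, any later use
        // of param (e.g. the next forward pass) touches freed memory -> SIGSEGV.
    }

    void saferUpdate(Tensor param, Tensor newValue) {
        param.set_data(newValue);
        pinnedSource = newValue; // pin the source so its data stays alive
    }
}
```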

lzmchina commented 1 year ago

> IIRC, Tensor.set_data() will not copy the data, but only create a reference to it, so you need to make sure that the data from the other Tensor doesn't get deallocated.

So should I use the copy_() API instead?

HGuillemet commented 1 year ago

Have you solved the problem related to this question? If you have, please close this issue and issue #1255. If you haven't: it may be possible to directly modify the named_parameters dictionary, using erase() and insert(), but it's probably safer to copy the data instead with copy_().
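
As a rough sketch of the copy_()-based approach, under the assumption that the presets expose Tensor.copy_(Tensor) and NoGradGuard (exact signatures may vary between versions): copy the new values into the parameter's existing storage instead of re-pointing it, so no ownership changes hands and the source tensor can be deallocated afterwards without affecting the model.

```java
import org.bytedeco.pytorch.NoGradGuard;
import org.bytedeco.pytorch.Tensor;

public class ParamCopyUpdate {
    // Overwrite a parameter in place: copy_() writes newValue's elements into
    // param's existing storage, so param keeps its identity and autograd state.
    static void updateParameter(Tensor param, Tensor newValue) {
        try (NoGradGuard noGrad = new NoGradGuard()) { // keep the copy out of the autograd graph
            param.copy_(newValue);
        }
    }
}
```

Replacing entries of the named_parameters dictionary with erase()/insert(), as mentioned above, is also possible, but then the caller has to manage the lifetime of the replacement tensors itself.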