apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bug] Resizing terminal causes TVM RPC segfault #17063

Open happyme531 opened 4 months ago

happyme531 commented 4 months ago

As the title says: when I use TVM MetaSchedule with RPC to run tuning on another device and resize the terminal of the host tuning process, an RPC runner process on the host immediately segfaults.

Expected behavior

TVM should not segfault when the terminal is resized.

Actual behavior

2024-06-04 21:24:22 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #112: "conv2d21"
2024-06-04 21:24:31 [INFO] [task_scheduler.cc:193] Sending 64 sample(s) to builder
!!!!!!! TVM encountered a Segfault !!!!!!!
Stack trace:
  0: tvm::runtime::(anonymous namespace)::backtrace_handler(int)
        at /home/zt/rk3588-nn/tvm/src/runtime/logging.cc:214
  1: 0x00007f925569fadf
  2: tvm::runtime::EnvCAPIRegistry::CheckSignals()
        at /home/zt/rk3588-nn/tvm/src/runtime/registry.cc:186
  3: long tvm::support::RetryCallOnEINTR<tvm::support::TCPSocket::Recv(void*, unsigned long, int)::{lambda()#1}, int (*)()>(tvm::support::TCPSocket::Recv(void*, unsigned long, int)::{lambda()#1}, int (*)())
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/../../support/errno_handling.h:58
  4: tvm::support::TCPSocket::Recv(void*, unsigned long, int)
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/../../support/socket.h:481
  5: tvm::runtime::SockChannel::Recv(void*, unsigned long)
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/rpc_socket_impl.cc:56
  6: tvm::runtime::RPCEndpoint::HandleUntilReturnEvent(bool, std::function<void (tvm::runtime::TVMArgs)>)::$_1::operator()(void*, unsigned long) const
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/rpc_endpoint.cc:705
  7: unsigned long tvm::support::RingBuffer::WriteWithCallback<tvm::runtime::RPCEndpoint::HandleUntilReturnEvent(bool, std::function<void (tvm::runtime::TVMArgs)>)::$_1>(tvm::runtime::RPCEndpoint::HandleUntilReturnEvent(bool, std::function<void (tvm::runtime::TVMArgs)>)::$_1, unsigned long)
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/../../support/ring_buffer.h:174
  8: tvm::runtime::RPCEndpoint::HandleUntilReturnEvent(bool, std::function<void (tvm::runtime::TVMArgs)>)
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/rpc_endpoint.cc:704
  9: tvm::runtime::RPCEndpoint::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)>)
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/rpc_endpoint.cc:870
  10: tvm::runtime::RPCClientSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/rpc_endpoint.cc:1087
  11: tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /home/zt/rk3588-nn/tvm/src/runtime/rpc/rpc_module.cc:129

2024-06-04 21:24:42 [INFO] [task_scheduler.cc:195] Sending 64 sample(s) to runner
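
Frames 2 to 4 of the trace pass through `RetryCallOnEINTR` and `CheckSignals`, which suggests the crash happens while a blocking `recv` is being retried after a signal interrupted it. Resizing a terminal delivers `SIGWINCH` to the foreground process group, which is one way a blocking call can fail with `EINTR`. The following is a minimal Python sketch of that retry pattern, not TVM's C++ implementation; `retry_call_on_eintr`, `FlakyRecv` and `check_signals` are hypothetical names used only for illustration.

```python
import errno


def retry_call_on_eintr(func, check_signals):
    """Call func() until it finishes without being interrupted by a signal.

    check_signals() is invoked after every interruption. It returns 0 to keep
    retrying; any other value aborts by re-raising the interruption.
    (Hypothetical helper, modelled loosely on the frame names in the trace.)
    """
    while True:
        try:
            return func()
        except InterruptedError:
            # A signal (for example SIGWINCH) interrupted the blocking call.
            if check_signals() != 0:
                raise


class FlakyRecv:
    """Stand-in for a blocking recv that is interrupted a fixed number of times."""

    def __init__(self, interruptions, payload):
        self.remaining = interruptions
        self.payload = payload
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return self.payload


signal_checks = []


def check_signals():
    # Record that the signal check ran, then ask the caller to retry.
    signal_checks.append(1)
    return 0


recv = FlakyRecv(interruptions=2, payload=b"ok")
data = retry_call_on_eintr(recv, check_signals)
print(data, recv.calls, len(signal_checks))
```

In this pattern the signal check runs on every interruption, so if the check itself is unsafe in the calling process, any stray signal is enough to bring the process down. Whether that is the actual cause here is an assumption that the trace supports but does not prove.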

Environment

Host: Manjaro Linux 24.0.1, TVM master branch 78a1f80bf24f1a1114f2ed7d17563d267bb38cc9

Device: RK3588 ARM SoC, Debian 11, TVM master branch 78a1f80bf24f1a1114f2ed7d17563d267bb38cc9

Steps to reproduce

# %%
import tvm
from tvm import relay
from tvm import relax
from tvm.relax.frontend.onnx import from_onnx
from tvm.relax.testing import relay_translator
from tvm.driver.tvmc.transform import apply_graph_transforms
import onnx
import tvm.testing
import tvm.topi.testing
from tvm.ir.module import IRModule
from tvm import meta_schedule as ms
import tvm.tir.tensor_intrin.arm_cpu 
from tvm.meta_schedule.runner import (
    EvaluatorConfig,
    LocalRunner,
    PyRunner,
    RPCConfig,
    RPCRunner,
)

# %%
target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 -num-cores=1")
onnx_model_path = "yolov5s.onnx" 
shape_dict = {"images": (1, 3, 640, 640)}

# %%
onnx_model = onnx.load(onnx_model_path)
mod0, params = relay.frontend.from_onnx(onnx_model, shape_dict)
mod: IRModule = relay_translator.from_relay(mod0["main"], target, params)
mod = apply_graph_transforms(
    mod,
    {
        "mixed_precision": True,
        "mixed_precision_calculation_type": "float16",
        "mixed_precision_acc_type": "float16",
    },
)
rpc_config = RPCConfig(
    tracker_host="127.0.0.1",
    tracker_port=9190,
    tracker_key="rk3588", 
    session_priority=1,
    session_timeout_sec=10,
)
evaluator_config = EvaluatorConfig(
    number=1,
    repeat=1,
    min_repeat_ms=5,
    enable_cpu_cache_flush=True,
)
runner = RPCRunner(rpc_config, evaluator_config)
database = ms.relax_integration.tune_relax(
    mod=mod,
    params=params,
    target=target,
    max_trials_global=10000, 
    runner=runner,
    work_dir="./work2",
    seed=0
)

# %%
# Compile the best schedule
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=mod,
    params=params,
    target=target,
)

# %%
import tvm.driver.tvmc.model as tvmc_model
model = tvmc_model.TVMCModel(mod, params)
model.export_package(lib, onnx_model_path.replace(".onnx", ".tar"), "aarch64-linux-gnu-gcc") 
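
Resizing the terminal by hand is hard to script. Assuming the crash is triggered by the `SIGWINCH` signal that a resize delivers (an assumption consistent with the trace, not something confirmed in this report), the same condition might be reproduced by sending that signal directly to the host tuning process or one of its RPC runner workers. A minimal sketch for Unix-like hosts follows; `send_sigwinch` is a hypothetical helper, and the target PID has to be looked up separately, for example with `ps`.

```python
import os
import signal
import time


def send_sigwinch(pid, count=1, interval_sec=0.1):
    """Send SIGWINCH to the process `pid` `count` times.

    Returns the number of signals sent. SIGWINCH is ignored by default,
    so a process without special handling should be unaffected.
    (Hypothetical helper; Unix-only, since SIGWINCH does not exist on Windows.)
    """
    sent = 0
    for _ in range(count):
        os.kill(pid, signal.SIGWINCH)
        sent += 1
        time.sleep(interval_sec)
    return sent
```

For example, `send_sigwinch(runner_pid, count=20)` while tuning is in the "Sending 64 sample(s) to runner" phase, where `runner_pid` is a placeholder for the PID found on the host.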

Triage