apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bug] RuntimeError: Child process exited unsuccessfully with error code -6 #17495

Open MehdiTantaoui-99 opened 4 weeks ago


I ran tuning on an ONNX model using Python and the tvmc API. After roughly half of the tasks have been tuned, it throws an error that stops the tuning, so the run has to be restarted from the beginning. This has happened multiple times.

# Perform actual tuning with selected tasks
tvmc.tune(
    model,
    target=target,
    tuning_records=tuning_records,
    enable_autoscheduler=args.enable_autoscheduler,
    trials=args.tuning_trials,
    early_stopping=args.early_stopping,
    timeout=20,
)
print("Tuning completed.")

Task table when the error occurred:
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                    vm_mod_fused_nn_conv2d_add |        0.012 |         652.45 |     18 |
|    1 |                          vm_mod_fused_nn_conv2d_add_nn_relu_5 |        0.084 |        3351.34 |     18 |
|    2 |                              vm_mod_fused_nn_conv2d_add_add_3 |        0.028 |        4974.26 |     18 |
|    3 |                          vm_mod_fused_nn_conv2d_add_nn_relu_1 |        0.169 |        4028.20 |     18 |
|    4 |                        vm_mod_fused_nn_conv2d_add_add_nn_relu |        0.304 |        5958.98 |     18 |
|    5 |                      vm_mod_fused_nn_conv2d_add_add_nn_relu_5 |        0.129 |        3509.89 |     18 |
|    6 |                          vm_mod_fused_nn_conv2d_add_nn_relu_8 |        0.124 |        1992.50 |     18 |
|    7 |                              vm_mod_fused_nn_conv2d_add_add_1 |        0.087 |        3123.05 |     18 |
|    8 |                      vm_mod_fused_nn_conv2d_add_add_nn_relu_3 |        0.255 |        4438.36 |     18 |
|    9 |                          vm_mod_fused_nn_conv2d_add_nn_relu_4 |        0.267 |        5502.94 |     18 |
|   10 |                      vm_mod_fused_nn_conv2d_add_add_nn_relu_7 |        0.082 |        3001.29 |     18 |
|   11 |                      vm_mod_fused_nn_conv2d_add_add_nn_relu_1 |        0.426 |        5669.90 |     18 |
|   12 |                              vm_mod_fused_nn_conv2d_add_add_6 |        0.023 |        2781.69 |     18 |
|   13 |                            vm_mod_fused_nn_conv2d_add_nn_relu |        0.170 |        5459.73 |     18 |
|   14 |                          vm_mod_fused_nn_conv2d_add_nn_relu_7 |        0.165 |        3657.21 |     18 |
|   15 |                                vm_mod_fused_nn_conv2d_add_add |            - |              - |      0 |
|   16 |                              vm_mod_fused_nn_conv2d_add_add_4 |            - |              - |      0 |
|   17 |                          vm_mod_fused_nn_conv2d_add_nn_relu_3 |            - |              - |      0 |
|   18 |                      vm_mod_fused_nn_conv2d_add_add_nn_relu_6 |            - |              - |      0 |
|   19 |                              vm_mod_fused_nn_conv2d_add_add_2 |            - |              - |      0 |
|   20 |                      vm_mod_fused_nn_conv2d_add_add_nn_relu_4 |            - |              - |      0 |
|   21 |                          vm_mod_fused_nn_conv2d_add_nn_relu_6 |            - |              - |      0 |
|   22 |                            vm_mod_fused_nn_conv2d_add_sigmoid |            - |              - |      0 |
|   23 |                          vm_mod_fused_nn_conv2d_add_nn_relu_2 |            - |              - |      0 |
|   24 |                      vm_mod_fused_nn_conv2d_add_add_nn_relu_2 |            - |              - |      0 |
|   25 |                              vm_mod_fused_nn_conv2d_add_add_7 |            - |              - |      0 |
|   26 |                              vm_mod_fused_nn_conv2d_add_add_5 |            - |              - |      0 |
-----------------------------------------------------------------------------------------------------------------
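Because each crash forces the run to start over, one possible stopgap is to feed the records written before the crash into the next run. The sketch below assumes that `tvmc.tune` accepts a `prior_records` argument pointing at an earlier records file (as I understand the tvmc Python API) and that the auto-scheduler can resume from it. The helper name `resume_kwargs` and the file name in the usage comment are hypothetical, and this does not address the underlying CUDA error.

```python
from pathlib import Path


def resume_kwargs(records_path):
    """Build extra keyword arguments for a tuning call.

    If a non-empty records file from an earlier run exists, return it as
    `prior_records` (assumed parameter name); otherwise return an empty dict
    so the call behaves like a fresh run.
    """
    path = Path(records_path)
    if path.is_file() and path.stat().st_size > 0:
        return {"prior_records": str(path)}
    return {}


# Hypothetical usage with the call from this report (requires TVM):
# tvmc.tune(
#     model,
#     target=target,
#     tuning_records=tuning_records,
#     enable_autoscheduler=True,
#     **resume_kwargs("previous_records.json"),  # hypothetical file name
# )
```

Keeping the check in a small helper means the same script works for both the first run and any restart.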

Expected behavior

Tuning should complete all tasks.

Actual behavior

Tuning aborts with the following error:

terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [13:54:11] /home/ubuntu/tvm/src/runtime/cuda/cuda_device_api.cc:312: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: misaligned address
Stack trace:
  0: tvm::runtime::CUDATimerNode::~CUDATimerNode()
        at /home/ubuntu/tvm/src/runtime/cuda/cuda_device_api.cc:312
  1: tvm::runtime::SimpleObjAllocator::Handler<tvm::runtime::CUDATimerNode>::Deleter_(tvm::runtime::Object*)
        at /home/ubuntu/tvm/include/tvm/runtime/memory.h:138
  2: tvm::runtime::ObjectPtr<tvm::runtime::Object>::reset()
        at /home/ubuntu/tvm/include/tvm/runtime/object.h:455
  3: tvm::runtime::ObjectPtr<tvm::runtime::Object>::~ObjectPtr()
        at /home/ubuntu/tvm/include/tvm/runtime/object.h:404
  4: tvm::runtime::ObjectRef::~ObjectRef()
        at /home/ubuntu/tvm/include/tvm/runtime/object.h:519
  5: tvm::runtime::Timer::~Timer()
        at /home/ubuntu/tvm/include/tvm/runtime/profiling.h:86
  6: operator()
        at /home/ubuntu/tvm/src/runtime/profiling.cc:915
  7: tvm::runtime::LocalSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_local_session.cc:107
  8: tvm::runtime::RPCSession::AsyncCallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::RPCCode, tvm::runtime::TVMArgs)>)
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_session.cc:47
  9: tvm::runtime::RPCEndpoint::EventHandler::HandleNormalCallFunc()
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_endpoint.cc:542
  10: tvm::runtime::RPCEndpoint::EventHandler::HandleProcessPacket(std::function<void (tvm::runtime::TVMArgs)>)
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_endpoint.cc:362
  11: tvm::runtime::RPCEndpoint::EventHandler::HandleNextEvent(bool, bool, std::function<void (tvm::runtime::TVMArgs)>)
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_endpoint.cc:136
  12: tvm::runtime::RPCEndpoint::HandleUntilReturnEvent(bool, std::function<void (tvm::runtime::TVMArgs)>)
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_endpoint.cc:714
  13: tvm::runtime::RPCEndpoint::ServerLoop()
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_endpoint.cc:805
  14: tvm::runtime::RPCServerLoop(int)
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_socket_impl.cc:119
  15: operator()
        at /home/ubuntu/tvm/src/runtime/rpc/rpc_socket_impl.cc:138

Exception in thread Thread-1 (_listen_loop):
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/tvm-build-venv/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/miniconda3/envs/tvm-build-venv/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/tvm/python/tvm/rpc/server.py", line 279, in _listen_loop
    _serving(conn, addr, opts, load_library)
  File "/home/ubuntu/tvm/python/tvm/rpc/server.py", line 168, in _serving
    raise RuntimeError(
RuntimeError: Child process 49293 exited unsuccessfully with error code -6
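For readers wondering about the `-6`: Python reports a child process that was killed by a signal as a negative exit code, and on Linux signal 6 is SIGABRT. That would match the `terminate called after throwing an instance of ...` message above, since `std::terminate` calls `abort()` by default. The snippet below is a minimal sketch for decoding such codes. It assumes the RPC server reports the exit code using Python's convention, and `describe_exit_code` is a hypothetical helper, not part of TVM.

```python
import signal


def describe_exit_code(code):
    """Describe a child-process exit code as reported by Python.

    Negative values mean the process was terminated by signal number -code.
    """
    if code == 0:
        return "exited successfully"
    if code < 0:
        signum = -code
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(-6))
```

On Linux this reports signal 6 as SIGABRT, so the RuntimeError is a consequence of the C++ abort in the worker process rather than a separate failure.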

Environment

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
TVM version: 0.19.dev0