Open PigletOS opened 2 weeks ago
Hi,
After running for a long time, Inf1 failed to execute the model. Here are the logs:
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:dlr_kelf_start_no_lock Model (1001) start failed for VNC=0, ret: 5
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:tpbs_infer_lock Failed to start model
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:dlr_infer Failed to acquire infer locks
[2024-11-06 06:40:46,642][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
    return self.forward(*xargs, **kwargs)
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
    ret = self.model(*args)
  File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
[2024-11-06 06:40:46,642][pid=27][WARNING] re-try times 1
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 0, waiting for execution completion notification
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:dlr_kelf_start_no_lock Model (1001) start failed for VNC=0, ret: 5
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:tpbs_infer_lock Failed to start model
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:dlr_infer Failed to acquire infer locks
[2024-11-06 06:40:46,661][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
    return self.forward(*xargs, **kwargs)
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
    ret = self.model(*args)
  File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
[2024-11-06 06:40:46,661][pid=27][WARNING] re-try times 1
[2024-11-06 06:40:52,030][pid=28][INFO] No message. Put worker to sleep for a while...
[2024-11-06 06:40:58,041][pid=28][INFO] current sqs message num is 0
[2024-11-06 06:41:08,115][pid=28][INFO] No message. Put worker to sleep for a while...
[2024-11-06 06:41:14,125][pid=28][INFO] current sqs message num is 0
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:2)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:0)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:1)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) inference timeout (30000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:notification_consume_error_block Error notifications found on nd0 nc1; action=INFER_ERROR_SUBTYPE_MODEL; error_id=8; error string:Event double set
2024-Nov-06 06:41:16.0646 2895:2895 ERROR NMGR:dlr_infer Inference completed with err: 5
[2024-11-06 06:41:16,650][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
    return self.forward(*xargs, **kwargs)
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
    ret = self.model(*args)
  File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
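Since the TorchScript traceback repeats for every failure, the runtime lines are easier to triage when pulled out on their own. Below is a minimal sketch of how that could be done. It is not part of the Neuron tooling: the regular expressions are assumptions based only on the line format visible in the logs above, and `summarize_runtime_errors` and `find_timeouts` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed runtime line format, inferred from the logs in this issue:
#   2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:some_function message...
RUNTIME_LINE = re.compile(
    r"(?P<ts>\d{4}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+):(?P<tid>\d+) "
    r"(?P<level>[A-Z]+) +"
    r"(?P<component>[A-Z]+):(?P<func>\w+)"
)

# Assumed wording of the two timeout messages seen in this issue.
TIMEOUT = re.compile(
    r"(?P<kind>model start|inference) timeout \((?P<ms>\d+) ms\) "
    r"on Neuron Device (?P<device>\d+) NC (?P<nc>\d+)"
)


def summarize_runtime_errors(log_text):
    """Count runtime log entries per (component, function) pair."""
    counts = Counter()
    # finditer scans the whole text, so it also works on a flattened paste.
    for match in RUNTIME_LINE.finditer(log_text):
        counts[(match.group("component"), match.group("func"))] += 1
    return counts


def find_timeouts(log_text):
    """Return (kind, milliseconds, device, neuron_core) for each timeout."""
    return [
        (m.group("kind"), int(m.group("ms")), int(m.group("device")), int(m.group("nc")))
        for m in TIMEOUT.finditer(log_text)
    ]


# A short excerpt of the logs above, used only to exercise the helpers.
sample = (
    "2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications "
    "Missing infer_status notification: (end:2) "
    "2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:consume_model_start_extra_notifications_v1"
    "(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 1, "
    "waiting for execution completion notification "
    "2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_infer_status_notifications "
    "(FATAL-RT-UNDEFINED-STATE) inference timeout (30000 ms) on Neuron Device 0 NC 1, "
    "waiting for execution completion notification"
)

if __name__ == "__main__":
    for (component, func), n in summarize_runtime_errors(sample).items():
        print(f"{component}:{func} x{n}")
    for timeout in find_timeouts(sample):
        print(timeout)
```

Run against the full log, this should make it easier to see that two model start timeouts (2000 ms) are followed about 30 seconds later by an inference timeout (30000 ms) on the same device.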