aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

[Inf1] RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded #1028

Open PigletOS opened 2 weeks ago

PigletOS commented 2 weeks ago

Hi,

Inf1 failed to execute the model after a long time. Here are the logs:

2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:dlr_kelf_start_no_lock Model (1001) start failed for VNC=0, ret: 5
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:tpbs_infer_lock Failed to start model
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:dlr_infer Failed to acquire infer locks
[2024-11-06 06:40:46,642][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
return self.forward(*xargs, **kwargs)
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
ret = self.model(*args)
File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
[2024-11-06 06:40:46,642][pid=27][WARNING] re-try times 1
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 0, waiting for execution completion notification
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:dlr_kelf_start_no_lock Model (1001) start failed for VNC=0, ret: 5
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:tpbs_infer_lock Failed to start model
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:dlr_infer Failed to acquire infer locks
[2024-11-06 06:40:46,661][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
return self.forward(*xargs, **kwargs)
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
ret = self.model(*args)
File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
[2024-11-06 06:40:46,661][pid=27][WARNING] re-try times 1
[2024-11-06 06:40:52,030][pid=28][INFO] No message. Put worker to sleep for a while...
[2024-11-06 06:40:58,041][pid=28][INFO] current sqs message num is 0
[2024-11-06 06:41:08,115][pid=28][INFO] No message. Put worker to sleep for a while...
[2024-11-06 06:41:14,125][pid=28][INFO] current sqs message num is 0
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:2)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:0)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:1)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) inference timeout (30000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:notification_consume_error_block Error notifications found on nd0 nc1; action=INFER_ERROR_SUBTYPE_MODEL; error_id=8; error string:Event double set
2024-Nov-06 06:41:16.0646 2895:2895 ERROR NMGR:dlr_infer Inference completed with err: 5
[2024-11-06 06:41:16,650][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
return self.forward(*xargs, **kwargs)
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
ret = self.model(*args)
File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
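
To make the log above easier to scan, here is a minimal sketch that pulls out only the timeout lines and reports which kind of timeout fired, its limit, and the device and NeuronCore involved. The regular expression is an assumption based purely on the two line shapes visible in this log ("model start timeout" and "inference timeout"); it is not an official Neuron log format. The function name `summarize_timeouts` is hypothetical.

```python
import re

# Assumed pattern: matches only the two timeout line shapes seen in this issue's log.
TIMEOUT_RE = re.compile(
    r"(?P<kind>model start|inference) timeout \((?P<ms>\d+) ms\) "
    r"on Neuron Device (?P<device>\d+) NC (?P<nc>\d+)"
)


def summarize_timeouts(log_text):
    """Return one dict per timeout line found in log_text, in log order."""
    events = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            events.append({
                "kind": match.group("kind"),
                "timeout_ms": int(match.group("ms")),
                "device": int(match.group("device")),
                "nc": int(match.group("nc")),
            })
    return events


# Three lines copied from the log in this issue.
sample = "\n".join([
    "2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification",
    "2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 0, waiting for execution completion notification",
    "2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) inference timeout (30000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification",
])

events = summarize_timeouts(sample)
for event in events:
    print(event)
```

Run against the full log, this shows two model start timeouts (2000 ms, on NC 1 and NC 0) followed 30 seconds later by an inference timeout (30000 ms) on NC 1.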