aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Error on serving model using tensorflow_model_server_neuron #823

Closed mostafafarzaneh closed 5 months ago

mostafafarzaneh commented 5 months ago

I compiled my model with tfnx.trace and served it using tensorflow_model_server_neuron. I followed this and installed tensorflow-model-server-neuronx. The server is inf2.xlarge.

Here is the server error:

2024-Jan-19 05:03:34.343999 31898:32693 ERROR NEFF:neff_parse NEFF version: 2.0, features: 0x100 are not supported. Currently supporting: 0x80000000000000ff
2024-Jan-19 05:03:34.344048 31898:32693 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/tmpfbnn2xkk/hlo_module.neff, err: 10
2024-01-19 05:03:34.344098: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at neuron_op.cc:
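For readers decoding the first log line: the two hex values look like feature bitmasks, one required by the compiled NEFF and one supported by the installed runtime. The sketch below assumes the runtime rejects a NEFF whenever a required bit is missing from the supported mask. That interpretation and the function name `unsupported_feature_bits` are assumptions for illustration, not the actual Neuron runtime code.

```python
def unsupported_feature_bits(required: int, supported: int) -> int:
    """Return the bits set in `required` that are not set in `supported`."""
    return required & ~supported


# Values copied from the error message above.
required = 0x100
supported = 0x80000000000000FF

missing = unsupported_feature_bits(required, supported)
if missing:
    # Under the stated assumption, a non-zero result means the NEFF was built
    # by a compiler that is newer than the runtime trying to load it.
    print(f"missing feature bits: {hex(missing)}")
```

Bit 8 (0x100) falls outside the low byte 0xff that the runtime advertises, which would be consistent with a compiler that is newer than the runtime.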

mostafafarzaneh commented 5 months ago

NeuronX Compiler version 2.12.68.0+4480452af

Python version 3.8.10
HWM version 2.12.0.0-422c9037c
NumPy version 1.24.4

Running on AMI ami-0b64b64d4bd7327ee
Running in region usw2-az3
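When comparing environments like the one above, it can help to list every pip-installed package whose name mentions "neuron", so the compiler side can be checked against the runtime. This is a minimal sketch using only the standard library. The helper name `neuron_packages` is hypothetical, and it only sees Python packages: the driver and runtime library are installed as system packages and have to be checked separately.

```python
from importlib import metadata


def neuron_packages(pairs=None):
    """Return {name: version} for packages whose name contains 'neuron'.

    `pairs` is an optional iterable of (name, version) tuples; when omitted,
    the packages installed in the current Python environment are inspected.
    """
    if pairs is None:
        pairs = (
            (dist.metadata["Name"], dist.version)
            for dist in metadata.distributions()
        )
    found = {
        name: version
        for name, version in pairs
        if name and "neuron" in name.lower()
    }
    return dict(sorted(found.items()))


if __name__ == "__main__":
    # Prints one "name==version" line per matching package (possibly none).
    for name, version in neuron_packages().items():
        print(f"{name}=={version}")
```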

awsilya commented 5 months ago

@mostafafarzaneh it looks like you have an old driver/runtime on your instance. Please follow the instructions here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/tensorflow-neuronx.html#setup-tensorflow-neuronx

Look under "install driver and tools"

mostafafarzaneh commented 5 months ago

@awsilya I'm using ami-0b64b64d4bd7327ee (Deep Learning Base Neuron AMI (Ubuntu 20.04) 20231211), so it should be up to date. Nevertheless, I did update the driver/runtime, but I still get the same error.

mostafafarzaneh commented 5 months ago

I changed the AMI to ami-02515bacfc7fafa88 (Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04) 20240102) and now the problem is gone.