aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

RuntimeError: Bad StatusOr access: INVALID_ARGUMENT: PJRT_Client_Create: error condition nullptr != (args)->client->Error(): Init: error condition !(num_devices > 0): #902

Open PrateekAg1511 opened 3 weeks ago

PrateekAg1511 commented 3 weeks ago

Hi,

I am facing this error when trying to trace a model using torch_neuronx 2.1.

RuntimeError: Bad StatusOr access: INVALID_ARGUMENT: PJRT_Client_Create: error condition nullptr != (args)->client->Error(): Init: error condition !(num_devices > 0):

Packages:

torch_neuronx : '2.1.2.2.1.0'

neuron-cc: NeuronX Compiler version 2.11.0.34+c5231f848

Python version 3.10.12

HWM version 2.11.0.2-e34678757

NumPy version 1.23.5

torch : '2.1.2+cu121'

torch_xla: '2.1.2'
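For anyone reproducing this, a version list like the one above can be gathered in one step. The following is a minimal sketch using only the standard library; the distribution names passed in (`torch-neuronx`, `neuronx-cc`, and so on) are assumptions inferred from the packages listed in this report and may differ in your environment.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust them to match `pip list` on your image.
    names = ["torch", "torch-xla", "torch-neuronx", "neuronx-cc", "numpy"]
    for name, version in collect_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting the printed lines into the issue makes it easier to spot a mismatch between the framework packages and the rest of the installed stack.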

Can someone help me debug this?

awsilya commented 2 weeks ago

@PrateekAg1511 are you running on a trn1/inf2 instance? Do you have the rest of the Neuron SDK installed?

The failed num_devices > 0 check in the error suggests that the Neuron driver is not installed. Did you follow the setup steps? https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx

PrateekAg1511 commented 2 weeks ago

@awsilya I am using the AWS SageMaker Neuron Image. It works fine when I use torch neuronx 1.3.

When I upgrade it to torch neuronx 2.1, I am getting this error.

The reason for moving to neuronx 2.1 is that when using neuronx 1.3, I am getting a warning that input tensors are not being used.

jluntamazon commented 2 weeks ago

> When I upgrade it to torch neuronx 2.1, I am getting this error.

The error you are running into indicates that the frontend framework cannot find any Neuron devices on the instance. This happens either because the instance type does not have any NeuronCores available (only trn1/inf2-type instances expose these devices) or because the driver is not installed.

Can you confirm if the NeuronCores are accessible by using the neuron-ls command-line tool?
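As a rough illustration of that check, the sketch below looks for Neuron device nodes and then runs neuron-ls if it is on the PATH. The `/dev/neuron*` pattern is an assumption about where the driver exposes its device nodes, and the helper names are hypothetical; treat it as a quick sanity check, not an official diagnostic.

```python
import glob
import shutil
import subprocess


def find_neuron_devices(pattern="/dev/neuron*"):
    """Return the sorted device nodes matching the pattern (empty if none)."""
    # Assumption: the Neuron driver exposes device nodes under this pattern.
    return sorted(glob.glob(pattern))


def run_neuron_ls(timeout=30):
    """Run neuron-ls if it is on PATH; return its output, or None if missing."""
    exe = shutil.which("neuron-ls")
    if exe is None:
        return None
    result = subprocess.run([exe], capture_output=True, text=True, timeout=timeout)
    # Combine both streams so error messages are not lost.
    return result.stdout + result.stderr


if __name__ == "__main__":
    devices = find_neuron_devices()
    print(f"device nodes: {devices or 'none found'}")
    output = run_neuron_ls()
    print(output if output is not None else "neuron-ls not found on PATH")
```

If no device nodes are found and neuron-ls is missing or reports nothing, that points to the instance type or driver installation, not to the trace call itself.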

> The reason for moving to neuronx 2.1 is that when using neuronx 1.3, I am getting a warning that input tensors are not being used.

It is unlikely that moving to neuronx 2.1 will resolve your issue. However, it is still a good idea to validate that the NeuronCores are accessible before you begin testing the trace functionality.