aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

On torch-neuronx 2.1 Beta import xla_backend fails #69

Closed ajayvohra2005 closed 3 months ago

ajayvohra2005 commented 4 months ago

Error:

I am getting the following error when importing xla_backend on torch-neuronx 2.1:

Exception has occurred: FileNotFoundError
[Errno 2] No such file or directory: 'libneuronpjrt-path'
  File "/home/ubuntu/efs/git/gpt2-fsdp/policies/wrapping.py", line 5, in <module>
    import torch_xla.distributed.xla_backend
  File "/home/ubuntu/efs/git/gpt2-fsdp/policies/__init__.py", line 2, in <module>
    from .wrapping import *
  File "/home/ubuntu/efs/git/gpt2-fsdp/train_fsdp.py", line 32, in <module>
    import policies
FileNotFoundError: [Errno 2] No such file or directory: 'libneuronpjrt-path'

Reproduce:

import torch_xla.distributed.xla_backend

Neuron OS packages

dpkg --list | grep neuron
ii  aws-neuronx-collectives                    2.20.11.0-c101c322e                      amd64        neuron_ccom built using CMake
ii  aws-neuronx-dkms                           2.15.9.0                                 amd64        aws-neuronx driver in DKMS format.
ii  aws-neuronx-oci-hook                       2.2.45.0                                 amd64        neuron_oci_hook built using CMake
ii  aws-neuronx-runtime-lib                    2.20.11.0-b7d33e68b                      amd64        neuron_runtime built using CMake
ii  aws-neuronx-tools                          2.17.0.0                                 amd64        Neuron profile and debug tools

Pip freeze

aws-neuronx-runtime-discovery==2.9
libneuronxla==2.0.755
neuronx-cc==2.12.68.0+4480452af
neuronx-hwm==2.12.0.0+422c9037c
torch-neuronx==2.1.1.2.0.1b0

Env

PJRT_DEVICE=NEURON

Hardware

trn1.32xlarge

OS

uname -a
Linux ip-172-31-53-77 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP Fri Nov 17 21:07:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

aws-donkrets commented 4 months ago

Hi ajayvohra2005 - thx for reporting this. Looks like a potential missing package on your system or a config error. We are looking into it.

ajayvohra2005 commented 4 months ago

This only happens when the code is run in Visual Studio Code inside the virtual environment with torch 2.1.* installed. It does not happen when the code is run in a terminal.

jeffhataws commented 4 months ago

Hi @ajayvohra2005 ,

libneuronpjrt-path is a utility that is installed with libneuronxla and should be in the <virtual env>/bin directory. Please check whether this path is in your VS Code settings.

(aws_neuron_venv_pytorch) ubuntu@ip-10-0-8-190:~$ which libneuronpjrt-path
/home/ubuntu/aws_neuron_venv_pytorch/bin/libneuronpjrt-path
(aws_neuron_venv_pytorch) ubuntu@ip-10-0-8-190:~$ libneuronpjrt-path 
/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libneuronpjrt.so
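
The same check can be run from inside the process that VS Code launches, which helps show whether that process sees the same PATH as the terminal. The following is a minimal sketch using only the Python standard library; the helper names `find_on_path` and `describe_environment` are hypothetical and are not part of libneuronxla or torch-neuronx.

```python
import os
import shutil
import sys


def find_on_path(tool_name, path=None):
    """Return the full path of tool_name if it is found on PATH, else None."""
    return shutil.which(tool_name, path=path)


def describe_environment(tool_name="libneuronpjrt-path"):
    """Collect facts that may explain a FileNotFoundError for tool_name."""
    # Directory that holds the running interpreter, e.g. <virtual env>/bin.
    interpreter_bin = os.path.dirname(sys.executable)
    path_entries = os.environ.get("PATH", "").split(os.pathsep)
    return {
        "interpreter_bin": interpreter_bin,
        "interpreter_bin_on_path": interpreter_bin in path_entries,
        "tool_location": find_on_path(tool_name),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `tool_location` is `None` when run from VS Code but set when run from an activated terminal, the two environments differ in PATH rather than in the installed packages.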

ajayvohra2005 commented 4 months ago

Yes, the library is there. As noted, the issue does not happen if the code is run in the terminal in the venv. Please try it in VS Code and see if you can reproduce it. I selected the virtual environment in VS Code by the usual Shift + Cmd + P -> Python: Select Interpreter -> Find the Python from the virtual environment ...

I tried directly specifying the venv path in the VS Code settings as well.

jeffhataws commented 4 months ago

I was able to reproduce the issue. After I select the virtual environment (by the usual Shift + Cmd + P -> Python: Select Interpreter -> Find the path to Python from the virtual environment), I see the error that you report. I then tried directly activating the environment in the terminal using "source /bin/activate" and ran the code again, and the error no longer occurs. Will you try this solution of activating the virtual environment in the terminal in addition to the usual selection of the interpreter?
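
For cases where activating the environment in the terminal is inconvenient, a possible alternative is to put the interpreter's own bin directory on PATH from inside the script before the import. This is only a sketch, and it rests on the assumption that the error comes from libneuronpjrt-path not being on PATH. The thread confirms only that activating the venv works, and the helper name `prepend_interpreter_bin_to_path` is hypothetical.

```python
import os
import sys


def prepend_interpreter_bin_to_path(environ=None, executable=None):
    """Put the directory of the Python executable first on PATH if it is missing.

    Assumption: console scripts such as libneuronpjrt-path live next to the
    interpreter, as in a standard virtual environment layout.
    """
    environ = os.environ if environ is None else environ
    executable = sys.executable if executable is None else executable
    bin_dir = os.path.dirname(os.path.abspath(executable))
    entries = [e for e in environ.get("PATH", "").split(os.pathsep) if e]
    if bin_dir not in entries:
        environ["PATH"] = os.pathsep.join([bin_dir] + entries)
    return bin_dir


# Usage (at the top of the entry script, before any torch_xla import):
#   prepend_interpreter_bin_to_path()
#   import torch_xla.distributed.xla_backend
```

Changing `os.environ["PATH"]` affects the current process and any child processes it starts afterwards, so the call has to come before the failing import.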

jeffhataws commented 3 months ago

@ajayvohra2005 thanks for filing the issue. Let us know if you are still having problems after following the solution above. For now I will close the ticket.

ajayvohra2005 commented 3 months ago

I tried the suggested workaround and it works. This is not required for torch-neuronx==1.13, so I am curious why the behavior changed.