aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Dependency conflict when running Llama-2-13b autoregressive sampling on Inf2 #47

Open mahendra-paranjpe opened 9 months ago

mahendra-paranjpe commented 9 months ago

I am running the notebook https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb on an inf2.48xlarge instance.

The last block fails at line 4 (from transformers_neuronx.llama.model import LlamaForSampling), which results in:

>>> from transformers_neuronx.llama.model import LlamaForSampling
2023-Sep-27 06:59:32.0474 22340:22340 ERROR  TDRV:tdrv_get_dev_info                       No neuron device available
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/llama/model.py", line 17, in <module>
    from transformers_neuronx import decoder
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/decoder.py", line 18, in <module>
    from transformers_neuronx import compiler
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/compiler.py", line 33, in <module>
    from libneuronxla import neuron_xla_compile
ImportError: cannot import name 'neuron_xla_compile' from 'libneuronxla' (/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/libneuronxla/__init__.py)
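
The ImportError above means the installed libneuronxla imports fine but does not define the name that transformers_neuronx expects. A minimal sketch for checking that directly is shown here; `module_exports` is a hypothetical helper written for this thread, not part of any Neuron package, and it only reports whether the attribute exists.

```python
import importlib


def module_exports(module_name, attr):
    """Return True/False if the module imports, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # module is missing or fails to import
    return hasattr(module, attr)


if __name__ == "__main__":
    # Module and attribute names are taken from the traceback above.
    print(module_exports("libneuronxla", "neuron_xla_compile"))
```

A result of False would match the traceback (module present, name absent), which points at a version mismatch between libneuronxla and transformers-neuronx rather than a missing package.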

https://github.com/huggingface/optimum-neuron/issues/213 suggests updating to the latest version of torch-neuronx, and https://github.com/aws-neuron/transformers-neuronx/issues/33 suggests a specific version, torch-neuronx 1.13.1.1.10.0.

When I tried installing that specific version, it failed with the following error:

python -m pip install torch-neuronx==1.13.1.1.10.0 -U
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting torch-neuronx==1.13.1.1.10.0
  Using cached https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-1.13.1.1.10.0-py3-none-any.whl (2.4 MB)
Requirement already satisfied: torch==1.13.* in ./aws_neuron_venv_pytorch/lib/python3.7/site-packages (from torch-neuronx==1.13.1.1.10.0) (1.13.1)
INFO: pip is looking at multiple versions of torch-neuronx to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement torch-xla==1.13.1+torchneurona (from torch-neuronx) (from versions: 1.0, 1.11.0+torchneuron2, 1.11.0+torchneuron3, 1.12.0+torchneuron3, 1.13.0+torchneuron3, 1.13.0+torchneuron4, 1.13.0+torchneuron5, 1.13.1+torchneuron6, 1.13.1+torchneuron7, 1.13.1+torchneuron8)
ERROR: No matching distribution found for torch-xla==1.13.1+torchneurona
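
The resolver failure reduces to a membership check: torch-neuronx 1.13.1.1.10.0 requires torch-xla==1.13.1+torchneurona, and that exact version is not among the ones pip found. The sketch below reproduces the check with the versions copied from the error message. `pin_is_available` is a hypothetical helper, and plain string comparison is a simplification of pip's real version matching.

```python
# Versions copied verbatim from the pip error message above.
AVAILABLE_TORCH_XLA = [
    "1.0",
    "1.11.0+torchneuron2",
    "1.11.0+torchneuron3",
    "1.12.0+torchneuron3",
    "1.13.0+torchneuron3",
    "1.13.0+torchneuron4",
    "1.13.0+torchneuron5",
    "1.13.1+torchneuron6",
    "1.13.1+torchneuron7",
    "1.13.1+torchneuron8",
]


def pin_is_available(pin, available):
    """Exact-match check for an '==' pin (simplified: no version normalization)."""
    return pin in available


if __name__ == "__main__":
    for pin in ("1.13.1+torchneurona", "1.13.1+torchneuron8"):
        print(pin, "->", pin_is_available(pin, AVAILABLE_TORCH_XLA))
```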

Additional information on the versions currently available:

pip index versions torch-neuronx
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
torch-neuronx (1.13.1.1.11.0)
Available versions: 1.13.1.1.11.0, 1.13.1.1.10.1, 1.13.1.1.10.0, 1.13.1.1.9.1, 1.13.1.1.9.0, 1.13.1.1.8.0, 1.13.1.1.7.0, 1.13.0.1.6.1, 1.13.0.1.6.0, 1.13.0.1.5.0, 1.13.0.1.4.0, 1.12.0.1.4.0, 1.11.0.1.2.0, 1.11.0.1.1.1, 1.0
  INSTALLED: 1.13.1.1.9.1
  LATEST:    1.13.1.1.11.0

pip index versions torch-xla
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
torch-xla (1.13.1+torchneuron8)
Available versions: 1.13.1+torchneuron8, 1.13.1+torchneuron7, 1.13.1+torchneuron6, 1.13.0+torchneuron5, 1.13.0+torchneuron4, 1.13.0+torchneuron3, 1.12.0+torchneuron3, 1.11.0+torchneuron3, 1.11.0+torchneuron2, 1.0
  INSTALLED: 1.13.1+torchneuron8
  LATEST:    1.13.1+torchneuron8

The following packages are installed:

anyio==3.7.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
attrs==23.1.0
aws-neuronx-runtime-discovery==2.9
awscli==1.29.54
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
boto3==1.28.54
botocore==1.31.54
cached-property==1.5.2
cachetools==5.3.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
cloud-tpu-client==0.10
colorama==0.4.4
comm==0.1.4
debugpy==1.7.0
decorator==5.1.1
defusedxml==0.7.1
docutils==0.16
ec2-metadata==2.10.0
entrypoints==0.4
environment-kernels==1.2.0
exceptiongroup==1.1.3
fastjsonschema==2.18.0
filelock==3.12.2
fsspec==2023.1.0
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.23.0
google-auth-httplib2==0.1.1
googleapis-common-protos==1.60.0
httplib2==0.22.0
huggingface-hub==0.16.4
idna==3.4
importlib-metadata==6.7.0
importlib-resources==5.12.0
iniconfig==2.0.0
ipykernel==6.16.2
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==8.1.1
islpy==2022.2.1
jedi==0.19.0
Jinja2==3.1.2
jmespath==1.0.1
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-server==1.24.0
jupyter_client==7.4.9
jupyter_core==4.12.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.9
libneuronxla==0.5.413
lockfile==0.12.2
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mistune==3.0.1
nbclassic==1.0.0
nbclient==0.7.4
nbconvert==7.6.0
nbformat==5.8.0
nest-asyncio==1.5.8
networkx==2.6.3
neuronx-cc==2.9.0.16+fa12ba55a
neuronx-hwm==2.9.0.1+f79d59e7b
notebook==6.5.6
notebook_shim==0.2.3
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
packaging==23.1
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pgzip==0.3.5
pickleshare==0.7.5
Pillow==9.5.0
pkgutil_resolve_name==1.3.10
pluggy==1.2.0
prometheus-client==0.17.1
prompt-toolkit==3.0.39
protobuf==3.20.3
psutil==5.9.5
ptyprocess==0.7.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
Pygments==2.16.1
pyparsing==3.1.1
pyrsistent==0.19.3
pytest==7.4.2
python-daemon==3.0.1
python-dateutil==2.8.2
PyYAML==6.0.1
pyzmq==24.0.1
qtconsole==5.4.4
QtPy==2.4.0
regex==2023.8.8
requests==2.31.0
requests-unixsocket==0.3.0
rsa==4.7.2
s3transfer==0.6.2
safetensors==0.3.3
scipy==1.7.3
Send2Trash==1.8.2
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
soupsieve==2.4.1
terminado==0.17.1
tinycss2==1.2.1
tokenizers==0.13.3
tomli==2.0.1
torch==1.13.1
torch-neuronx==1.13.1.1.9.1
torch-xla==1.13.1+torchneuron8
torchvision==0.14.1
tornado==6.2
tqdm==4.66.1
traitlets==5.9.0
transformers==4.30.2
transformers-neuronx==0.7.84
typing_extensions==4.7.1
uritemplate==3.0.1
urllib3==1.26.16
wcwidth==0.2.6
webencodings==0.5.1
websocket-client==1.6.1
wget==3.2
widgetsnbextension==4.0.9
zipp==3.15.0
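
Only a few entries in this list bear on the import error. As a rough sketch, the helper below pulls those entries out of `pip freeze`-style text so they can be compared at a glance. The package set in `NEURON_STACK` is an assumption about which packages matter here, and `neuron_stack_versions` is a hypothetical name.

```python
# Assumed set of packages relevant to the import error; adjust as needed.
NEURON_STACK = {
    "torch",
    "torch-xla",
    "torch-neuronx",
    "transformers-neuronx",
    "libneuronxla",
    "neuronx-cc",
}


def neuron_stack_versions(freeze_text):
    """Parse 'name==version' lines and keep only the packages in NEURON_STACK."""
    versions = {}
    for line in freeze_text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep and name.lower() in NEURON_STACK:
            versions[name.lower()] = version
    return versions


# Sample lines copied from the package list above.
sample = """\
libneuronxla==0.5.413
numpy==1.21.6
neuronx-cc==2.9.0.16+fa12ba55a
torch==1.13.1
torch-neuronx==1.13.1.1.9.1
torch-xla==1.13.1+torchneuron8
transformers-neuronx==0.7.84
"""

if __name__ == "__main__":
    for name, version in sorted(neuron_stack_versions(sample).items()):
        print(f"{name:22s} {version}")
```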

awsilya commented 9 months ago

@mahendra-paranjpe the most common reason for this error:

tdrv_get_dev_info No neuron device available

is that the code is not running on the right instance type. Are you running on inf2?

mahendra-paranjpe commented 9 months ago

Yes, it is an inf2.48xlarge.

mrnikwaws commented 8 months ago

Hi @mahendra-paranjpe,

This can indicate that the installation is incomplete, e.g. missing drivers. Please check: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html. Note the system packages (rpm/dpkg) required for installation (e.g. https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22 "Drivers and Tools"), and make sure you are running on one of the supported OS versions.

If you think everything is installed correctly, it is possible the driver did not load for some reason. Try:

sudo modprobe neuron

... then retry your test. If neither of those works please post back here.
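
Before and after the modprobe, it can help to confirm whether the kernel module is loaded and whether any device nodes exist. The following is a sketch only: the module name `neuron` comes from the command above, while the `/dev/neuron*` device-node pattern is an assumption about how the driver names its devices, and both function names are hypothetical.

```python
import glob


def neuron_device_nodes(pattern="/dev/neuron*"):
    """List paths matching the pattern (assumed naming: /dev/neuron0, ...)."""
    return sorted(glob.glob(pattern))


def module_loaded(name, proc_modules_text):
    """True if `name` is the first field of any line in /proc/modules-style text."""
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            loaded = module_loaded("neuron", f.read())
    except OSError:
        loaded = False  # not Linux, or /proc is unavailable
    print("neuron module loaded:", loaded)
    print("device nodes:", neuron_device_nodes())
```

If the module is loaded but no device nodes appear, that would be worth including when posting back.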

aws-donkrets commented 6 months ago

Hi @mahendra-paranjpe - we haven't heard back on whether mrnikwaws's comments resolved your issue, so I am closing this out for now. If you are still encountering a problem, please reopen or create a new ticket.