TypeError: Got unsupported ScalarType BFloat16

a-szegel commented 10 months ago

Hello Everyone,

I am trying to follow the directions in https://aws.amazon.com/blogs/machine-learning/maximize-stable-diffusion-performance-and-lower-inference-costs-with-aws-inferentia2/. I am not sure what I am doing wrong and would love some help! Thanks in advance!

Simple Env

My environment looks as follows:

instance: inf2.8xlarge
ami: aws ec2 describe-images --region us-west-2 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text

Error

> source /opt/aws_neuron_venv_pytorch/bin/activate
> jupyter nbconvert --to script hf_pretrained_sd2_512_inference.ipynb 
> cp hf_pretrained_sd2_512_inference.py seth_test.py
> python seth_test.py 
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 210524.91it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ubuntu/Developer/run/seth_test.py:189 in <module>                                          │
│                                                                                                  │
│   186 encoder_hidden_states_1b = torch.randn([1, 77, 1024], dtype=DTYPE)                         │
│   187 example_inputs = sample_1b, timestep_1b, encoder_hidden_states_1b                          │
│   188                                                                                            │
│ ❱ 189 unet_neuron = torch_neuronx.trace(                                                         │
│   190 │   unet,                                                                                  │
│   191 │   example_inputs,                                                                        │
│   192 │   compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'unet'),                          │
│                                                                                                  │
│ /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:265 in  │
│ trace                                                                                            │
│                                                                                                  │
│   262 │   │   hlo_filename = os.path.join(model_dir, 'graph.hlo')                                │
│   263 │   │                                                                                      │
│   264 │   │   # Write weights to disk                                                            │
│ ❱ 265 │   │   weight_paths = write_params(model_dir, constant_parameter_tensors)                 │
│   266 │   │                                                                                      │
│   267 │   │   table = {                                                                          │
│   268 │   │   │   "model_files": "graph.hlo",                                                    │
│                                                                                                  │
│ /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch_neuronx/xla_impl/trace.py:306 in  │
│ write_params                                                                                     │
│                                                                                                  │
│   303 │                                                                                          │
│   304 │   # Write tensor data to disk                                                            │
│   305 │   for name, weight in weights.items():                                                   │
│ ❱ 306 │   │   np.save(f'{directory}/weights/{name}.npy', weight.numpy())                         │
│   307 │                                                                                          │
│   308 │   # Write mapping file. Paths are relative to the directory                              │
│   309 │   weight_paths = {name: f'weights/{name}.npy' for name in weights}                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: Got unsupported ScalarType BFloat16

PIP versions

Python Dependencies:

(aws_neuron_venv_pytorch) ubuntu@ip-172-31-1-65:~/Developer/run$ pip freeze
absl-py==1.4.0
accelerate==0.16.0
aiofiles==22.1.0
aiohttp==3.8.4
aiosignal==1.3.1
aiosqlite==0.19.0
amqp==5.1.1
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
astroid==2.15.4
asttokens==2.2.1
async-timeout==4.0.2
attrs==23.1.0
Automat==22.10.0
aws-neuronx-runtime-discovery==2.9
awscli==1.27.126
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
billiard==3.6.4.0
bleach==6.0.0
boto3==1.26.126
botocore==1.29.126
build==0.10.0
cachetools==5.3.0
celery==5.2.7
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.2.0
cloud-tpu-client==0.10
cloudpickle==2.2.1
cmake==3.26.3
colorama==0.4.4
comm==0.1.3
constantly==15.1.0
contourpy==1.0.7
cryptography==40.0.2
cssselect==1.2.0
cycler==0.11.0
dask==2023.4.1
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
diffusers==0.14.0
dill==0.3.6
distlib==0.3.6
docutils==0.16
dparse==0.6.2
exceptiongroup==1.1.1
executing==1.2.0
fastapi==0.95.1
fastjsonschema==2.16.3
filelock==3.12.0
fonttools==4.39.3
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.4.0
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.17.3
google-auth-httplib2==0.1.0
googleapis-common-protos==1.59.0
httpie==3.2.1
httplib2==0.22.0
huggingface-hub==0.14.1
hyperlink==21.0.0
idna==3.4
imageio==2.28.1
importlib-metadata==6.6.0
importlib-resources==5.12.0
incremental==22.10.0
iniconfig==2.0.0
install==1.3.5
ipykernel==6.22.0
ipython==8.12.2
ipython-genutils==0.2.0
ipywidgets==8.0.6
islpy==2022.1.1
isoduration==20.11.0
isort==5.12.0
itemadapter==0.8.0
itemloaders==1.1.0
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter-ydoc==0.2.4
jupyter_client==8.2.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_fileid==0.9.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.3
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
jupyterlab_server==2.22.1
kiwisolver==1.4.4
kombu==5.2.4
lazy-object-proxy==1.9.0
libneuronxla==0.5.205
llvmlite==0.40.0
locket==1.0.0
lockfile==0.12.2
lxml==4.9.2
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline==0.1.6
mccabe==0.7.0
mdurl==0.1.2
mistune==2.0.5
multidict==6.0.4
nbclassic==1.0.0
nbclient==0.7.4
nbconvert==7.3.1
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==2.6.3
neuronx-cc==2.6.0.19+3d819e565
neuronx-hwm==2.6.0.0+826e77395
notebook==6.5.4
notebook_shim==0.2.3
numba==0.57.0
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
opencv-python==4.7.0.72
packaging==21.3
pandas==2.0.1
pandocfilters==1.5.0
parsel==1.8.1
parso==0.8.3
partd==1.4.0
pexpect==4.8.0
pgzip==0.3.4
pickleshare==0.7.5
Pillow==9.5.0
pip-tools==6.13.0
pipenv==2023.2.4
pkg_resources==0.0.0
pkgutil_resolve_name==1.3.10
platformdirs==3.5.0
plotly==5.14.1
pluggy==1.0.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
Protego==0.2.1
protobuf==3.20.3
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.7
PyDispatcher==2.0.7
Pygments==2.15.1
pylint==2.17.3
pyOpenSSL==23.1.1
pyparsing==3.0.9
pyproject_hooks==1.0.0
pyrsistent==0.19.3
PySocks==1.7.1
pytest==7.3.1
python-daemon==3.0.1
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
PyYAML==5.4.1
pyzmq==25.0.2
queuelib==1.6.2
regex==2023.5.5
requests==2.29.0
requests-file==1.5.1
requests-toolbelt==1.0.0
requests-unixsocket==0.3.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.3.5
rsa==4.7.2
ruamel.yaml==0.17.22
ruamel.yaml.clib==0.2.7
s3transfer==0.6.0
safetensors==0.3.1
scikit-learn==1.2.2
scipy==1.7.3
Scrapy==2.8.0
seaborn==0.12.2
Send2Trash==1.8.2
service-identity==21.1.0
shap==0.41.0
six==1.16.0
slicer==0.0.7
sniffio==1.3.0
soupsieve==2.4.1
stack-data==0.6.2
starlette==0.26.1
tenacity==8.2.2
terminado==0.17.1
threadpoolctl==3.1.0
tinycss2==1.2.1
tldextract==3.4.1
tokenizers==0.13.3
toml==0.10.2
tomli==2.0.1
tomlkit==0.11.8
toolz==0.12.0
torch==1.13.1
torch-neuronx==1.13.0.1.6.1
torch-xla==1.13.0+torchneuron5
torchvision==0.14.0
tornado==6.3.1
tqdm==4.65.0
traitlets==5.9.0
transformers==4.30.2
Twisted==22.10.0
typing_extensions==4.5.0
tzdata==2023.3
uri-template==1.2.0
uritemplate==3.0.1
urllib3==1.26.15
vine==5.0.0
virtualenv==20.23.0
virtualenv-clone==0.5.7
w3lib==2.1.1
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
widgetsnbextension==4.0.7
wrapt==1.15.0
y-py==0.5.9
yarl==1.9.2
ypy-websocket==0.8.2
zipp==3.15.0
zope.interface==6.0

Shellmode commented 10 months ago

Same problem. I tried to deploy in us-west-2; my EC2 AMI ID is ami-0763990f1c2645d21 and the AMI name is "Deep Learning AMI Neuron PyTorch 1.13.0 (Amazon Linux 2) 20230504".

I also tried hf_pretrained_sd15_512_inference.ipynb and got: module 'torch_neuronx' has no attribute 'async_load'

jyang-aws commented 9 months ago

@Shellmode The async_load error you mentioned is caused by a newer API call that is not supported until Neuron release 2.12.0 (07/19). Details: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/prev/rn.html#id5. Your AMI is dated 05/04, before that release, so it does not include it. Could you try updating your Neuron packages to a newer release (say 2.13.2)? https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20.html#get-started-with-latest-release-of-pytorch-neuron-torch-neuronx

python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
python -m pip install  --force-reinstall neuronx-cc==2.* torch-neuronx torchvision
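
To confirm whether the installed torch_neuronx exposes async_load before running the notebook, a quick feature check along these lines may help. This is only a sketch: the helper name module_has_attr is made up for illustration, and it tests nothing beyond whether the attribute exists on the imported module.

```python
import importlib


def module_has_attr(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is not installed in this environment.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    if module_has_attr("torch_neuronx", "async_load"):
        print("torch_neuronx.async_load is available")
    else:
        print("torch_neuronx.async_load is missing; try upgrading the Neuron packages")
```

If the check reports the attribute as missing inside the activated virtual environment, the upgrade commands above are the next thing to try.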

jyang-aws commented 9 months ago

@a-szegel From your package list, I can see that your AMI uses an earlier Neuron release. Could you try updating the packages? The steps are the same as above:

python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
python -m pip install  --force-reinstall neuronx-cc==2.* torch-neuronx torchvision

between these two lines:

source /opt/aws_neuron_venv_pytorch/bin/activate
jupyter nbconvert --to script hf_pretrained_sd2_512_inference.ipynb
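
After the reinstall, it can be useful to verify which versions actually ended up in the virtual environment. The sketch below uses the standard-library importlib.metadata (available from Python 3.8, which matches the venv in the traceback). The package names are taken from the pip freeze output in this thread; the helper name installed_versions is hypothetical.

```python
from importlib import metadata

# Distribution names copied from the `pip freeze` output in this thread.
NEURON_PACKAGES = ["neuronx-cc", "torch-neuronx", "torch-xla", "libneuronxla", "torch"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(NEURON_PACKAGES).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Comparing this output with the pip freeze list from the original report shows at a glance whether the Neuron packages moved to a newer release.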

a-szegel commented 9 months ago

Thank you! That fixes it. Is there any way to have our blog posts and tutorials pin dependencies more precisely so they always work? Instead of the tutorial saying to clone main, we could clone a tag. In that tag, we can have a Python requirements file that locks every dependency. We can also fix versions for non-Python dependencies so we don't get any surprises from AMI updates. I think it is very important that our examples work out of the box so people who are new to Graviton + ML have a positive experience.
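
As a rough illustration of the pinning idea, the sketch below filters pip freeze output down to the packages a tutorial would want to lock. The function name select_pins and the prefix list are assumptions made for this example, not part of any existing tooling in the samples repository.

```python
def select_pins(freeze_text, prefixes):
    """Return `name==version` lines whose package name starts with one of `prefixes`."""
    pins = []
    for line in freeze_text.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # skip lines without an exact `==` pin
        name = line.split("==", 1)[0].lower()
        if name.startswith(tuple(prefixes)):
            pins.append(line)
    return pins


# A few lines copied from the `pip freeze` output earlier in this thread.
sample = """\
numpy==1.21.6
neuronx-cc==2.6.0.19+3d819e565
torch==1.13.1
torch-neuronx==1.13.0.1.6.1
torch-xla==1.13.0+torchneuron5
tqdm==4.65.0
"""

for pin in select_pins(sample, ["neuronx-", "torch"]):
    print(pin)
```

The selected lines could be written to a requirements file that ships with a tagged release, so that pip install -r reproduces a known-good environment.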

Shellmode commented 9 months ago

Thanks, fixed.

aws-neuron / aws-neuron-samples