aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Very long, empty start-up #994

Open nandeeka opened 15 hours ago

nandeeka commented 15 hours ago

I am trying to implement a kernel (unmerged_lora) using just PyTorch. When I open the profile, it begins with a very long, empty start-up period: 0.77 ms of the 2.36 ms total for the kernel. I understand that some start-up is required, but this looks like a bug; for example, merged_lora does not suffer from the same problem.

The buggy profile looks something like:

[Screenshot of the profile, taken 2024-09-30 at 5:46:42 PM]
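
For scale, the two timings quoted above can be turned into a share of the profile. This is only arithmetic on the reported numbers (0.77 ms and 2.36 ms); the variable names are illustrative and nothing here is measured independently.

```python
# Timings quoted in the report above, in milliseconds.
startup_ms = 0.77   # empty start-up region at the beginning of the profile
total_ms = 2.36     # total time shown for the kernel

# Fraction of the profile spent in the empty start-up region.
startup_fraction = startup_ms / total_ms

# Time that remains once the start-up region is excluded.
remaining_ms = total_ms - startup_ms

print(f"start-up share of profile: {startup_fraction:.1%}")
print(f"time left after start-up: {remaining_ms:.2f} ms")
```

Roughly a third of the captured profile is the empty region, which is why it stands out.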

The source code is:

import os

import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as ni
import torch
from torch_neuronx import nki_jit
from torch_xla.core import xla_model as xm

def merged_lora(X, W):
  return torch.matmul(W, X)

def unmerged_lora(X, W0, loraA, loraB, lora_dropout, lora_alpha, R):
  dropout_layer = torch.nn.Dropout(p=lora_dropout)
  scaling = lora_alpha / R

  delta_W = torch.matmul(loraB, torch.matmul(loraA, dropout_layer(X)))

  return torch.matmul(W0, X) + delta_W * scaling

def main():
  B = 2048
  D = 4096

  # Parameters from Idefics2
  # Source: https://colab.research.google.com/drive/1NtcTgRbSBKN7pYD3Vdx1j9m8pt3fhFDB?usp=sharing#scrollTo=SMujNa2vKbZd
  R = 8
  lora_alpha = 8
  lora_dropout = 0.1

  device = xm.xla_device()
  cpu = torch.device('cpu')

  W0 = torch.randn(D, D, dtype=torch.float16).to(device)
  X = torch.randn(D, B, dtype=torch.float16).to(device)
  loraA = torch.randn(R, D, dtype=torch.float16).to(device)
  loraB = torch.randn(D, R, dtype=torch.float16).to(device)

  H = unmerged_lora(X, W0, loraA, loraB, lora_dropout, lora_alpha, R)
  # H = merged_lora(X, W0)

  H = H.to(device=cpu)

  print(H)

if __name__ == "__main__":
  os.environ["NEURON_FRAMEWORK_DEBUG"] = "1"
  os.environ["NEURON_CC_FLAGS"]= " --disable-internal-io-dge  "

  main()
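
As a rough sanity check on the shapes used in main(), the sketch below counts only matmul FLOPs for the base path and for the two extra LoRA matmuls. It assumes a multiply-add counts as 2 FLOPs and ignores dropout, scaling, the final addition, and all memory traffic; matmul_flops is a hypothetical helper, not part of any Neuron or PyTorch API.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product, counting a multiply-add as 2."""
    return 2 * m * k * n


# Shapes from main(): X is (D, B), W0 is (D, D), loraA is (R, D), loraB is (D, R).
B, D, R = 2048, 4096, 8

# Base path: W0 @ X.
base_flops = matmul_flops(D, D, B)

# LoRA path: loraA @ X gives (R, B), then loraB @ (...) gives (D, B).
lora_flops = matmul_flops(R, D, B) + matmul_flops(D, R, B)

# Extra matmul work of unmerged_lora relative to merged_lora.
overhead = lora_flops / base_flops

print(f"base matmul FLOPs: {base_flops:,}")
print(f"extra LoRA matmul FLOPs: {lora_flops:,}")
print(f"relative overhead: {overhead:.4%}")
```

Under these assumptions the LoRA matmuls add well under 1% of the base matmul work, so the FLOP count alone would not account for a start-up region of the size reported.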

My profiler version is:

neuron-profile 2.19.0.0%kaena-tools/2.19@c48a122 built on 2024-08-02T17:21:14Z

And my pip freeze is:

absl-py==2.1.0
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
amqp==5.2.0
annotated-types==0.7.0
ansicolors==1.1.8
anyio==4.4.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
astroid==3.2.4
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==24.2.0
Automat==24.8.1
aws-neuronx-runtime-discovery==2.9
awscli==1.34.19
babel==2.16.0
beautifulsoup4==4.12.3
billiard==4.2.0
bleach==6.1.0
boto3==1.35.19
botocore==1.35.19
build==1.2.2
cachetools==5.5.0
celery==5.4.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.1
click-plugins==1.1.1
click-repl==0.3.0
cloud-tpu-client==0.10
cloudpickle==3.0.0
cmake==3.30.3
colorama==0.4.6
comm==0.2.2
constantly==23.10.4
contourpy==1.3.0
cryptography==43.0.1
cssselect==1.2.0
cycler==0.12.1
dask==2024.9.0
debugpy==1.8.5
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.8
distlib==0.3.8
docutils==0.16
dparse==0.6.3
ec2-metadata==2.13.0
entrypoints==0.4
environment-kernels==1.2.0
exceptiongroup==1.2.2
executing==2.1.0
fastapi==0.114.2
fastjsonschema==2.20.0
filelock==3.16.0
fonttools==4.53.1
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.9.0
google-api-core==1.34.1
google-api-python-client==1.8.0
google-auth==2.34.0
google-auth-httplib2==0.2.0
googleapis-common-protos==1.65.0
h11==0.14.0
httpcore==1.0.5
httpie==3.2.3
httplib2==0.22.0
httpx==0.27.2
hyperlink==21.0.0
idna==3.10
imageio==2.35.1
importlib_metadata==8.5.0
incremental==24.7.2
iniconfig==2.0.0
ipykernel==6.29.5
ipython==8.27.0
ipywidgets==8.1.5
islpy==2023.2.5
isoduration==20.11.0
isort==5.13.2
itemadapter==0.9.0
itemloaders==1.3.1
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
json5==0.9.25
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.13
kiwisolver==1.4.7
kombu==5.4.1
libneuronxla==2.0.4115.0
llvmlite==0.43.0
locket==1.0.0
lockfile==0.12.2
lxml==5.3.0
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
matplotlib-inline==0.1.7
mccabe==0.7.0
mdurl==0.1.2
mistune==3.0.2
ml-dtypes==0.2.0
mpmath==1.3.0
multidict==6.1.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==2.8.8
neuronx-cc==2.15.128.0+56dc5a86
neuronx-distributed==0.9.0
neuronx-distributed-training==1.0.0
notebook==7.2.2
notebook_shim==0.2.4
numba==0.60.0
numpy==1.25.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
oauth2client==4.1.3
opencv-python==4.10.0.84
overrides==7.7.0
packaging==21.3
pandas==2.2.2
pandocfilters==1.5.1
papermill==2.6.0
parsel==1.9.1
parso==0.8.4
partd==1.4.2
pexpect==4.9.0
pgzip==0.3.5
pillow==10.4.0
pip-tools==7.4.1
pipenv==2024.0.2
platformdirs==4.3.3
plotly==5.24.1
pluggy==1.5.0
prometheus_client==0.20.0
prompt_toolkit==3.0.47
Protego==0.3.1
protobuf==3.20.3
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
pydantic==2.9.1
pydantic_core==2.23.3
PyDispatcher==2.0.7
Pygments==2.18.0
pylint==3.2.7
pyOpenSSL==24.2.1
pyparsing==3.1.4
pyproject_hooks==1.1.0
PySocks==1.7.1
pytest==8.3.3
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
queuelib==1.7.0
referencing==0.35.1
requests==2.31.0
requests-file==2.1.0
requests-toolbelt==1.0.0
requests-unixsocket==0.3.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.8.1
rpds-py==0.20.0
rsa==4.7.2
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
s3transfer==0.10.2
safetensors==0.4.5
safety==2.3.5
scikit-learn==1.5.2
scipy==1.11.2
Scrapy==2.11.2
seaborn==0.13.2
Send2Trash==1.8.3
service-identity==24.1.0
shap==0.46.0
six==1.16.0
slicer==0.0.8
sniffio==1.3.1
soupsieve==2.6
stack-data==0.6.3
starlette==0.38.5
sympy==1.13.2
tenacity==9.0.0
terminado==0.18.1
threadpoolctl==3.5.0
tinycss2==1.3.0
tldextract==5.1.2
tomli==2.0.1
tomlkit==0.13.2
toolz==0.12.1
torch==2.1.2
torch-neuronx==2.1.2.2.3.0
torch-xla==2.1.4
torchvision==0.16.2
tornado==6.4.1
tqdm==4.66.5
traitlets==5.14.3
triton==2.1.0
Twisted==24.7.0
types-python-dateutil==2.9.0.20240906
typing_extensions==4.12.2
tzdata==2024.1
uri-template==1.3.0
uritemplate==3.0.1
urllib3==2.2.3
vine==5.1.0
virtualenv==20.26.4
w3lib==2.2.1
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
wget==3.2
widgetsnbextension==4.0.13
yarl==1.11.1
zipp==3.20.2
zope.interface==7.0.3

aws-serina-tan commented 14 hours ago

Hi Nandeeka, did you use --profile-nth-exec=2 when running neuron-profile capture? This argument captures the profile for the second iteration instead of the first. The first iteration of a NEFF normally has a ~1 ms start-up time related to the GpSimd cores loading libraries from device memory. I tried this argument on my end and got a profile with a 1.58 ms runtime. Let me know if this doesn't work for you even with the argument.

FYI, there is also a short note on this in the NKI guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html#profile-a-nki-kernel
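
For anyone scripting the capture, here is a minimal sketch of assembling that command from Python. The --profile-nth-exec=2 argument comes from the reply above; the -n (NEFF input) and -s (NTFF output) options reflect my reading of the linked NKI guide and should be checked against the help output of your installed neuron-profile. The file names and the build_capture_command helper are hypothetical.

```python
import shlex


def build_capture_command(neff_path, ntff_path="profile.ntff", nth_exec=2):
    """Build an argv list for `neuron-profile capture`.

    Assumption: -n names the input NEFF and -s the output NTFF, as read from
    the NKI profiling guide. The paths here are placeholders, not real files.
    """
    return [
        "neuron-profile",
        "capture",
        "-n", neff_path,
        "-s", ntff_path,
        # Profile the second execution so first-run start-up is excluded.
        f"--profile-nth-exec={nth_exec}",
    ]


cmd = build_capture_command("file.neff")  # hypothetical NEFF path
print(shlex.join(cmd))
# On a Neuron instance, the list could be passed to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell-quoting issues if the NEFF path contains spaces.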