Running llama-2 7b Chat on inf2.8xlarge machine

sushantMoon commented 9 months ago

I am trying to load the model for inference by following the steps given in the reference notebook: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

import os
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"

# load meta-llama/Llama-2-7b-chat to the NeuronCores with 2-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('llama-2-7b-chat-hf-chunked', batch_size=1, tp_degree=2, amp='f16')
neuron_model.to_neuron()

I receive the following error:

{
    "name": "RuntimeError",
    "message": "Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !
",
    "stack": "---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
\"\"\"
Traceback (most recent call last):
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/process.py\", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 411, in compile
    self.build(tag=tag)
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 418, in build
    self.neff_bytes = compile_hlo_module(self.hlo_module, tag)
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/compiler.py\", line 95, in compile_hlo_module
    neff_bytes = neuron_xla_compile(module_bytes, flags, input_format=\"hlo\", platform_target=\"trn1\",
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/__init__.py\", line 38, in neuron_xla_compile
    _neuron_cc_wrapper.neuron_xla_compile(
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 234, in neuron_xla_compile
    done = check_neff(compile_cache, neff_path,
  File \"/home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 77, in check_neff
    raise(RuntimeError(error_log))
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !

\"\"\"

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/home/ubuntu/llama_2-inf2/01-llama-2-7b-chat-neuronx.ipynb Cell 7 line 1
      9 # load meta-llama/Llama-2-7b-chat to the NeuronCores with 2-way tensor parallelism and run compilation
     10 neuron_model = LlamaForSampling.from_pretrained('llama-2-7b-chat-hf-chunked', batch_size=1, tp_degree=2, amp='f16')
---> 11 neuron_model.to_neuron()

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:122, in LlamaForSampling.to_neuron(self)
    120 self.decoder_lm_head_for_context = {}
    121 for context_length_estimate in self.context_buckets:
--> 122     model = self.decoder_lm_head.build_weight_shared(
    123         n_positions_list=[context_length_estimate],
    124         n_active_tokens=context_length_estimate,
    125         unroll=self.context_unroll,
    126         share_caches=True,
    127     )
    128     # PERF: No latency improvement seen in multi-layer models from executor
    129     if self.context_unroll == self.config.num_hidden_layers:

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:163, in DecoderLmHeadForSamplingNoEmbedding.build_weight_shared(self, n_positions_list, n_active_tokens, batch_size, unroll, share_caches)
    161     ln_lm_head_params.append(new.lm_head_bias)
    162 new.program = new._build_program()
--> 163 new.program.setup(new.layers, ln_lm_head_params)
    164 return new

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:1029, in DecoderProgramFullyUnrolled.setup(self, layers, ln_lm_head_params)
   1028 def setup(self, layers, ln_lm_head_params):
-> 1029     super().setup(layers, ln_lm_head_params)
   1030     for npos, memory in zip(self.n_positions_list, self.memories):
   1031         input_tensors = [*self.input_buffers]

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages/transformers_neuronx/decoder.py:919, in DecoderProgram.setup(self, layers, ln_lm_head_params, io_ring_cache_size)
    917         neff_bytes_futures.append(future)
    918     for kernel, future in zip(self.kernels, neff_bytes_futures):
--> 919         kernel.neff_bytes = future.result()
    921 for kernel in self.kernels:
    922     kernel.load(io_ring_cache_size)

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
    456     raise CancelledError()
    457 elif self._state == FINISHED:
--> 458     return self.__get_result()
    459 else:
    460     raise TimeoutError()

File ~/miniconda3/envs/torch_neuronx_2140/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/1876f897-9e62-4653-bef8-2caa4237adbc/model.MODULE_cd5e0485cb697fcd2bf8+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-09-19T09:36:54Z Too many instructions after unroll for function sg0000 !
"
}

Also, the reference notebook uses tp_degree=24 on inf2.48xlarge, which has 384 GB of accelerator memory. Since I am working with inf2.8xlarge, which has 32 GB of accelerator memory, I used tp_degree=2.
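As a rough sanity check on that choice, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate only: the parameter count (about 7 billion) is an assumed round number, the helper names are hypothetical, and the calculation covers weights only, ignoring the KV cache, activations, and runtime overhead. The 2 bytes per parameter follows from amp='f16', and the 32 GB figure is the one quoted above.

```python
# Rough weight-memory estimate for Llama-2 7B in fp16.
# All inputs are approximations; see the assumptions in the text above.
PARAMS = 7e9                 # assumed: roughly 7 billion parameters
BYTES_PER_PARAM = 2          # amp='f16' stores each weight in 2 bytes
ACCELERATOR_MEMORY_GB = 32   # inf2.8xlarge, as quoted in this thread
TP_DEGREE = 2                # value used in this thread


def weight_gb(params: float, bytes_per_param: int) -> float:
    """Total weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def per_shard_gb(total_gb: float, tp_degree: int) -> float:
    """Weights held by each tensor-parallel shard, assuming an even split."""
    return total_gb / tp_degree


total = weight_gb(PARAMS, BYTES_PER_PARAM)
shard = per_shard_gb(total, TP_DEGREE)
print(f"weights: ~{total:.0f} GB total, ~{shard:.0f} GB per shard")
print(f"weights fit in {ACCELERATOR_MEMORY_GB} GB: {total < ACCELERATOR_MEMORY_GB}")
```

Under these assumptions the weights alone leave headroom within 32 GB, which suggests the compilation failure below is not simply a memory-capacity problem.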

I have the following versions of the dependencies installed:

Requirement already satisfied: neuronx-cc==2.* in /home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages (2.10.0.34+6c8792c6f)
Requirement already satisfied: transformers-neuronx in /home/ubuntu/miniconda3/envs/torch_neuronx_2140/lib/python3.10/site-packages (0.7.84)
aws-neuronx-dkms is already the newest version (2.13.4.0).
aws-neuronx-collectives is already the newest version (2.17.9.0-fb6d14044).
aws-neuronx-runtime-lib is already the newest version (2.17.7.0-df62e3f70).
aws-neuronx-tools is already the newest version (2.14.6.0).
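For anyone comparing environments, the pip-installed versions can also be collected from inside the Python environment in use. The following is a minimal sketch using only the standard library; the helper name is hypothetical, and including torch-neuronx in the list is an assumption, since it does not appear in the output above. The aws-neuronx-* entries are system packages and are not covered by this approach.

```python
from importlib import metadata

# Package names to report. "torch-neuronx" is an assumed addition;
# the other two appear in the pip output above.
PACKAGES = ["neuronx-cc", "torch-neuronx", "transformers-neuronx"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```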

sumaiyah commented 9 months ago

Which Amazon Machine Image are you using to launch your EC2 instance? I could only run the code using the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230912 image.

sushantMoon commented 9 months ago

I used base Ubuntu 22.04 LTS and installed everything else as per the documentation here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22

@sumaiyah, were you able to generate sequences from the model?

sumaiyah commented 9 months ago

I did manage to generate sequences using this model.

I followed this guide:

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-pytorch-dlami.html

with this AMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230912.

I then ran

pip install sentencepiece transformers transformers-neuronx

and then ran the notebook code.

sushantMoon commented 9 months ago

Hi @sumaiyah, I recreated the instance and followed the steps in the link you shared, but I still could not get the model running. The error is as follows:

{
    "name": "RuntimeError",
    "message": "Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.neff', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2023-09-19T17:23:52Z Too many instructions after unroll for function sg0000 !
",
    "stack": "---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
\"\"\"
Traceback (most recent call last):
  File \"/usr/lib/python3.8/concurrent/futures/process.py\", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py\", line 411, in compile
    self.build(tag=tag)
  File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py\", line 418, in build
    self.neff_bytes = compile_hlo_module(self.hlo_module, tag)
  File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py\", line 95, in compile_hlo_module
    neff_bytes = neuron_xla_compile(module_bytes, flags, input_format=\"hlo\", platform_target=\"trn1\",
  File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/__init__.py\", line 38, in neuron_xla_compile
    _neuron_cc_wrapper.neuron_xla_compile(
  File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 267, in neuron_xla_compile
    ret = compile_with_cache(output, compile_cache, cache_key, execution_mode,
  File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 201, in compile_with_cache
    raise(e)
  File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 181, in compile_with_cache
    ret = call_neuron_compiler(
  File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 129, in call_neuron_compiler
    raise RuntimeError(f\"Failed compilation with {cmd}: {res.stderr.decode()}\")
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.neff', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2023-09-19T17:23:52Z Too many instructions after unroll for function sg0000 !

\"\"\"

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/home/ubuntu/llama-2/01-llama-2-7b-chat-neuronx.ipynb Cell 7 line 1
      9 # load meta-llama/Llama-2-7b-chat to the NeuronCores with 2-way tensor parallelism and run compilation
     10 neuron_model = LlamaForSampling.from_pretrained('llama-2-7b-chat-hf-chunked', batch_size=1, tp_degree=2, amp='f16')
---> 11 neuron_model.to_neuron()

File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/llama/model.py:122, in LlamaForSampling.to_neuron(self)
    120 self.decoder_lm_head_for_context = {}
    121 for context_length_estimate in self.context_buckets:
--> 122     model = self.decoder_lm_head.build_weight_shared(
    123         n_positions_list=[context_length_estimate],
    124         n_active_tokens=context_length_estimate,
    125         unroll=self.context_unroll,
    126         share_caches=True,
    127     )
    128     # PERF: No latency improvement seen in multi-layer models from executor
    129     if self.context_unroll == self.config.num_hidden_layers:

File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py:163, in DecoderLmHeadForSamplingNoEmbedding.build_weight_shared(self, n_positions_list, n_active_tokens, batch_size, unroll, share_caches)
    161     ln_lm_head_params.append(new.lm_head_bias)
    162 new.program = new._build_program()
--> 163 new.program.setup(new.layers, ln_lm_head_params)
    164 return new

File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py:1029, in DecoderProgramFullyUnrolled.setup(self, layers, ln_lm_head_params)
   1028 def setup(self, layers, ln_lm_head_params):
-> 1029     super().setup(layers, ln_lm_head_params)
   1030     for npos, memory in zip(self.n_positions_list, self.memories):
   1031         input_tensors = [*self.input_buffers]

File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py:919, in DecoderProgram.setup(self, layers, ln_lm_head_params, io_ring_cache_size)
    917         neff_bytes_futures.append(future)
    918     for kernel, future in zip(self.kernels, neff_bytes_futures):
--> 919         kernel.neff_bytes = future.result()
    921 for kernel in self.kernels:
    922     kernel.load(io_ring_cache_size)

File /usr/lib/python3.8/concurrent/futures/_base.py:444, in Future.result(self, timeout)
    442     raise CancelledError()
    443 elif self._state == FINISHED:
--> 444     return self.__get_result()
    445 else:
    446     raise TimeoutError()

File /usr/lib/python3.8/concurrent/futures/_base.py:389, in Future.__get_result(self)
    387 if self._exception:
    388     try:
--> 389         raise self._exception
    390     finally:
    391         # Break a reference cycle with the exception in self._exception
    392         self = None

RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.neff', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2023-09-19T17:23:52Z Too many instructions after unroll for function sg0000 !
"
}

sumaiyah commented 9 months ago

I tried setting it up again and it looks like I get this error too. However, when I run the next lines in the notebook to generate a sequence, they work (despite the runtime error), which I believe is a separate issue to be raised.

Did you try running all the way down to the neuron_model.sample even after the error?

aws-rhsoln commented 9 months ago

Thank you for creating the issue. We are able to reproduce it with the latest release. We are working on a fix and will send out an update once we have a solution.

aws-rhsoln commented 9 months ago

Update: We are working on a fix and should have it in the upcoming releases. However, you can unblock yourself by using os.environ["NEURON_CC_FLAGS"] = "-O1".
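Applied to the snippet from the original report, the suggested workaround might look like the sketch below. The function name, the separate tokenizer_name argument, and sequence_length=256 are illustrative assumptions, not values from this thread. Note that assigning "-O1" replaces the earlier "--model-type=transformer-inference" value, as written in the suggestion. The Neuron imports are deferred into the function so that the flag is set first and the file can still be imported on a machine without the Neuron SDK.

```python
import os

# Workaround suggested above: set the flag before the model is compiled.
# This replaces any earlier NEURON_CC_FLAGS value.
os.environ["NEURON_CC_FLAGS"] = "-O1"


def compile_and_sample(model_dir, tokenizer_name, prompt, tp_degree=2):
    """Compile the model and generate text; requires a Neuron instance.

    model_dir and tokenizer_name are placeholders supplied by the caller.
    """
    # Deferred imports: only needed when actually running on inf2/trn1.
    import torch
    from transformers import AutoTokenizer
    from transformers_neuronx.llama.model import LlamaForSampling

    neuron_model = LlamaForSampling.from_pretrained(
        model_dir, batch_size=1, tp_degree=tp_degree, amp="f16"
    )
    neuron_model.to_neuron()  # compilation happens here

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.inference_mode():
        # sequence_length=256 is an assumed, deliberately short value.
        generated = neuron_model.sample(input_ids, sequence_length=256, top_k=50)
    return [tokenizer.decode(seq) for seq in generated]
```

On an inf2 instance this could be called as, for example, compile_and_sample('llama-2-7b-chat-hf-chunked', 'meta-llama/Llama-2-7b-chat-hf', 'Hello'), where the tokenizer name is again an assumption.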

aws-donkrets commented 8 months ago

Hi sushantMoon, were you able to try the suggested "-O1" compiler option to see if it unblocks you?

aws-neuron / aws-neuron-samples

Running llama-2 7b Chat on inf2.8xlarge machine #40