Open sushantMoon opened 9 months ago
Which Amazon Machine Image are you using to launch your EC2 instance? I could only run the code using the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230912.
I used base Ubuntu 22.04 LTS and installed everything else as per the documentation here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22
@sumaiyah were you able to generate sequences from the model?
I did manage to generate sequences using this model.
I followed this guide with this AMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230912.
I then ran
pip install sentencepiece transformers transformers-neuronx
and then ran the notebook code.
Hi @sumaiyah, I recreated the instance and followed the steps in the link you shared, but I still could not get the model running. This is the error:
{
"name": "RuntimeError",
"message": "Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.neff', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2023-09-19T17:23:52Z Too many instructions after unroll for function sg0000 !
",
"stack": "---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
\"\"\"
Traceback (most recent call last):
File \"/usr/lib/python3.8/concurrent/futures/process.py\", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py\", line 411, in compile
self.build(tag=tag)
File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py\", line 418, in build
self.neff_bytes = compile_hlo_module(self.hlo_module, tag)
File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py\", line 95, in compile_hlo_module
neff_bytes = neuron_xla_compile(module_bytes, flags, input_format=\"hlo\", platform_target=\"trn1\",
File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/__init__.py\", line 38, in neuron_xla_compile
_neuron_cc_wrapper.neuron_xla_compile(
File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 267, in neuron_xla_compile
ret = compile_with_cache(output, compile_cache, cache_key, execution_mode,
File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 201, in compile_with_cache
raise(e)
File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 181, in compile_with_cache
ret = call_neuron_compiler(
File \"/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/neuron_cc_wrapper.py\", line 129, in call_neuron_compiler
raise RuntimeError(f\"Failed compilation with {cmd}: {res.stderr.decode()}\")
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.neff', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2023-09-19T17:23:52Z Too many instructions after unroll for function sg0000 !
\"\"\"
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
/home/ubuntu/llama-2/01-llama-2-7b-chat-neuronx.ipynb Cell 7 line 1
9 # load meta-llama/Llama-2-7b-chat to the NeuronCores with 2-way tensor parallelism and run compilation
10 neuron_model = LlamaForSampling.from_pretrained('llama-2-7b-chat-hf-chunked', batch_size=1, tp_degree=2, amp='f16')
---> 11 neuron_model.to_neuron()
File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/llama/model.py:122, in LlamaForSampling.to_neuron(self)
120 self.decoder_lm_head_for_context = {}
121 for context_length_estimate in self.context_buckets:
--> 122 model = self.decoder_lm_head.build_weight_shared(
123 n_positions_list=[context_length_estimate],
124 n_active_tokens=context_length_estimate,
125 unroll=self.context_unroll,
126 share_caches=True,
127 )
128 # PERF: No latency improvement seen in multi-layer models from executor
129 if self.context_unroll == self.config.num_hidden_layers:
File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py:163, in DecoderLmHeadForSamplingNoEmbedding.build_weight_shared(self, n_positions_list, n_active_tokens, batch_size, unroll, share_caches)
161 ln_lm_head_params.append(new.lm_head_bias)
162 new.program = new._build_program()
--> 163 new.program.setup(new.layers, ln_lm_head_params)
164 return new
File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py:1029, in DecoderProgramFullyUnrolled.setup(self, layers, ln_lm_head_params)
1028 def setup(self, layers, ln_lm_head_params):
-> 1029 super().setup(layers, ln_lm_head_params)
1030 for npos, memory in zip(self.n_positions_list, self.memories):
1031 input_tensors = [*self.input_buffers]
File /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py:919, in DecoderProgram.setup(self, layers, ln_lm_head_params, io_ring_cache_size)
917 neff_bytes_futures.append(future)
918 for kernel, future in zip(self.kernels, neff_bytes_futures):
--> 919 kernel.neff_bytes = future.result()
921 for kernel in self.kernels:
922 kernel.load(io_ring_cache_size)
File /usr/lib/python3.8/concurrent/futures/_base.py:444, in Future.result(self, timeout)
442 raise CancelledError()
443 elif self._state == FINISHED:
--> 444 return self.__get_result()
445 else:
446 raise TimeoutError()
File /usr/lib/python3.8/concurrent/futures/_base.py:389, in Future.__get_result(self)
387 if self._exception:
388 try:
--> 389 raise self._exception
390 finally:
391 # Break a reference cycle with the exception in self._exception
392 self = None
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/b306f2a9-16a6-468b-851f-4cfa27068af0/model.MODULE_cd5e0485cb697fcd2bf8+a1fe2fe5.neff', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2023-09-19T17:23:52Z Too many instructions after unroll for function sg0000 !
"
}
I tried setting it up again and it looks like I get this error too. However, when I run the next lines in the notebook to generate a sequence, they work (despite the runtime error), which I believe is a separate issue to be raised.
Did you try running all the way down to the neuron_model.sample even after the error?
Thank you for creating the issue. We are able to reproduce it with the latest release. We are working on a fix and will send out an update once we have a solution.
Update: We are working on a fix and should have it in an upcoming release. In the meantime, you can unblock yourself by setting os.environ["NEURON_CC_FLAGS"] = "-O1".
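As a rough illustration of where that workaround would go, here is a minimal sketch. It assumes the flag has to be in the environment before `to_neuron()` triggers compilation, and that `NEURON_CC_FLAGS` holds space-separated compiler flags. `add_neuron_cc_flag` and `load_model` are hypothetical helper names, not part of any Neuron library; the model path and arguments are copied from the traceback above.

```python
import os


def add_neuron_cc_flag(flag: str) -> str:
    """Append `flag` to NEURON_CC_FLAGS unless it is already present.

    Hypothetical helper; assumes the variable is a space-separated flag list.
    """
    current = os.environ.get("NEURON_CC_FLAGS", "").split()
    if flag not in current:
        current.append(flag)
    os.environ["NEURON_CC_FLAGS"] = " ".join(current)
    return os.environ["NEURON_CC_FLAGS"]


# Set the flag first, so it is visible when compilation starts.
add_neuron_cc_flag("-O1")


def load_model():
    """Load and compile the model (requires an Inferentia2/Trainium instance)."""
    # Imported lazily so the helper above can be used on machines
    # where transformers-neuronx is not installed.
    from transformers_neuronx.llama.model import LlamaForSampling

    neuron_model = LlamaForSampling.from_pretrained(
        "llama-2-7b-chat-hf-chunked", batch_size=1, tp_degree=2, amp="f16"
    )
    neuron_model.to_neuron()  # compilation happens here
    return neuron_model
```

Appending instead of overwriting keeps any flags that were already exported in the shell; a plain assignment as in the comment above works just as well if nothing else is set.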
Hi sushantMoon, were you able to try the suggested "-O1" compiler option to see if that unblocked you?
When I try to load the model for inference following the steps given in the reference notebook (https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb), I receive the following error:
Also, the reference notebook uses tp_degree=24 on an inf2.48xlarge, which has 384 GB of accelerator memory. Since I am working with an inf2.8xlarge, which has 32 GB of accelerator memory, I used tp_degree=2.
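For what it's worth, the arithmetic behind that choice can be sketched as below. The instance memory figures and tp_degree values are the ones quoted above; the parameter counts (7e9 and 13e9) are rough assumptions based on the model names, 2 bytes per parameter assumes `amp='f16'`, and the estimate ignores the KV cache and compiler scratch space, so treat it as a back-of-the-envelope check only. Both function names are made up for this example.

```python
def memory_per_shard_gb(total_accelerator_gb: float, tp_degree: int) -> float:
    """Accelerator memory available to each tensor-parallel shard."""
    return total_accelerator_gb / tp_degree


def weights_per_shard_gb(n_params: float, bytes_per_param: int, tp_degree: int) -> float:
    """Approximate weight footprint per shard (weights only, no KV cache)."""
    return n_params * bytes_per_param / tp_degree / 1e9


# (instance, total accelerator memory in GB, tp_degree) as quoted in this thread
for name, total_gb, tp in [("inf2.48xlarge", 384, 24), ("inf2.8xlarge", 32, 2)]:
    print(f"{name}: {memory_per_shard_gb(total_gb, tp):.1f} GB per shard")

# Assumed parameter counts: ~7e9 for Llama-2-7b, ~13e9 for Llama-2-13b
for label, n_params in [("7b", 7e9), ("13b", 13e9)]:
    gb = weights_per_shard_gb(n_params, bytes_per_param=2, tp_degree=2)
    print(f"Llama-2-{label} in f16 with tp_degree=2: ~{gb:.1f} GB of weights per shard")
```

Both configurations leave the same amount of memory per shard, so tp_degree=2 on the smaller instance scales the reference setup down proportionally; whether a given model fits still depends on its size.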
I have the following versions of the dependencies installed: