aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Llama-2-13b inference example failing on inf2 #749

Open Rajmehta123 opened 11 months ago

Rajmehta123 commented 11 months ago

I have an inf2.24xlarge instance and I am running the Llama-2 inference example, with all packages installed at their latest versions.

Everything worked fine until the step where I load the model with tp_degree=24, which failed.

import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

# load meta-llama/Llama-2-13b to the NeuronCores with 24-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('./split_llm', batch_size=1, tp_degree=24, amp='f16')
neuron_model.to_neuron()

It fails with the error below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 8
      6 # load meta-llama/Llama-2-13b to the NeuronCores with 24-way tensor parallelism and run compilation
      7 neuron_model = LlamaForSampling.from_pretrained('/home/ec2-user/.gccc/split_everythinglm', batch_size=1, tp_degree=24, amp='f16')
----> 8 neuron_model.to_neuron()

File ~/.conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:105, in LlamaForSampling.to_neuron(self)
    102     new_layer.add_parameter(mlp.up_proj.weight.T, sharding=1, allow_pad=True, allow_quantize=True)
    103     new_layer.add_parameter(mlp.down_proj.weight, sharding=1, allow_pad=True, allow_quantize=True, out_feature_dim=0)
--> 105     new_layer.to_neuron()
    106     layer.nullify()
    108 ln_f = self.chkpt_model.model.norm

File ~/.conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/decoder.py:579, in DecoderLayer.to_neuron(self)
    577 maybe_shard_along = maybe_manipulator.shard_along
    578 maybe_primary_only = maybe_manipulator.primary_only
--> 579 self.pre_attn_ln_weight = maybe_duplicate(self.pre_attn_ln_weight)
    580 self.pre_attn_ln_bias = maybe_duplicate(self.pre_attn_ln_bias)
    581 self.attn_q_weight = maybe_shard_along(self.attn_q_weight, dim=1)

File ~/.conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/decoder.py:856, in MaybeParallelTensorManipulator.duplicate(self, tensor)
    854 if tensor is None:
    855     return None
--> 856 return self.manipulator.duplicate(tensor)

File ~/.conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/parallel.py:95, in ParallelTensorManipulator.duplicate(self, tensor)
     94 def duplicate(self, tensor):
---> 95     return ops.parallel_to_nc([tensor for ordinal in range(self.tp_degree)])

File ~/.conda/envs/inf2/lib/python3.10/site-packages/transformers_neuronx/ops.py:49, in parallel_to_nc(tensors)
     48 def parallel_to_nc(tensors):
---> 49     return torch.ops.neuron._parallel_to_neuron(tensors)

File ~/.conda/envs/inf2/lib/python3.10/site-packages/torch/_ops.py:442, in OpOverloadPacket.__call__(self, *args, **kwargs)
    437 def __call__(self, *args, **kwargs):
    438     # overloading __call__ to ensure torch.ops.foo.bar()
    439     # is still callable from JIT
    440     # We save the function ptr as the `op` attribute on
    441     # OpOverloadPacket to access it here.
--> 442     return self._op(*args, **kwargs or {})

RuntimeError: nrt_tensor_allocate status=2
aws-rhsoln commented 11 months ago

Thank you for reporting the issue. It looks like you are trying to run with tp_degree=24, which would try to launch 24 workers, each holding a model shard. However, inf2.24xl has only 12 NeuronCores, so you can launch at most 12 Neuron workers; in other words, the maximum is tp_degree=12 on a 24xl instance. To use tp_degree=24, you would have to move to a larger instance (inf2.48xl).
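For example, the load step from the original report would become (same checkpoint path and batch size, only tp_degree changed):

from transformers_neuronx.llama.model import LlamaForSampling

# inf2.24xl exposes 12 NeuronCores, so tp_degree can be at most 12
neuron_model = LlamaForSampling.from_pretrained('./split_llm', batch_size=1, tp_degree=12, amp='f16')
neuron_model.to_neuron()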

Rajmehta123 commented 11 months ago

I got this error after changing it to 12:

RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/eeb5a092-48d6-4fe6-a3da-3a8ece2aea4f/model.MODULE_97cc0ec81fdef38a65e3+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/eeb5a092-48d6-4fe6-a3da-3a8ece2aea4f/model.MODULE_97cc0ec81fdef38a65e3+eefbc487.neff', '--model-type=transformer', '--verbose=35']: 2023-09-22T21:47:03Z Too many instructions after unroll for function sg0000 !

Rajmehta123 commented 11 months ago

Also, I am using this model, a derivative of Llama 2:

https://huggingface.co/totally-not-an-llm/EverythingLM-13b-V2-16k

tahsintahsin commented 11 months ago

Hello, the same is failing for me, with the same error message, pasted below. The setup I am using is:

output:
2023-09-25 09:54:18.000850: 3889 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 09:54:18.000851: 3888 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 09:54:18.000857: 3889 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/81a792b2-2457-4d1d-94af-00124ad24117/model.MODULE_6188ce6038394aa3675e+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/81a792b2-2457-4d1d-94af-00124ad24117/model.MODULE_6188ce6038394aa3675e+eefbc487.neff', '--model-type=transformer', '--verbose=35']
2023-09-25 09:54:18.000858: 3888 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/02d65116-57de-4670-823d-2611b7170e6c/model.MODULE_105c2c7d569bf93aafe3+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/02d65116-57de-4670-823d-2611b7170e6c/model.MODULE_105c2c7d569bf93aafe3+eefbc487.neff', '--model-type=transformer', '--verbose=35']
2023-09-25 09:54:18.000867: 3890 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 09:54:18.000871: 3890 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/cc93d3cd-f2e2-42c1-9f14-f15032ed1c57/model.MODULE_9a4305d83cac026e6197+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/cc93d3cd-f2e2-42c1-9f14-f15032ed1c57/model.MODULE_9a4305d83cac026e6197+eefbc487.neff', '--model-type=transformer', '--verbose=35']
2023-09-25 09:54:18.000904: 3891 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 09:54:18.000908: 3891 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/3db07a9f-1c43-4f11-8d29-1f07d16795d4/model.MODULE_5d75c0a28bdf9e1d723f+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/3db07a9f-1c43-4f11-8d29-1f07d16795d4/model.MODULE_5d75c0a28bdf9e1d723f+eefbc487.neff', '--model-type=transformer', '--verbose=35']
2023-09-25 09:54:18.000943: 3892 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 09:54:18.000948: 3892 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/b9383bcb-dffc-4660-840f-018762961cb8/model.MODULE_77482cd8489f2f78889f+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/b9383bcb-dffc-4660-840f-018762961cb8/model.MODULE_77482cd8489f2f78889f+eefbc487.neff', '--model-type=transformer', '--verbose=35']
.......................................................

Compiler status PASS

Compiler status PASS

Compiler status PASS

Compiler status PASS

Compiler status PASS
2023-09-25 09:58:05.000854: 5724 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 09:58:05.000864: 5724 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/b7136e1b-b38b-4473-871f-732c8bb448bf/model.MODULE_90dcf3ff4de7045473cb+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/b7136e1b-b38b-4473-871f-732c8bb448bf/model.MODULE_90dcf3ff4de7045473cb+eefbc487.neff', '--model-type=transformer', '--verbose=35']
........

Compiler status PASS
2023-09-25 10:00:34.000110: 6077 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 10:00:34.000120: 6077 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/931eb18f-0f0c-46ef-a9bf-989626396d75/model.MODULE_a86b2ce5e842b851c848+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/931eb18f-0f0c-46ef-a9bf-989626396d75/model.MODULE_a86b2ce5e842b851c848+eefbc487.neff', '--model-type=transformer', '--verbose=35']
......

Compiler status PASS
2023-09-25 10:02:26.000627: 6428 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 10:02:26.000640: 6428 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/31d5e5c9-18f6-4952-ab20-28ca3e244324/model.MODULE_303eb04c10df495dc2fe+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/31d5e5c9-18f6-4952-ab20-28ca3e244324/model.MODULE_303eb04c10df495dc2fe+eefbc487.neff', '--model-type=transformer', '--verbose=35']
......

Compiler status PASS
2023-09-25 10:04:23.000884: 6781 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 10:04:23.000895: 6781 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/75bc1f4b-7e68-4ee7-8ed3-bc1f2b3d5088/model.MODULE_7aca80adf8957810d8cd+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/75bc1f4b-7e68-4ee7-8ed3-bc1f2b3d5088/model.MODULE_7aca80adf8957810d8cd+eefbc487.neff', '--model-type=transformer', '--verbose=35']
......

Compiler status PASS
2023-09-25 10:06:25.000610: 7133 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2023-09-25 10:06:25.000621: 7133 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/fc406828-9040-47cd-b20b-cee6a02e9034/model.MODULE_5a7131302f23f3655de9+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/fc406828-9040-47cd-b20b-cee6a02e9034/model.MODULE_5a7131302f23f3655de9+eefbc487.neff', '--model-type=transformer', '--verbose=35']
...
2023-09-25 10:07:13.000980: 7133 ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/neuroncc_compile_workdir/fc406828-9040-47cd-b20b-cee6a02e9034/model.MODULE_5a7131302f23f3655de9+eefbc487.hlo.pb after 0 retries.


_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 411, in compile
    self.build(tag=tag)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 418, in build
    self.neff_bytes = compile_hlo_module(self.hlo_module, tag)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 95, in compile_hlo_module
    neff_bytes = neuron_xla_compile(module_bytes, flags, input_format="hlo", platform_target="trn1",
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/__init__.py", line 38, in neuron_xla_compile
    _neuron_cc_wrapper.neuron_xla_compile(
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/neuron_cc_wrapper.py", line 267, in neuron_xla_compile
    ret = compile_with_cache(output, compile_cache, cache_key, execution_mode,
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/neuron_cc_wrapper.py", line 201, in compile_with_cache
    raise(e)
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/neuron_cc_wrapper.py", line 181, in compile_with_cache
    ret = call_neuron_compiler(
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/neuron_cc_wrapper.py", line 129, in call_neuron_compiler
    raise RuntimeError(f"Failed compilation with {cmd}: {res.stderr.decode()}")
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/fc406828-9040-47cd-b20b-cee6a02e9034/model.MODULE_5a7131302f23f3655de9+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/fc406828-9040-47cd-b20b-cee6a02e9034/model.MODULE_5a7131302f23f3655de9+eefbc487.neff', '--model-type=transformer', '--verbose=35']: 2023-09-25T10:07:13Z Too many instructions after unroll for function sg0000 !
"""

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[5], line 1
----> 1 neuron_model.to_neuron()

File /usr/local/lib/python3.10/dist-packages/transformers_neuronx/llama/model.py:122, in LlamaForSampling.to_neuron(self)
    120 self.decoder_lm_head_for_context = {}
    121 for context_length_estimate in self.context_buckets:
--> 122     model = self.decoder_lm_head.build_weight_shared(
    123         n_positions_list=[context_length_estimate],
    124         n_active_tokens=context_length_estimate,
    125         unroll=self.context_unroll,
    126         share_caches=True,
    127     )
    128     # PERF: No latency improvement seen in multi-layer models from executor
    129     if self.context_unroll == self.config.num_hidden_layers:

File /usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py:163, in DecoderLmHeadForSamplingNoEmbedding.build_weight_shared(self, n_positions_list, n_active_tokens, batch_size, unroll, share_caches)
    161 ln_lm_head_params.append(new.lm_head_bias)
    162 new.program = new._build_program()
--> 163 new.program.setup(new.layers, ln_lm_head_params)
    164 return new

File /usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py:1029, in DecoderProgramFullyUnrolled.setup(self, layers, ln_lm_head_params)
   1028 def setup(self, layers, ln_lm_head_params):
-> 1029     super().setup(layers, ln_lm_head_params)
   1030     for npos, memory in zip(self.n_positions_list, self.memories):
   1031         input_tensors = [*self.input_buffers]

File /usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py:919, in DecoderProgram.setup(self, layers, ln_lm_head_params, io_ring_cache_size)
    917     neff_bytes_futures.append(future)
    918 for kernel, future in zip(self.kernels, neff_bytes_futures):
--> 919     kernel.neff_bytes = future.result()
    921 for kernel in self.kernels:
    922     kernel.load(io_ring_cache_size)

File /usr/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
    456     raise CancelledError()
    457 elif self._state == FINISHED:
--> 458     return self.__get_result()
    459 else:
    460     raise TimeoutError()

File /usr/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/fc406828-9040-47cd-b20b-cee6a02e9034/model.MODULE_5a7131302f23f3655de9+eefbc487.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/fc406828-9040-47cd-b20b-cee6a02e9034/model.MODULE_5a7131302f23f3655de9+eefbc487.neff', '--model-type=transformer', '--verbose=35']: 2023-09-25T10:07:13Z Too many instructions after unroll for function sg0000 !

aws-rhsoln commented 11 months ago

We have reproduced the problem with release 2.14 and will fix it in an upcoming release. To work around this issue for tp_degree=12 on inf2.24xl, please add os.environ["NEURON_CC_FLAGS"] = "-O1" to your script after the imports.
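For example, a minimal sketch reusing the load call from the original report (the flag must be set before to_neuron() triggers compilation):

import os
from transformers_neuronx.llama.model import LlamaForSampling

# Work-around for the "Too many instructions after unroll" compiler error:
# lower the compiler optimization level to -O1
os.environ["NEURON_CC_FLAGS"] = "-O1"

neuron_model = LlamaForSampling.from_pretrained('./split_llm', batch_size=1, tp_degree=12, amp='f16')
neuron_model.to_neuron()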

awsilya commented 11 months ago

@Rajmehta123 @tahsintahsin Could you please try the just-released 2.14.1?

tahsintahsin commented 11 months ago

Hello @awsilya, I just tried and got the same error again. I reinstalled the torch-neuronx environment from scratch as described in the documentation (which I think updates the packages). Afterwards I ran the Jupyter notebook with tp=24 and tp=12; neither worked. In both cases I got the same error again.

awsilya commented 11 months ago

@tahsintahsin strange, I successfully compiled llama-2 7b and 13b for batches 1, 2, and 4 using 2.14.1. Could you check the versions of the Neuron components installed on the instance and compare them to the list here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#neuron-2-14-1-09-26-2023 ?

$ pip list | grep neuron

BTW, batch 4 compilation takes a long time; it's a known issue that will be addressed in the next release.

tahsintahsin commented 11 months ago

I am working with batch 1 for now.

The list of packages that I get in my environment is :

aws-neuronx-runtime-discovery 2.9
libneuronxla 0.5.476
neuronx-hwm 2.10.0.5+7b1976adf
torch-neuronx 1.13.1.1.11.0
torch-xla 1.13.1+torchneuronb
transformers-neuronx 0.7.84
neuronx-cc 2.10.0.35+3817a0c8c

This seems the same as the list in the link you shared; however, there are extra entries in that list which don't appear in mine, and I don't know how to check them. Might they be causing the issue?

Another question I have: again in the link you shared, there is a warning about using the --optlevel 1 (or -O1) compiler flag.

However, as I am running the sample Jupyter notebook you shared, I am not using this flag. If I have to, how can I pass it in the notebook? And can you update the sample notebook accordingly?

Thanks again

tahsintahsin commented 11 months ago

Sorry, I just saw that you showed how to pass that flag; I am trying with that once more.

tahsintahsin commented 11 months ago

OK, it finally works with the compiler flag enabled :) Thank you so much! @awsilya

tahsintahsin commented 11 months ago

Also, I tried to save my model, but the usual save methods didn't work. I think we are supposed to use this?

# Save the compiled Neuron model
model_neuron._save_compiled_artifacts('gpt2-neuron')

# Load the Neuron model
model_neuron = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2, n_positions=256, amp='f32', unroll=None)
model_neuron._load_compiled_artifacts('gpt2-neuron')  # Load the compiled Neuron artifacts
model_neuron.to_neuron()  # Load the model weights but skip compilation

I found this in your documentation after searching the transformers examples; please let me know if it is the correct way.

aws-taylor commented 3 months ago

Hello @tahsintahsin,

Take a look at the profiling recommendations at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/torch-neuronx-profiling-dev-guide.html. That may give you some visibility into where the 50 sec is coming from and expose some optimization opportunities.

chafik-c commented 2 months ago

Hi @tahsintahsin, I think the original issue was resolved on 10/4/23. Do you need more info following the comment about the profiler? I'm checking whether this ticket can be closed; perhaps you could open a new one for any other issues?

chafik-c commented 2 months ago

I see the question about saving not working as well. I'll route it to one of our internal teams to get an answer.