aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Neuron compilation failed for Llama-2 70B with Optimum-neuron 0.0.17 within TGI-Neuron DLC #825

Open Neo9061 opened 9 months ago

Neo9061 commented 9 months ago

I started an inf2.48xlarge EC2 instance, pulled and entered the TGI-Neuron DLC with optimum-neuron 0.0.17 installed, and ran the following code.

from optimum.neuron import NeuronModelForCausalLM
compiler_args = {"num_cores": 24, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 8, "sequence_length": 2048}
model = NeuronModelForCausalLM.from_pretrained("70B", export=True, **compiler_args, **input_shapes)

It fails with an error after more than an hour of compilation. Can anyone advise? Many thanks!

>>> compiler_args = {"num_cores": 24, "auto_cast_type": 'fp16'}
>>> input_shapes = {"batch_size": 8, "sequence_length": 2048}
>>> model = NeuronModelForCausalLM.from_pretrained("neuron-2-16/70B", export=True, **compiler_args, **input_shapes)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:20<00:00,  1.35s/it]
/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py:150: UserWarning: KV head replication will be enabled since the number of KV heads (8) is not evenly divisible by the tensor parallel degree (24)
  warnings.warn(
2024-01-26 04:05:48.000121:  197  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-26 04:05:48.000276:  198  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-26 04:05:48.000404:  199  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.54.0+f631c2365/MODULE_7596bede63ad7e9fce10+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_6c82821df4803278d7ff+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_7596bede63ad7e9fce10+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:48.000556:  200  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.54.0+f631c2365/MODULE_6c82821df4803278d7ff+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_7596bede63ad7e9fce10+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:48.000573:  198  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/3360dbd5-e1a7-4631-942f-74900aa9137c/model.MODULE_7596bede63ad7e9fce10+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/3360dbd5-e1a7-4631-942f-74900aa9137c/model.MODULE_7596bede63ad7e9fce10+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.54.0+f631c2365/MODULE_e840d65ecf77b5e70412+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_6c82821df4803278d7ff+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:48.000592:  197  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/6fa34602-97b0-4317-9072-96283fdea504/model.MODULE_6c82821df4803278d7ff+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/6fa34602-97b0-4317-9072-96283fdea504/model.MODULE_6c82821df4803278d7ff+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.54.0+f631c2365/MODULE_e840d65ecf77b5e70412+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_e840d65ecf77b5e70412+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:48.000630:  199  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/ddc0f9e1-5af0-41e7-835e-b94e35125726/model.MODULE_e840d65ecf77b5e70412+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/ddc0f9e1-5af0-41e7-835e-b94e35125726/model.MODULE_e840d65ecf77b5e70412+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.54.0+f631c2365/MODULE_cd73c69136bf675613f6+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_cd73c69136bf675613f6+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:48.000772:  201  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.54.0+f631c2365/MODULE_cd73c69136bf675613f6+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:48.000782:  200  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/e8a51bad-9051-41de-a91c-76abb33928a4/model.MODULE_cd73c69136bf675613f6+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/e8a51bad-9051-41de-a91c-76abb33928a4/model.MODULE_cd73c69136bf675613f6+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.54.0+f631c2365/MODULE_de2a864004f00393b3df+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_de2a864004f00393b3df+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_de2a864004f00393b3df+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:49.000011:  201  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/312c9a76-2fd7-4b53-bab3-e9bd16ca22f0/model.MODULE_de2a864004f00393b3df+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/312c9a76-2fd7-4b53-bab3-e9bd16ca22f0/model.MODULE_de2a864004f00393b3df+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-01-26 04:05:49.000078:  202  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.54.0+f631c2365/MODULE_bea9a614938782ef60f1+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:49.000284:  203  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.54.0+f631c2365/MODULE_bea9a614938782ef60f1+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_bea9a614938782ef60f1+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:49.000323:  202  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/1e459a91-0133-4a38-888c-e8cff0e27bf0/model.MODULE_bea9a614938782ef60f1+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/1e459a91-0133-4a38-888c-e8cff0e27bf0/model.MODULE_bea9a614938782ef60f1+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-01-26 04:05:49.000454:  204  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.54.0+f631c2365/MODULE_19b06a4a2e3ec497765b+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_19b06a4a2e3ec497765b+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_19b06a4a2e3ec497765b+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:49.000542:  203  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/3cdae8e4-94ff-4356-9cf2-30ba470354dd/model.MODULE_19b06a4a2e3ec497765b+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/3cdae8e4-94ff-4356-9cf2-30ba470354dd/model.MODULE_19b06a4a2e3ec497765b+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.54.0+f631c2365/MODULE_92936f8dd5a3b7722d78+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:49.000759:  205  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-26 04:05:49.000831:  206  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
neuronxcc-2.12.54.0+f631c2365/MODULE_92936f8dd5a3b7722d78+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_92936f8dd5a3b7722d78+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:49.000891:  204  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/af126343-bda8-4a29-94b3-93208be001d3/model.MODULE_92936f8dd5a3b7722d78+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/af126343-bda8-4a29-94b3-93208be001d3/model.MODULE_92936f8dd5a3b7722d78+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.54.0+f631c2365/MODULE_10dc872570084745ca8e+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_10dc872570084745ca8e+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_aa0490bcda760c9f79d3+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_10dc872570084745ca8e+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:50.000020:  205  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/02fe8756-a650-4ff7-805c-714d3c480dfa/model.MODULE_10dc872570084745ca8e+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/02fe8756-a650-4ff7-805c-714d3c480dfa/model.MODULE_10dc872570084745ca8e+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
neuronxcc-2.12.54.0+f631c2365/MODULE_aa0490bcda760c9f79d3+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.54.0+f631c2365/MODULE_aa0490bcda760c9f79d3+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
2024-01-26 04:05:50.000066:  206  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/8525b53a-a081-4dbb-9e47-16fd2abf12fb/model.MODULE_aa0490bcda760c9f79d3+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/8525b53a-a081-4dbb-9e47-16fd2abf12fb/model.MODULE_aa0490bcda760c9f79d3+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
........................................................................................................................................................................................................................................................................................
Compiler status PASS

Compiler status PASS
........
Compiler status PASS

Compiler status PASS
............
Compiler status PASS
...............
2024-01-26 04:17:05.000953:  199  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/no-user/neuroncc_compile_workdir/ddc0f9e1-5af0-41e7-835e-b94e35125726/model.MODULE_e840d65ecf77b5e70412+2c2d707e.hlo.pb after 0 retries.
................
Compiler status PASS

2024-01-26 04:18:32.000969:  201  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/no-user/neuroncc_compile_workdir/312c9a76-2fd7-4b53-bab3-e9bd16ca22f0/model.MODULE_de2a864004f00393b3df+2c2d707e.hlo.pb after 0 retries.
........................
Compiler status PASS
........................................................................................................                      .........................................................................................................

...........................................................................
Compiler status PASS
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 430, in compile
    self.build(tag=tag)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 437, in build
    self.neff_bytes = compile_hlo_module(self.hlo_module, tag)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 115, in compile_hlo_module
    neff_bytes = neuron_xla_compile(module_bytes, flags, input_format="hlo", platform_target="trn1",
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/__init__.py", line 38, in neuron_xla_compile
    _neuron_cc_wrapper.neuron_xla_compile(
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/neuron_cc_wrapper.py", line 266, in neuron_xla_compile
    ret = compile_with_cache(output, compile_cache, cache_key, execution_mode,
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/neuron_cc_wrapper.py", line 199, in compile_with_cache
    raise(e)
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/neuron_cc_wrapper.py", line 178, in compile_with_cache
    ret = call_neuron_compiler(
  File "/usr/local/lib/python3.10/dist-packages/libneuronxla/neuron_cc_wrapper.py", line 126, in call_neuron_compiler
    raise RuntimeError(f"Failed compilation with {cmd}: {res.stderr.decode()}")
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/ddc0f9e1-5af0-41e7-835e-b94e35125726/model.MODULE_e840d65ecf77b5e70412+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/ddc0f9e1-5af0-41e7-835e-b94e35125726/model.MODULE_e840d65ecf77b5e70412+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']: 2024-01-26T04:17:00Z [LUR015]  Compiler generated too many instructions (22498757). This maybe due to a failure in parallelism extraction by the tensorizer. - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/optimum/modeling_base.py", line 372, in from_pretrained
    return from_pretrained_method(
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 230, in _from_transformers
    return cls(config, checkpoint_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling.py", line 656, in __init__
    super().__init__(config, checkpoint_dir, compiled_dir=compiled_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 128, in __init__
    neuronx_model.to_neuron()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 64, in to_neuron
    self.compile()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 51, in compile
    kernel.neff_bytes = neff_bytes_futures[hash_hlo(kernel.hlo_module)].result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/ddc0f9e1-5af0-41e7-835e-b94e35125726/model.MODULE_e840d65ecf77b5e70412+2c2d707e.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/ddc0f9e1-5af0-41e7-835e-b94e35125726/model.MODULE_e840d65ecf77b5e70412+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']: 2024-01-26T04:17:00Z [LUR015]  Compiler generated too many instructions (22498757). This maybe due to a failure in parallelism extraction by the tensorizer. - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new
dacorvo commented 9 months ago

This is a compilation error, so technically it is not optimum-neuron related. However, I suggest trying a lower batch size: even with batch size 6 and Llama 2 13B I get OOM errors. Start with batch_size 1, then try 2, and finally 4.
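
The stepwise approach suggested here (batch size 1, then 2, then 4) could be sketched roughly as follows. This is only an illustration: `largest_compilable_batch_size` and `compile_llama` are hypothetical helper names, and the export arguments are copied from the original report. It assumes a failed compilation surfaces as a `RuntimeError`, as it does in the traceback above.

```python
from typing import Callable, Iterable, Optional


def largest_compilable_batch_size(
    compile_fn: Callable[[int], object],
    candidates: Iterable[int] = (1, 2, 4),
) -> Optional[int]:
    """Try each batch size in order and return the last one that compiled."""
    best = None
    for batch_size in candidates:
        try:
            compile_fn(batch_size)
        except RuntimeError as err:
            # Assumption: compiler failures are raised as RuntimeError,
            # matching the traceback in the original report.
            print(f"batch_size={batch_size} failed: {str(err)[:120]}")
            break
        best = batch_size
    return best


def compile_llama(batch_size: int):
    """Export with the settings from the report, varying only batch_size."""
    # Imported lazily so the helper above can be tried without Neuron hardware.
    from optimum.neuron import NeuronModelForCausalLM

    return NeuronModelForCausalLM.from_pretrained(
        "70B",  # checkpoint path as written in the original report
        export=True,
        num_cores=24,
        auto_cast_type="fp16",
        batch_size=batch_size,
        sequence_length=2048,
    )


# On an inf2 instance one might then call:
#   largest_compilable_batch_size(compile_llama)
```

Each attempt is a full compilation, so the sweep can take a long time for a 70B model. Stopping at the first failure keeps the total cost bounded.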

jluntamazon commented 9 months ago

@dacorvo Thanks for suggesting smaller batch sizes!

@Neo9061 We will look at the specific sizing issue here to resolve the compilation error. This will likely only be available in a future release, so for now using a smaller batch size may be the best option.

One alternative is the -O1 flag (a synonym of --optlevel 1), which often lets you compile larger models at some cost in latency (see the Compiler Option Reference). There was a known issue with prior versions of Llama, but it appears to also affect this configuration.
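
One way the -O1 flag might be passed through is sketched below. It assumes the compilation stack picks up extra neuronx-cc flags from the `NEURON_CC_FLAGS` environment variable. That is an assumption to verify against the installed transformers-neuronx version, not something this thread confirms. `add_optlevel` is a hypothetical helper.

```python
import os


def add_optlevel(flags: str, level: int = 1) -> str:
    """Append -O<level> unless an optimization level is already present."""
    tokens = flags.split()
    if any(t.startswith("-O") or t.startswith("--optlevel") for t in tokens):
        return flags  # respect an explicitly chosen level
    return " ".join(tokens + [f"-O{level}"])


# Assumption: set this before the model is exported, so that the compiler
# invocation sees it. Whether the variable is honored depends on the stack.
os.environ["NEURON_CC_FLAGS"] = add_optlevel(os.environ.get("NEURON_CC_FLAGS", ""))
```

If the flag is picked up, it should show up in the `Call compiler with cmd` lines of the log, which is a quick way to check before waiting for the full compilation.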

jyang-aws commented 9 months ago

@Neo9061 does the -O1 compilation flag help with your issue?