aws-neuron / aws-neuron-sdk


Compilation failed for llama3-70B model - Estimated peak HBM usage (22.839451) exceeds 16GB. Neff won't be able to load on chip #884

Open ak-org opened 1 month ago

ak-org commented 1 month ago

I am trying to fine-tune llama3-70B on trn1.32xlarge using distributed training. It failed with the following error:

Container image: f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.2-ubuntu20.04",

model_id: "meta-llama/Meta-Llama-3-70B"

RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/0fe2f6bd-7622-4e70-840e-f46008169a5e/model.MODULE_13231525432959154856+55d6a20f.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/0fe2f6bd-7622-4e70-840e-f46008169a5e/model.MODULE_13231525432959154856+55d6a20f.neff', '--model-type=transformer', '--distribution-strategy=llm-training', '--enable-saturate-infinity', '-O2', '--model-type=transformer', '--verbose=35']: 2024-05-10T23:47:40Z [XCG815]  Estimated peak HBM usage (22.839451) exceeds 16GB. Neff won't be able to load on chip - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new
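For reference, the compiler flags visible in that failing command are the ones torch-neuronx forwards to neuronx-cc; they are usually supplied through the NEURON_CC_FLAGS environment variable before the first graph is compiled. A minimal sketch of that wiring (the flag set is copied from the log above, not a recommendation):

```python
import os

# torch-neuronx appends NEURON_CC_FLAGS to every neuronx-cc invocation,
# so this must be set before the first graph is traced and compiled.
# Flags copied from the failing command in the error above.
os.environ["NEURON_CC_FLAGS"] = (
    "--model-type=transformer "
    "--distribution-strategy=llm-training "
    "--enable-saturate-infinity "
    "-O2"
)
```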
mrnikwaws commented 1 month ago

Thank you for creating the issue. Can you share a bit more about your setup? To be specific, the instance count, batch size, sequence length, and TP degree you are using.

Note: If you are using llama3-70B with an 8K sequence length (as opposed to our 4K example), the activation memory goes up. That increase in activation memory results in higher scratch-pad usage (as shown in the warning). You can try a higher TP degree, for example by using two instances (though this will reduce performance). It may also be possible to reduce the batch size enough to fit on one instance.
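As an illustration of the knobs mentioned above, here is a minimal sketch using the optimum-neuron training arguments that appear in the traceback later in this thread. The exact argument names, in particular tensor_parallel_size, depend on the optimum-neuron version and are an assumption here, not the poster's actual configuration:

```python
from optimum.neuron import NeuronTrainingArguments

# Sketch only: trade per-device batch size for gradient accumulation to cut
# activation memory (and therefore the compiler's scratch-pad / HBM estimate),
# and spread the model over more NeuronCores with a higher TP degree.
training_args = NeuronTrainingArguments(
    output_dir="/opt/ml/output/data",
    per_device_train_batch_size=1,   # smaller micro-batch -> less activation memory
    gradient_accumulation_steps=8,   # keeps the effective global batch size
    tensor_parallel_size=32,         # name assumed; check your optimum-neuron version
    bf16=True,
)
```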

We are also working on an upcoming feature that should help with this in a future release, so please watch for announcements.

ak-org commented 1 month ago

Hi, I am using two trn1.32xlarge instances. I had a batch size of 8, a sequence length of 4096, and a TP degree of 32.

I will retry with a smaller batch size and report back the outcome.
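For context, a two-instance job like this is typically launched through the SageMaker PyTorch estimator with the Neuron DLC from the first comment. A minimal sketch under that assumption; the role, region, and hyperparameter names are placeholders and must match whatever train.py actually parses:

```python
from sagemaker.pytorch import PyTorch

region = "us-east-1"  # placeholder
estimator = PyTorch(
    entry_point="train.py",        # the script shown in the traceback below
    role="<execution-role-arn>",   # placeholder
    image_uri=(
        f"763104351884.dkr.ecr.{region}.amazonaws.com/"
        "pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.2-ubuntu20.04"
    ),
    instance_count=2,                  # two trn1.32xlarge instances
    instance_type="ml.trn1.32xlarge",
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={
        # Illustrative names only; they must match train.py's argument parser.
        "model_id": "meta-llama/Meta-Llama-3-70B",
        "per_device_train_batch_size": 8,
        "block_size": 4096,
        "tensor_parallel_size": 32,
    },
)
# estimator.fit({"train": "s3://<bucket>/<train-prefix>/"})
```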

ak-org commented 1 month ago

I tried smaller tp_degree and batch sizes, but the job still failed, this time while saving a checkpoint:

AlgorithmError: ExecuteUserScriptError: ExitCode 1
ErrorMessage "RuntimeError: Parent directory /opt/ml/output/data/checkpoint-1000/shards/model/dp_rank_00_tp_rank_26_pp_rank_00.pt.tensors does not exist.
Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 115, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1328, in train
    result = super().train(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 990, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 443, in _maybe_log_save_evaluate
    self._save_checkpoint(mode
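This second failure is not a compiler limit: it happens when the sharded checkpoint writer tries to save into a shard directory that does not exist yet. As a diagnostic only, one hedged workaround is to pre-create the expected directory tree on every worker before saving; the layout below is inferred from the error message and the helper is hypothetical, not part of optimum-neuron:

```python
import os

def ensure_shard_dirs(checkpoint_dir: str, tp_degree: int = 32) -> None:
    """Hypothetical helper: pre-create the per-rank shard directories that the
    error above reports as missing. The dp/tp/pp layout is inferred from the
    path in the message and should be adjusted to the actual topology."""
    for tp_rank in range(tp_degree):
        shard_dir = os.path.join(
            checkpoint_dir,
            "shards",
            "model",
            f"dp_rank_00_tp_rank_{tp_rank:02d}_pp_rank_00.pt.tensors",
        )
        os.makedirs(shard_dir, exist_ok=True)

# Example (diagnostic only), for the checkpoint directory named in the error:
# ensure_shard_dirs("/opt/ml/output/data/checkpoint-1000")
```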