aws-neuron / aws-neuron-sdk


Compilation failed for llama3-70B model - Estimated peak HBM usage (22.839451) exceeds 16GB. Neff won't be able to load on chip #884

Open ak-org opened 1 month ago

ak-org commented 1 month ago

I am trying to fine-tune llama3-70B on trn1.32xlarge using distributed training. It failed with the following error:

Container image: f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.2-ubuntu20.04",

model_id: "meta-llama/Meta-Llama-3-70B"

RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/no-user/neuroncc_compile_workdir/0fe2f6bd-7622-4e70-840e-f46008169a5e/model.MODULE_13231525432959154856+55d6a20f.hlo.pb', '--output', '/tmp/no-user/neuroncc_compile_workdir/0fe2f6bd-7622-4e70-840e-f46008169a5e/model.MODULE_13231525432959154856+55d6a20f.neff', '--model-type=transformer', '--distribution-strategy=llm-training', '--enable-saturate-infinity', '-O2', '--model-type=transformer', '--verbose=35']: 2024-05-10T23:47:40Z [XCG815]  Estimated peak HBM usage (22.839451) exceeds 16GB. Neff won't be able to load on chip - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new
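For reference, the compiler flags visible in that failing command are the ones torch-neuronx forwards to neuronx-cc; they are usually supplied through the NEURON_CC_FLAGS environment variable before the first graph is compiled. A minimal sketch of that wiring (the flag set is copied from the log above, not a recommendation):

```python
import os

# torch-neuronx appends NEURON_CC_FLAGS to every neuronx-cc invocation,
# so this must be set before the first graph is traced and compiled.
# Flags copied from the failing command in the error above.
os.environ["NEURON_CC_FLAGS"] = (
    "--model-type=transformer "
    "--distribution-strategy=llm-training "
    "--enable-saturate-infinity "
    "-O2"
)
```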
mrnikwaws commented 1 month ago

Thank you for creating the issue. Can you share a bit more about your setup? To be specific, the instance count, batch size, sequence length, and TP degree you are using.

Note: If you are using llama3-70B with an 8K sequence length (as opposed to our 4K example), the activation memory goes up. That increase in activation memory results in higher scratch-pad usage (as shown in the warning). You can try a higher TP degree, for example by using two instances (though this will reduce performance). It may also be possible to reduce the batch size enough to fit on one instance.
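As an illustration of the knobs mentioned above, here is a minimal sketch using the optimum-neuron training arguments that appear in the traceback later in this thread. The exact argument names, in particular tensor_parallel_size, depend on the optimum-neuron version and are an assumption here, not the poster's actual configuration:

```python
from optimum.neuron import NeuronTrainingArguments

# Sketch only: trade per-device batch size for gradient accumulation to cut
# activation memory (and therefore the compiler's scratch-pad / HBM estimate),
# and spread the model over more NeuronCores with a higher TP degree.
training_args = NeuronTrainingArguments(
    output_dir="/opt/ml/output/data",
    per_device_train_batch_size=1,   # smaller micro-batch -> less activation memory
    gradient_accumulation_steps=8,   # keeps the effective global batch size
    tensor_parallel_size=32,         # name assumed; check your optimum-neuron version
    bf16=True,
)
```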

We are also working on an upcoming feature that should help with this in a future release, so please watch for announcements.

ak-org commented 1 month ago

Hi, I am using two trn1.32xlarge instances. I had a batch size of 8, a sequence length of 4096, and a TP degree of 32.

I will retry with a smaller batch size and report back the outcome.
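For context, a two-instance job like this is typically launched through the SageMaker PyTorch estimator with the Neuron DLC from the first comment. A minimal sketch under that assumption; the role, region, and hyperparameter names are placeholders and must match whatever train.py actually parses:

```python
from sagemaker.pytorch import PyTorch

region = "us-east-1"  # placeholder
estimator = PyTorch(
    entry_point="train.py",        # the script shown in the traceback below
    role="<execution-role-arn>",   # placeholder
    image_uri=(
        f"763104351884.dkr.ecr.{region}.amazonaws.com/"
        "pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.2-ubuntu20.04"
    ),
    instance_count=2,                  # two trn1.32xlarge instances
    instance_type="ml.trn1.32xlarge",
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={
        # Illustrative names only; they must match train.py's argument parser.
        "model_id": "meta-llama/Meta-Llama-3-70B",
        "per_device_train_batch_size": 8,
        "block_size": 4096,
        "tensor_parallel_size": 32,
    },
)
# estimator.fit({"train": "s3://<bucket>/<train-prefix>/"})
```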

ak-org commented 1 month ago

I tried smaller tp_degree and batch sizes, but the job still failed, this time while saving a checkpoint:

AlgorithmError: ExecuteUserScriptError: ExitCode 1
ErrorMessage "RuntimeError: Parent directory /opt/ml/output/data/checkpoint-1000/shards/model/dp_rank_00_tp_rank_26_pp_rank_00.pt.tensors does not exist.
Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 115, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1328, in train
    result = super().train(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 990, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 443, in _maybe_log_save_evaluate
    self._save_checkpoint(mode
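This second failure is not a compiler limit: it happens when the sharded checkpoint writer tries to save into a shard directory that does not exist yet. As a diagnostic only, one hedged workaround is to pre-create the expected directory tree on every worker before saving; the layout below is inferred from the error message and the helper is hypothetical, not part of optimum-neuron:

```python
import os

def ensure_shard_dirs(checkpoint_dir: str, tp_degree: int = 32) -> None:
    """Hypothetical helper: pre-create the per-rank shard directories that the
    error above reports as missing. The dp/tp/pp layout is inferred from the
    path in the message and should be adjusted to the actual topology."""
    for tp_rank in range(tp_degree):
        shard_dir = os.path.join(
            checkpoint_dir,
            "shards",
            "model",
            f"dp_rank_00_tp_rank_{tp_rank:02d}_pp_rank_00.pt.tensors",
        )
        os.makedirs(shard_dir, exist_ok=True)

# Example (diagnostic only), for the checkpoint directory named in the error:
# ensure_shard_dirs("/opt/ml/output/data/checkpoint-1000")
```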