Open ak-org opened 1 month ago
Thank you for creating the issue. Can you share a bit more about your setup? Specifically:
Note: If you are using llama3-70B with an 8K sequence length (as opposed to the 4K in our example), the activation memory will go up. This increase in activation memory results in high scratch-pad usage (as shown in the warning). You can try a higher TP degree, for example spanning two instances (though this will slow down performance). It may also be possible to reduce the batch size enough to fit on one instance.
We are also working on an upcoming feature which should help in future releases, so please watch out for announcements.
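To make the trade-off above concrete, here is a rough back-of-envelope sketch of how per-core activation memory scales with batch size, sequence length, and TP degree. The multiplier constants (`bytes_per_act`, `acts_per_layer`) and the llama3-70B-like shape are illustrative assumptions, not measured Neuron figures:

```python
# Rough estimate of per-NeuronCore activation memory under tensor parallelism.
# Assumption: activations scale linearly with batch * seq_len * hidden * layers
# and are sharded evenly across the TP group. Constants are illustrative only.

def activation_gib_per_core(batch, seq_len, hidden, layers,
                            tp_degree, bytes_per_act=2, acts_per_layer=16):
    total_bytes = (batch * seq_len * hidden * layers
                   * bytes_per_act * acts_per_layer)
    return total_bytes / tp_degree / 2**30  # GiB per core

# llama3-70B-like shape: hidden=8192, 80 layers
base = activation_gib_per_core(8, 4096, 8192, 80, tp_degree=32)
longer = activation_gib_per_core(8, 8192, 8192, 80, tp_degree=32)
print(f"4K seq: {base:.0f} GiB/core, 8K seq: {longer:.0f} GiB/core")
```

Under these assumptions, doubling the sequence length doubles per-core activation memory, while doubling the TP degree or halving the batch size each halve it, which is why the suggestions above trade throughput for headroom.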
Hi, I am using two trn1.32xlarge instances, with a batch size of 8, a sequence length of 4096, and a TP degree of 32.
I will retry with smaller batch size and report back the outcome.
I tried smaller tp_degree and batch size values, but compilation still failed.
```
AlgorithmError: ExecuteUserScriptError: ExitCode 1
ErrorMessage "RuntimeError: Parent directory /opt/ml/output/data/checkpoint-1000/shards/model/dp_rank_00_tp_rank_26_pp_rank_00.pt.tensors does not exist.
Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 115, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1328, in train
    result = super().train(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 990, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 443, in _maybe_log_save_evaluate
    self._save_checkpoint(mode
```
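The RuntimeError in the traceback says the shard's parent directory under `checkpoint-1000/shards/model/` does not exist when the save is attempted. A minimal sketch of a defensive workaround, assuming the fix is to create the parent directory before writing the shard (`save_shard` is a hypothetical helper, not an optimum-neuron API, and the path below is a stand-in):

```python
import os
import tempfile

def save_shard(path, data: bytes):
    # Hypothetical helper: ensure the parent directory exists before writing,
    # which avoids the "Parent directory ... does not exist" RuntimeError
    # when a worker saves a shard into a directory another rank has not created.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)

# Demonstrate against a throwaway directory mimicking the checkpoint layout.
root = tempfile.mkdtemp()
shard = os.path.join(root, "checkpoint-1000", "shards", "model",
                     "dp_rank_00_tp_rank_26_pp_rank_00.pt.tensors")
save_shard(shard, b"\x00")
print(os.path.exists(shard))
```

In a multi-worker run the same race can appear when only rank 0 creates the checkpoint tree, so `exist_ok=True` keeps the call safe if several ranks create it concurrently.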
I am trying to finetune llama3-70B on trn1.32xlarge using distributed training. It failed with the following error:
Container image: f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.18.2-ubuntu20.04",
model_id: "meta-llama/Meta-Llama-3-70B"