GraphPKU / PiSSA

PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (NeurIPS 2024 Spotlight)
https://arxiv.org/abs/2404.02948

memory issue #7

Closed Leosgp closed 6 months ago

Leosgp commented 6 months ago

Hello, I would like to know why, when I run PiSSA on three A100s, it reports insufficient GPU memory, both for TinyLlama and for LLaMA-7B. Here are my parameters for TinyLlama:

CUDA_VISIBLE_DEVICES=3,5,7 yes | head -n 3 | python train.py \
    --model_name_or_path /home/algo/pretrain_model/TinyLlama-1.1B-Chat-v1.0 \
    --data_path data/train-00000-of-00005-a1278ede4e8c5cdb.json \
    --dataset_split train[:10000] \
    --dataset_field instruction output \
    --output_dir /home/algo/zhengkaiyuan/b \
    --init_lora_weights pissa \
    --report_to wandb \
    --merge_and_save True \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 12 \
    --save_strategy "steps" \
    --save_steps 10000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True
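(For context, the --init_lora_weights pissa flag asks the script to initialize the adapter from the principal singular values and vectors of the base weights. A minimal sketch of the equivalent setup using recent peft releases, which expose PiSSA through LoraConfig(init_lora_weights="pissa"); the model id, rank, and target modules below are illustrative placeholders, not taken from this issue:)

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Sketch only (assumes peft >= 0.11, which supports PiSSA initialization directly).
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=16,                                                    # adapter rank (placeholder)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder module list
    init_lora_weights="pissa",  # initialize A/B from the principal singular components
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()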

fxmeng commented 6 months ago

Thank you for your interest in PiSSA. Could you provide more detailed error information? Additionally, if you need to train on multiple GPUs, it is recommended to add these two flags to your script, which will help reduce the memory required:

--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
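(In case it helps to see where these flags land: a minimal sketch of the same FSDP settings expressed through transformers.TrainingArguments, which is what the script's argument parser populates; the output directory and batch settings are placeholders. FSDP only takes effect when the script is launched with a distributed launcher such as torchrun, as in the command further down.)

from transformers import TrainingArguments

# Sketch only: the two CLI flags above correspond to these TrainingArguments fields.
# "full_shard auto_wrap" shards parameters, gradients, and optimizer states across
# GPUs instead of keeping a full replica on every device.
training_args = TrainingArguments(
    output_dir="output",                                     # placeholder
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=12,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
)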
Leosgp commented 6 months ago

Thank you for replying, and this is the error information :

File "train.py", line 216, in train() File "train.py", line 203, in train trainer.train() File "/anaconda3/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 1859, in train return inner_training_loop( File "/anaconda3/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 3138, in training_step loss = self.compute_loss(model, inputs) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/transformers/trainer.py", line 3161, in compute_loss outputs = model(inputs) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl return forward_call(args, kwargs) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 186, in forward return self.gather(outputs, self.output_device) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 203, in gather return gather(outputs, output_device, dim=self.dim) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 104, in gather res = gather_map(outputs) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 95, in gather_map return type(out)((k, gather_map([d[k] for d in outputs])) File "", line 8, in init File "/anaconda3/envs/myenv/lib/python3.8/site-packages/transformers/utils/generic.py", line 393, in __post_init__ for idx, element in enumerate(iterator): File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 95, in return type(out)((k, gather_map([d[k] for d in outputs])) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 89, in gather_map return Gather.apply(target_device, dim, outputs) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/autograd/function.py", line 598, in apply return super().apply(args, **kwargs) # type: ignore[misc] File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 75, in forward return comm.gather(inputs, ctx.dim, ctx.target_device) File "/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 231, in gather return torch._C._gather(tensors, dim, destination) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU

This time I also tried TinyLlama on a single A100, and it reported the same memory issue.
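(Note that the traceback above ends inside torch.nn.parallel: when train.py is launched as a single python process with several GPUs visible, the Hugging Face Trainer falls back to nn.DataParallel, which gathers every replica's outputs onto one device and can push that device over its memory limit. A small, generic sketch for inspecting per-GPU memory during training to see that imbalance; the helper name is made up for illustration:)

import torch

def print_gpu_memory(tag=""):
    """Illustrative helper: print used memory on every visible GPU.

    With nn.DataParallel, the gathering device typically shows much higher
    usage than the others, matching the OOM in the traceback above.
    """
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        used_gib = (total - free) / 1024**3
        print(f"{tag} GPU {i}: {used_gib:.1f} / {total / 1024**3:.1f} GiB used")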

fxmeng commented 6 months ago

Please use FSDP full_shard mode, for example:

torchrun --nproc_per_node=4 --master_port= train.py \
    --model_name_or_path $BASE_MODEL \
    --output_dir $OUTPUT \
    --data_path meta-math/MetaMathQA \
    --dataset_split "train[:100000]" \
    --dataset_field query response \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --bf16 True \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --report_to none