Hi. You can change model_name_or_path in config/args_lomo.yaml to the corresponding name or path of the 65B model to do that.
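For example, a minimal sketch of the relevant line in config/args_lomo.yaml (the path below is only a placeholder for your own converted 65B checkpoint):

model_name_or_path: '/path/to/llama-65b-hf'  # placeholder: replace with the path of your 65B HF checkpoint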
I have the same problem: the LLaMA 65B model on 8 * V100 hits OOM. Are there any other parameters that should be set? My args:
# model
model_name_or_path: '/data/home/scv9622/run/LLaMA/65B_hf'
# data
dataset_name: 'multirc'
refresh: false
data_tag: 'base'
train_on_inputs: false
data_max_length: 1024
# training
# trainer
tag: 'lomo'
output_dir: 'outputs'
overwrite_output_dir: true
deepspeed: 'config/ds_config.json'
do_train: true
do_eval: false
evaluation_strategy: 'epoch'
per_device_train_batch_size: 16
per_device_eval_batch_size: 2
learning_rate: 0.03
weight_decay: 0
num_train_epochs: 10
lr_scheduler_type: 'linear'
warmup: 0.1
clip_grad_norm: 1.0
save_strategy: 'no'
save_total_limit: 0
seed: 42
#bf16: true
remove_unused_columns: false
load_best_model_at_end: false
metric_for_best_model: 'acc'
group_by_length: false
#report_to: 'wandb'
dataloader_pin_memory: false
gradient_checkpointing: true
predict_with_generate: true
{
  "bf16": {
    "enabled": false
  },
  "fp16": {
    "enabled": true
  },
  "zero_allow_untested_optimizer": true,
  "zero_force_ds_cpu_optimizer": false,
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e8,
    "stage3_max_live_parameters": 1e8,
    "stage3_max_reuse_distance": 1e8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 1,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 2,
  "wall_clock_breakdown": false
}
Error details: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 968.00 MiB (GPU 0; 31.75 GiB total capacity; 25.87 GiB already allocated; 805.75 MiB free; 29.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
Sorry, I misunderstood your question.
A batch_size of 16 and data_max_length of 1024 are not suitable for the 65B model on RTX 3090 or V100, because the activations are too large. Maybe you can set batch_size to 1 or 2, or shorten data_max_length? :)
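As a rough sketch, those changes in config/args_lomo.yaml might look like this (illustrative starting values, not tuned recommendations):

per_device_train_batch_size: 1   # was 16
data_max_length: 512             # was 1024; shorten until the activations fit

If train_micro_batch_size_per_gpu in config/ds_config.json is also consulted on your setup, keep it consistent with the value here.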
Thanks, I succeeded in running the 65B model with batch_size=1 and data_max_length=835; each epoch takes 47 min (26 GB GPU memory, 8*V100 node with NVLink). To achieve the performance noted in the paper, did the 3090 GPUs have NVLink?
Finally, we successfully train the 65B model using 8 RTX 3090 GPUs, achieving a throughput of 4.93 TGS. Utilizing such a server configuration and LOMO, the training process on 1000 samples, each containing 512 tokens, requires approximately 3.6 hours.
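(TGS here is tokens per GPU per second, so 8 GPUs at 4.93 TGS give roughly 39.4 tokens per second in total; 1000 samples of 512 tokens are about 512k tokens, which works out to roughly 13,000 seconds, i.e. the quoted 3.6 hours.)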
Glad to hear that!
To speed up, you can turn off loss scaling (e.g. use BF16) and use grad clip instead of grad norm (set clip_grad_norm in the config to None and clip_grad_value to 1.0 or so) to save the extra computation. BTW, your speed (47 min for 1000 samples with 835 tokens) is already faster than the performance in the paper.
Hi, I ran the 65B llama model with batch_size=1 and data_max_length=512 (32 GB GPU memory, 8 * V100 node), but it failed. Could you tell me your successful config? I tried 65B and 33B with lomo and lomo+lora, and all of them failed.
This is my args_lomo.yaml file.
This is my ds_config.json file.
Solved here. #28
Hi, I'd like to run a 65B llama with LOMO. What config should I use to run the training on an 8*RTX 3090 machine? It would be very nice if you added config/args_lomo.yaml and config/ds_config.json for 65B models. Thanks.