Hi. You can change model_name_or_path in config/args_lomo.yaml to the corresponding name or path of the 65B model to do that.
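For example, a minimal sketch of the relevant line in config/args_lomo.yaml (the path below is only a placeholder for your own converted 65B checkpoint):

model_name_or_path: '/path/to/llama-65b-hf'  # placeholder: replace with the path of your 65B HF checkpoint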
I have the same problem: the LLaMA 65B model on 8 * V100 hits OOM. Are there any other parameters that should be set? My args:
# model
model_name_or_path: '/data/home/scv9622/run/LLaMA/65B_hf'
# data
dataset_name: 'multirc'
refresh: false
data_tag: 'base'
train_on_inputs: false
data_max_length: 1024
# training
# trainer
tag: 'lomo'
output_dir: 'outputs'
overwrite_output_dir: true
deepspeed: 'config/ds_config.json'
do_train: true
do_eval: false
evaluation_strategy: 'epoch'
per_device_train_batch_size: 16
per_device_eval_batch_size: 2
learning_rate: 0.03
weight_decay: 0
num_train_epochs: 10
lr_scheduler_type: 'linear'
warmup: 0.1
clip_grad_norm: 1.0
save_strategy: 'no'
save_total_limit: 0
seed: 42
#bf16: true
remove_unused_columns: false
load_best_model_at_end: false
metric_for_best_model: 'acc'
group_by_length: false
#report_to: 'wandb'
dataloader_pin_memory: false
gradient_checkpointing: true
predict_with_generate: true
{
  "bf16": {
    "enabled": false
  },
  "fp16": {
    "enabled": true
  },
  "zero_allow_untested_optimizer": true,
  "zero_force_ds_cpu_optimizer": false,
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e8,
    "stage3_max_live_parameters": 1e8,
    "stage3_max_reuse_distance": 1e8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 1,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 2,
  "wall_clock_breakdown": false
}
Error details: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 968.00 MiB (GPU 0; 31.75 GiB total capacity; 25.87 GiB already allocated; 805.75 MiB free; 29.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
Sorry, I misunderstood your question.
A batch_size of 16 and data_max_length of 1024 are not suitable for the 65B model on RTX 3090 or V100, because the activations are too large. Maybe you can set batch_size to 1 or 2, or shorten data_max_length? :)
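As a rough sketch, those changes in config/args_lomo.yaml might look like this (illustrative starting values, not tuned recommendations):

per_device_train_batch_size: 1   # was 16
data_max_length: 512             # was 1024; shorten until the activations fit

If train_micro_batch_size_per_gpu in config/ds_config.json is also consulted on your setup, keep it consistent with the value here.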
Thanks, I succeeded in running the 65B model with batch_size=1 and data_max_length=835; each epoch takes 47 min (26 GB GPU memory, 8*V100 node with NVLink). To achieve the performance noted in the paper, did the 3090 GPUs have NVLink?
Finally, we successfully train the 65B model using 8 RTX 3090 GPUs, achieving a throughput of 4.93 TGS. Utilizing such a server configuration and LOMO, the training process on 1000 samples, each containing 512 tokens, requires approximately 3.6 hours.
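(TGS here is tokens per GPU per second, so 8 GPUs at 4.93 TGS give roughly 39.4 tokens per second in total; 1000 samples of 512 tokens are about 512k tokens, which works out to roughly 13,000 seconds, i.e. the quoted 3.6 hours.)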
Glad to hear that!
To speed up, you can turn off loss scaling (e.g. use BF16) and use grad clip instead of grad norm (set clip_grad_norm in the config to None and clip_grad_value to 1.0 or so) to save the extra computation. BTW, your speed (47 min for 1000 samples with 835 tokens) is already faster than the performance in the paper.
Hi, I ran the 65B llama model with batch_size=1 and data_max_length=512 (32 GB GPU memory, 8 * V100 node), but it failed. Could you tell me your successful config? I tried 65B and 33B with lomo and lomo+lora, and all of them failed.
This is my args_lomo.yaml file.
This is my ds_config.json file.
Solved here. #28
Hi, I'd like to run a 65B llama with LOMO. What config should I use to run the training on an 8*RTX 3090 machine? It would be very nice if you added config/args_lomo.yaml and config/ds_config.json for 65B models. Thanks.