FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Can you share deepspeed configs for pretraining #132

Closed. one-award closed this issue 1 year ago.

one-award commented 1 year ago

Thank you for sharing your great work!

You mentioned that you used deepspeed stage 1 in pretraining. Is it possible to share the deepspeed config and explain how to use it? Also, aside from gradient checkpointing, was there another way to reduce GPU memory?

Thank you in advance.

staoxiao commented 1 year ago

Hi, you can enable deepspeed and gradient checkpointing easily by adding the arguments --deepspeed ds_config.json and --gradient_checkpointing. An example ds_config.json is:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 12,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 1
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

To use less memory, you can use deepspeed stage 2 or 3.
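
For example, a minimal sketch of the zero_optimization block changed to stage 2 is shown below; it would replace the "zero_optimization" section in the config above, with the rest unchanged. The offload_optimizer part is an optional extra (not mentioned above) that moves optimizer states to CPU to save more GPU memory at the cost of speed:

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        }
    },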

one-award commented 1 year ago

Following your comment and the guide, I ran it like this:

torchrun --nproc_per_node 1 \
-m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
--output_dir output \
--model_name_or_path BAAI/bge-large-en \
--train_data examples/pretrain/toy_pretrain_data.jsonl \
--learning_rate 2e-5 \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--max_seq_length 512 \
--logging_steps 10 \
--gradient_checkpointing

but the following error occurred:

AttributeError: 'RetroMAEForPretraining' object has no attribute 'gradient_checkpointing_enable'

staoxiao commented 1 year ago

We have updated the code, and it should work now. To use gradient checkpointing, you should enable both deepspeed and gradient checkpointing, and the deepspeed stage can be set to 0 or 1. We find that using gradient checkpointing without deepspeed results in an error.
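
For reference, a run combining both options might look like the following; this is just the earlier command with the two extra flags, assuming ds_config.json is the stage-1 config shown above:

torchrun --nproc_per_node 1 \
-m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
--output_dir output \
--model_name_or_path BAAI/bge-large-en \
--train_data examples/pretrain/toy_pretrain_data.jsonl \
--learning_rate 2e-5 \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--max_seq_length 512 \
--logging_steps 10 \
--deepspeed ds_config.json \
--gradient_checkpointing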

staoxiao commented 1 year ago

Besides, we use deepspeed and gradient checkpointing in contrastive learning. For the retromae pre-training, we don't use any method to reduce the memory cost.

one-award commented 1 year ago

In my environment (transformers==4.31.0, deepspeed==0.10.3), following your advice, I got the following error when saving the model after training:

_save() got an unexpected keyword argument 'state_dict'

I succeeded in reducing the memory usage after making the following modification:

def _save(self, output_dir: Optional[str] = None, state_dict=None):
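
Newer transformers versions call the trainer's _save with a state_dict keyword, so an override that only accepts output_dir raises the TypeError above. A minimal sketch of such an override inside the custom trainer class, assuming the model exposes save_pretrained (the actual RetroMAE trainer may save differently), is:

from typing import Optional
import os

def _save(self, output_dir: Optional[str] = None, state_dict=None):
    # Accept the state_dict keyword that newer transformers versions pass in.
    output_dir = output_dir if output_dir is not None else self.args.output_dir
    os.makedirs(output_dir, exist_ok=True)
    # Forward state_dict to the model's save method; save_pretrained here is
    # an assumption about the model class, not the repo's exact code.
    self.model.save_pretrained(output_dir, state_dict=state_dict)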

one-award commented 1 year ago

I have a question about pretraining. According to your description in the recipe:

For english, we trained our model on 48 A100(40G) GPUs with a large batch size of 32,784. For chinese, we trained our model on 24 A100(40G) GPUs with a large batch size of 19,200.

The batch sizes per device are 683 and 800 for English and Chinese respectively. Is it possible to fit such a batch size on a single A100 (40G) with BAAI/bge-large-en?

staoxiao commented 1 year ago

Yes, we use fp16, gradient checkpointing, and deepspeed.

one-award commented 1 year ago

Thanks for the kind explanation.

mechigonft commented 11 months ago

Hello, I have the same error. How can I fix it? TypeError: _save() got an unexpected keyword argument 'state_dict'