alibaba / FederatedScope

An easy-to-use federated learning platform
https://www.federatedscope.io
Apache License 2.0

Question about LLaMA based federated training #742

Closed: Polaris-JZ closed this issue 7 months ago

Polaris-JZ commented 8 months ago

Hi,

I used your llama.yaml config to run federated training, but the training log shows that the loss is not decreasing/converging, and the test loss is very high. I was wondering whether there is a problem with my setup.

I only changed two places in llama.yaml:

[Screenshot: the two modified fields in llama.yaml, 2023-12-27]

rayrayraykk commented 8 months ago

The YAML you used is only a test case. Could you please try the configs in https://github.com/alibaba/FederatedScope/tree/llm/federatedscope/llm/baseline/exp_yaml/alpaca, which have tuned hyperparameters?

Polaris-JZ commented 8 months ago

Thanks for your advice.

I have tried https://github.com/alibaba/FederatedScope/blob/llm/federatedscope/llm/baseline/exp_yaml/alpaca/alpaca_federate.yaml; the test loss becomes lower, but the training loss still fluctuates a lot.

Additionally, I tried to create another dataset following the Alpaca format, and the same thing happens: the loss is not decreasing/converging, and the test loss is very high (around 3000). Could you please give me some advice?

rayrayraykk commented 8 months ago

Assuming your dataset is good enough, you can try to adjust the following hyperparameters:
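
A minimal sketch of the kind of knobs that are typically worth adjusting in the YAML (the key names follow FederatedScope's config schema, but the values below are placeholders for experimentation, not tuned recommendations from this thread):

    # Hypothetical tuning sketch: keys follow FederatedScope's config schema,
    # values are placeholders to experiment with, not recommendations.
    train:
      local_update_steps: 30     # fewer local steps per round can reduce client drift
      optimizer:
        lr: 0.00005              # a smaller learning rate often stabilizes LLM fine-tuning
    grad:
      grad_clip: 1.0             # clip gradients to tame loss spikes
    federate:
      total_round_num: 500       # more, shorter rounds instead of long local training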

Polaris-JZ commented 8 months ago

Thanks for your advice. I have another problem: when I try to use DeepSpeed acceleration, the following error is raised: KeyError: 'Non-existent config key: llm.accelation'. My config is:

[Screenshot: config snippet using the llm.accelation key, 2024-01-01]

rayrayraykk commented 8 months ago

Sorry for the outdated documentation. Please use the following configs to set up DeepSpeed (for other options, please refer to https://github.com/alibaba/FederatedScope/blob/llm/federatedscope/core/configs/cfg_llm.py):

    # ---------------------------------------------------------------------- #
    # Deepspeed related options
    # ---------------------------------------------------------------------- #
    cfg.llm.deepspeed = CN()
    cfg.llm.deepspeed.use = False     # set True to enable DeepSpeed
    cfg.llm.deepspeed.ds_config = ''  # path to the DeepSpeed JSON config file
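
On the YAML side, a minimal sketch of how these keys might be set (the JSON path is a placeholder for your own DeepSpeed config file):

    # Hypothetical user-side YAML matching the keys above; the path is a placeholder.
    llm:
      deepspeed:
        use: True
        ds_config: 'path/to/ds_config.json'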

We'll fix it ASAP.

Polaris-JZ commented 8 months ago

Thanks for your reply. I was wondering whether there is any way to enable multi-GPU training for a client, or under the centralized training setting.

rayrayraykk commented 7 months ago

You can set cfg.train.data_para_dids (default [], the list of device IDs passed to torch.nn.DataParallel) to enable DataParallel training.
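
For example, a minimal YAML sketch (the device IDs are assumptions; use the GPUs available on your machine):

    # Hypothetical example: hand two local GPUs to torch.nn.DataParallel.
    train:
      data_para_dids: [0, 1]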