alibaba / FederatedScope

An easy-to-use federated learning platform
Apache License 2.0

Question about LLaMA based federated training #742

Closed Polaris-JZ closed 7 months ago

Polaris-JZ commented 8 months ago


I used your config llama.yaml for federated training, but the training log shows that the loss is neither decreasing nor converging, and the test loss is very high. Is there a problem somewhere?

I only changed two places in llama.yaml:

Screenshot 2023-12-27 at 14 23 08

rayrayraykk commented 8 months ago

The YAML you used is only a test case. Could you please try the configs with tuned hyperparameters?

Polaris-JZ commented 8 months ago

Thanks for your advice.

I tried them: the test loss became lower, but the training loss still fluctuates a lot.

Additionally, I created another dataset following the Alpaca format, and the same thing happens: the loss is neither decreasing nor converging, and the test loss is very high (around 3000). Could you please give me some advice?

rayrayraykk commented 8 months ago

Assuming your dataset is good enough, you can try to adjust the following hyper-parameters:

Polaris-JZ commented 8 months ago

Thanks for your advice. I have another problem: when I try to use DeepSpeed acceleration, I get the error KeyError: 'Non-existent config key: llm.accelation'. My config is:

Screenshot 2024-01-01 at 20 20 08

rayrayraykk commented 8 months ago

Sorry for the outdated documentation. Please use the following configs to set up DeepSpeed (for other usage, please refer to

    # ---------------------------------------------------------------------- #
    # Deepspeed related options
    # ---------------------------------------------------------------------- #
    cfg.llm.deepspeed = CN()
    cfg.llm.deepspeed.use = False
    cfg.llm.deepspeed.ds_config = ''

We'll fix it ASAP.
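
To illustrate why the old key fails, here is a minimal sketch of strict config merging, assuming the config system rejects any key that is not declared in the defaults (as the error message above suggests). The `merge_strict` helper and the `ds_config.json` path are hypothetical, written only for this example; they are not FederatedScope code.

```python
import copy


def merge_strict(defaults, overrides, prefix=""):
    """Merge `overrides` into a copy of `defaults`, rejecting undeclared keys.

    Hypothetical stand-in for a strict config merge; not FederatedScope's
    actual implementation.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        full_key = f"{prefix}{key}"
        if key not in merged:
            # Same shape as the error reported in this thread.
            raise KeyError(f"Non-existent config key: {full_key}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_strict(merged[key], value, prefix=f"{full_key}.")
        else:
            merged[key] = value
    return merged


# Defaults mirroring the options listed above.
DEFAULTS = {"llm": {"deepspeed": {"use": False, "ds_config": ""}}}

# Declared keys merge cleanly ("ds_config.json" is a placeholder path).
ok = merge_strict(
    DEFAULTS,
    {"llm": {"deepspeed": {"use": True, "ds_config": "ds_config.json"}}},
)

# The outdated key is not declared, so the merge is rejected.
try:
    merge_strict(DEFAULTS, {"llm": {"accelation": {"use": True}}})
    error_message = None
except KeyError as exc:
    error_message = exc.args[0]
```

In a YAML config, the equivalent change would be to nest `use` and `ds_config` under `llm.deepspeed` instead of the old key.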

Polaris-JZ commented 8 months ago

Thanks for your reply. I was wondering if there was any method to enable multi-gpu training for a client or under centralized training setting.

rayrayraykk commented 7 months ago

You can enable DataParallel training by setting cfg.train.data_para_dids to the list of torch.nn.DataParallel device IDs (the config entry is cfg.train.data_para_dids = []  # torch.nn.DataParallel devices).
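
As a rough sketch of what such a setting typically translates to, the helper below wraps a model in `torch.nn.DataParallel` when a non-empty device list is given. The function name `maybe_wrap_data_parallel` is hypothetical, and it is an assumption (not confirmed in this thread) that an empty list leaves the model unwrapped.

```python
def maybe_wrap_data_parallel(model, device_ids):
    """Wrap `model` in torch.nn.DataParallel if `device_ids` is non-empty.

    Hypothetical helper for illustration; not FederatedScope's trainer code.
    """
    if not device_ids:
        # Assumption: an empty list means single-device training.
        return model

    import torch  # imported lazily so the empty-list path needs no GPU stack

    # DataParallel expects the module's parameters on the first listed device.
    model = model.to(f"cuda:{device_ids[0]}")
    return torch.nn.DataParallel(model, device_ids=device_ids)
```

For example, `maybe_wrap_data_parallel(model, [0, 1])` would replicate the model across GPUs 0 and 1 and split each batch between them, which corresponds to setting `cfg.train.data_para_dids = [0, 1]`.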