axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

LISA can not run on multi-GPU setting #1474

Open AgentLLM opened 6 months ago

AgentLLM commented 6 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

LISA should run in a multi-GPU setting.

Current behaviour

LISA only runs on a single GPU; switching to a multi-GPU setting leads to the error below.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
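
For context, LISA freezes most transformer layers and only unfreezes a small random subset every few steps, so on any given step most parameters genuinely produce no gradient, which is what DDP's reducer is complaining about. A toy, self-contained sketch of the effect (toy model and illustrative names only, not axolotl's actual code):

import random
import torch
import torch.nn as nn

# Toy stand-in for a stack of transformer layers (illustrative only).
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(8)])

def lisa_style_freeze(model, n_active=2, step=0):
    """Freeze every layer, then unfreeze a random subset, LISA-style."""
    layers = list(model)
    for layer in layers:
        layer.requires_grad_(False)
    picker = random.Random(step)  # seeding by step keeps ranks in sync
    for idx in picker.sample(range(len(layers)), k=n_active):
        layers[idx].requires_grad_(True)

lisa_style_freeze(model, n_active=2, step=0)
model(torch.randn(4, 16)).sum().backward()

# Most parameters never receive a gradient on this step -- under DDP the
# reducer flags them unless find_unused_parameters=True (and even then,
# every rank must agree on which layers are active).
no_grad = [n for n, p in model.named_parameters() if p.grad is None]
print(f"{len(no_grad)} of {len(list(model.parameters()))} params got no grad")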

Steps to reproduce

Below is the multi-GPU accelerate config.

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: 
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 2

Config yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: 
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 2
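
For reference, the LISA-specific portion of an axolotl config such as lisa.yaml is just a few top-level options; the sketch below uses the option names from axolotl's LISA support, but treat the exact keys and values as assumptions and check the docs for your commit.

# LISA options (illustrative values; verify option names for your commit)
lisa_n_layers: 4                      # how many layers are unfrozen at a time
lisa_step_interval: 20                # re-pick the active layers every N steps
lisa_layers_attribute: model.layers   # where the decoder layers live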

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.9

axolotl branch-commit

main

Acknowledgements

winglian commented 6 months ago

Can you try setting ddp_find_unused_parameters: true ?
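
i.e. at the top level of your axolotl config (lisa.yaml), something like:

# ask DDP to detect parameters that don't take part in the loss
ddp_find_unused_parameters: true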

AgentLLM commented 6 months ago

Can you try setting ddp_find_unused_parameters: true ?

I added ddp_find_unused_parameters: true to lisa.yaml and hit the same bug.

winglian commented 6 months ago

are you using FSDP or deepspeed?

winglian commented 6 months ago

It seems this might be a DDP-specific issue. I've tried a few things, like setting a deterministic seed for the random layer picker and adding ddp_find_unused_parameters: true, to no avail.
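
(For reference, "deterministic seed for the random layer picker" here means deriving the choice from the global step so every rank activates the same layers; a rough sketch of the idea, not axolotl's actual implementation:)

import random

def pick_active_layers(num_layers: int, n_active: int, global_step: int) -> list:
    # Seed from the global step so every DDP rank selects the same subset,
    # keeping the set of trainable parameters consistent across ranks.
    rng = random.Random(global_step)
    return sorted(rng.sample(range(num_layers), k=n_active))

# Both "ranks" compute the same selection for the same step.
assert pick_active_layers(32, 4, 40) == pick_active_layers(32, 4, 40)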

AgentLLM commented 6 months ago

are you using FSDP or deepspeed?

I'm not sure which one I am using. This is my first time using your LLM framework, and I've only added 'ddp_find_unused_parameters: true' to the 'lisa.yaml' file without making any other changes.

lhl commented 6 months ago

btw, here's a discussion on deepspeed issues w/ LISA: https://github.com/OptimalScale/LMFlow/issues/726 and a potential workaround: https://github.com/OptimalScale/LMFlow/issues/726#issuecomment-2041335788