Is it correct to set up fsdp for a machine (V100) that does not support bf16?

Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

https://otter-ntu.github.io/

MIT License


Is it correct to set up fsdp for a machine (V100) that does not support bf16? #274

Open xmc-andy opened 10 months ago

xmc-andy commented 10 months ago

compute_environment: LOCAL_MACHINE
distributed_type: no
downcast_bf16: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 20687

Luodian commented 10 months ago

yes it seems correct!
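For readers with the same question, the precision choice can be sketched as a small helper. This is only an illustration, not code from Otter or accelerate: the function name `pick_mixed_precision` is hypothetical, and it assumes that native bf16 needs CUDA compute capability 8.0 or higher (Ampere and newer), while the V100 reports 7.0. With PyTorch, the major version can be read from `torch.cuda.get_device_capability()`.

```python
def pick_mixed_precision(compute_capability_major: int) -> str:
    """Return a value for the `mixed_precision` key of an accelerate config.

    Assumption (not from the thread): native bf16 requires compute
    capability 8.0+; older GPUs such as the V100 (7.0) fall back to fp16.
    """
    return "bf16" if compute_capability_major >= 8 else "fp16"


# V100 has compute capability 7.0, A100 has 8.0.
print(pick_mixed_precision(7))
print(pick_mixed_precision(8))
```

On a V100 this yields `fp16`, which matches the `mixed_precision: fp16` line in the config above.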

xmc-andy commented 10 months ago

OK, thank you. I also want to ask about another problem: the memory usage of the main thread is higher than that of the other threads, and it overflows. How should I solve this? Do you have any suggestions?


Luodian commented 10 months ago

I think you can refer to this link to see if you can do something.

https://github.com/huggingface/accelerate/blob/6b3e559926afc4b9a127eb7762fc523ea0ea656a/src/accelerate/big_modeling.py#L514

I know that you may be able to set device_map=balanced_low_0 to decrease GPU memory usage on rank 0 (rank 0 performs the gather operations, and sometimes other parameters are also moved to rank 0, which can lead to OOM).

Luodian commented 10 months ago

I have seen some code do this before, but I have not used it myself, so you may want to look into how the device_map mechanism works and how to set it. You are also welcome to share your experience with us, to help more users tackle this problem on V100 GPUs~
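To make the idea behind balanced_low_0 concrete, here is a toy sketch. It is not accelerate's implementation, which works from per-device memory budgets; the function name, the layer sizes, and the greedy placement are all assumptions made for illustration. It only shows the documented intent: split the model evenly over every GPU except GPU 0, and put on GPU 0 only what does not fit elsewhere. In Transformers, the option itself is passed as the string `device_map="balanced_low_0"` to `from_pretrained`.

```python
def toy_balanced_low_0(layer_sizes, num_gpus):
    """Toy placement: fill GPUs 1..n-1 evenly, spill the rest onto GPU 0.

    layer_sizes: per-layer sizes in arbitrary units (hypothetical values).
    Returns (device_map, used), where device_map maps layer index to GPU
    index and used[i] is the total size placed on GPU i.
    """
    used = [0] * num_gpus
    device_map = {}
    if num_gpus < 2:
        # With a single GPU there is nothing to balance.
        for idx, size in enumerate(layer_sizes):
            device_map[idx] = 0
            used[0] += size
        return device_map, used

    total = sum(layer_sizes)
    # Even share for each GPU other than GPU 0 (ceiling division).
    budget = -(-total // (num_gpus - 1))
    gpu = 1
    for idx, size in enumerate(layer_sizes):
        # Move on once the current GPU would exceed its share.
        while gpu < num_gpus and used[gpu] + size > budget:
            gpu += 1
        target = gpu if gpu < num_gpus else 0  # overflow lands on GPU 0
        device_map[idx] = target
        used[target] += size
    return device_map, used


device_map, used = toy_balanced_low_0([4, 4, 4, 4, 4, 4], num_gpus=3)
print(device_map)
print(used)
```

In this example GPU 0 stays empty, which leaves room for the extra work that lands on rank 0. When the layers do not divide evenly, the leftover layers end up on GPU 0.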

xmc-andy commented 10 months ago

Thank you for sharing your suggestions, I will try them.

xmc-andy commented 10 months ago

I tried setting device_map to 'auto', 'balanced', 'balanced_low_0' and 'sequential' in turn. Unfortunately, it still runs out of memory on 3 V100s (with the ViT unfrozen). Of these, I think balanced_low_0 might work if I had enough cards; I will try it further if I get 4 V100s.