axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/

Deepspeed not partitioning the model across GPUs #1129

Open mariokostelac opened 7 months ago

mariokostelac commented 7 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

When I run LoRA fine-tuning on 1 or multiple GPUs, memory usage stays roughly the same, even when deepspeed/zero3_bf16.json is used. I expect per-GPU memory usage to drop to roughly half / ~25% of the original once I make 2 or 4 GPUs available to the training.

I ran https://github.com/huggingface/accelerate/blob/31fd2b1ad6b9c1cd1480568399a311b3caaf62dc/examples/by_feature/deepspeed_with_config_support.py with different numbers of GPUs available (via CUDA_VISIBLE_DEVICES) and I can confirm that deepspeed + accelerate works as I expect in that repo, but not with axolotl.

Current behaviour

Memory usage stays the same regardless of the number of GPUs used. That means I can't finetune models that don't fit on 1 GPU.

Steps to reproduce

./venv/bin/accelerate launch -m axolotl.cli.train examples/code-llama/7b/lora.yml --deepspeed deepspeed/zero3_bf16.json 

I've tried

./venv/bin/accelerate launch --use_deepspeed --deepspeed_config_file deepspeed/zero3_bf16.json  -m axolotl.cli.train examples/code-llama/7b/lora.yml --deepspeed deepspeed/zero3_bf16.json 

but the result was the same.

Config yaml

examples/code-llama/7b/lora.yml

base_model: codellama/CodeLlama-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: CodeLlamaTokenizer
is_llama_derived_model: true

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
eval_sample_packing: False
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

deepspeed:

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main/ece0211996f0f546d8ec1380ab3f7e180fd9c2c0

Acknowledgements

winglian commented 7 months ago

@mariokostelac can you try this branch pls? https://github.com/OpenAccess-AI-Collective/axolotl/compare/deepspeed-low-cpu-mem?expand=1

winglian commented 7 months ago

One thing to also keep in mind is that it's harder to compare with the HF trainer implementation, because the accelerate example is a more bare-bones implementation: https://github.com/huggingface/accelerate/blob/31fd2b1ad6b9c1cd1480568399a311b3caaf62dc/examples/by_feature/deepspeed_with_config_support.py#L18

mariokostelac commented 7 months ago

@winglian I'll try now and report.

To clarify, I'm having a problem with GPU memory: there is almost no difference in memory usage. I'm training on 24GB GPUs and no matter how many of them I give to the training, memory usage is pretty much constant (~20GB on each GPU).

mariokostelac commented 7 months ago

The error I get now:

ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/axolotl/src/axolotl/cli/train.py", line 43, in <module>
    fire.Fire(do_cli)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ubuntu/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/ubuntu/axolotl/src/axolotl/train.py", line 65, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/home/ubuntu/axolotl/src/axolotl/utils/models.py", line 546, in load_model
    raise err
  File "/home/ubuntu/axolotl/src/axolotl/utils/models.py", line 424, in load_model
    model = LlamaForCausalLM.from_pretrained(
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2892, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
[2024-01-16 14:20:50,567] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3328) of binary: /home/ubuntu/axolotl/venv/bin/python3
Traceback (most recent call last):
  File "/home/ubuntu/axolotl/./venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    deepspeed_launcher(args)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/axolotl/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
winglian commented 7 months ago

I think you're misunderstanding how deepspeed zero3 works. It doesn't just simply decrease the VRAM requirements per GPU when you add more GPUs. Did you try loading a larger model like solar10B perhaps?

mariokostelac commented 7 months ago

I think you're misunderstanding how deepspeed zero3 works. It doesn't just simply decrease the VRAM requirements per GPU when you add more GPUs.

When using Zero3 optimisation, I'd expect exactly that to happen. As the world size increases, each GPU should get a smaller and smaller partition of the model, down to the point where model wrapping becomes the blocker and we can't partition the model further. The graphics at https://www.deepspeed.ai/2021/03/07/zero3-offload.html support that statement, and the text also states that zero3 shards model parameters, optimizer states, and gradients.
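
To put rough numbers on it, here is a back-of-the-envelope sketch (my assumptions: bf16 weights and gradients plus fp32 Adam states, roughly 16 bytes per parameter for full-parameter training, ignoring activations and gather buffers):

# Rough ZeRO-3 sizing of model states only: bf16 weights (2) + bf16 grads (2)
# + fp32 master weights/momentum/variance (12), i.e. about 16 bytes per
# parameter, sharded across the data-parallel ranks. Activations and gather
# buffers come on top of this.
params = 6.74e9  # roughly CodeLlama-7B
for world_size in (1, 2, 4):
    per_gpu_gb = params * 16 / world_size / 1024**3
    print(f"{world_size} GPU(s): ~{per_gpu_gb:.0f} GB of model states per GPU")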

When I played with the accelerate examples, it was very visible that memory usage was dropping as I added more GPUs.

Deepspeed provides helper functions to estimate the memory usage per GPU for a given model. I've loaded codellama-13b-hf in 8bit and am providing the output below. This doesn't assume any adapter, so the numbers are for full finetuning.


In [9]: deepspeed.runtime.zero.stage3.estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4)
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 6738M total params, 131M largest layer params.
  per CPU  |  per GPU |   Options
  169.45GB |   0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  169.45GB |   0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  150.62GB |   3.63GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  150.62GB |   3.63GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    2.93GB |  28.73GB | offload_param=none, offload_optimizer=none, zero_init=1
  150.62GB |  28.73GB | offload_param=none, offload_optimizer=none, zero_init=0

In [10]: deepspeed.runtime.zero.stage3.estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1)
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 6738M total params, 131M largest layer params.
  per CPU  |  per GPU |   Options
  169.45GB |   0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  169.45GB |   0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  150.62GB |  13.04GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  150.62GB |  13.04GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.73GB | 113.45GB | offload_param=none, offload_optimizer=none, zero_init=1
   37.65GB | 113.45GB | offload_param=none, offload_optimizer=none, zero_init=0

You can see that deepspeed estimates more GPU memory needed for each GPU when fewer GPUs are used.
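
For reference, a minimal sketch of how such an estimate can be produced (assuming transformers and deepspeed are installed; I'm using the 7B model from the config above as an example, loaded once on CPU just for the estimate):

# Load the model on CPU and ask deepspeed for a per-CPU / per-GPU memory estimate.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

# Same helper as in the output above, once for 4 GPUs and once for 1 GPU per node.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)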

I was thinking about creating a (hopefully) small script showing that deepspeed can use less memory than axolotl currently does. Would that help?

winglian commented 7 months ago

you may need to set the following in the deepspeed json as well

        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
winglian commented 7 months ago

here's a comparison of a full finetune of Mixtral 8x7B on 8x A6000s with the first 24 layers frozen. The first screenshot is without the offload optimizer and param settings in the deepspeed json, and the second is with them set. I believe for zero3 the expectation is that you should be offloading as much as possible

[Screenshot: VRAM usage without the offload settings] [Screenshot: VRAM usage with the offload settings]
Nero10578 commented 7 months ago

here's a comparison of a full finetune of Mixtral 8x7B on 8x A6000s with the first 24 layers frozen. The first screenshot is without the offload optimizer and param settings in the deepspeed json, and the second is with them set. I believe for zero3 the expectation is that you should be offloading as much as possible

[Screenshot: VRAM usage without the offload settings] [Screenshot: VRAM usage with the offload settings]

As far as I understand, the whole point of deepspeed is to reduce VRAM usage per GPU when using more GPUs, no? Meanwhile you're saying it would only reduce VRAM usage when offloading to CPU?

I get that the overall VRAM usage will not decrease with more GPUs, as that will only happen when offloading to CPU. But we should be getting lower VRAM use per GPU as more GPUs are added right? That doesn't seem to be happening with Zero2 or Zero3 with Axolotl right now.

mariokostelac commented 7 months ago

I've found the solution to the issue I was referring to originally.

Loading the model in 8bit breaks deepspeed initialisation and model parameter sharding.

Setting

load_in_8bit: false
load_in_4bit: false

and explicitly setting

  "bf16": {
    "enabled": true
  },

in zero3.json does what I wanted: it shards the model during loading. It allowed me to fine-tune a 13b model in bfloat16 on 24GB cards, which wasn't possible before. I think everything works as expected as long as you don't load in 4/8bit. It might be worth printing a warning in these cases.

CPU offloading just further reduces the memory pressure from the training, but I wasn't even able to load the model when load_in_8bit was set to true.
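
For anyone who wants to see the shard-at-load behaviour outside of axolotl, here is a minimal illustrative sketch (not my exact script) based on transformers' documented HfDeepSpeedConfig integration; the parameter check at the end is just one way to confirm that the weights were partitioned:

# Run under a distributed launcher, e.g. `accelerate launch --num_processes 2 check.py`
# (the file name is just an example).
import os
import deepspeed
import torch
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

deepspeed.init_distributed()

# Constructing this object (and keeping it alive) before from_pretrained()
# is what makes transformers load the model through deepspeed.zero.Init,
# so each rank only materialises its own shard.
dschf = HfDeepSpeedConfig(ds_config)  # noqa: F841

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.bfloat16
)

# ZeRO-3 partitioned parameters are freed locally (numel() == 0) and carry
# ds_numel with the full element count.
local = sum(p.numel() for p in model.parameters())
full = sum(getattr(p, "ds_numel", p.numel()) for p in model.parameters())
print(f"rank {os.environ.get('RANK', '0')}: {local} elements held locally, {full} in the full model")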

mariokostelac commented 7 months ago

As far as I understand, the whole point of deepspeed is to reduce VRAM usage per GPU when using more GPUs, no? Meanwhile you're saying it would only reduce VRAM usage when offloading to CPU?

I get that the overall VRAM usage will not decrease with more GPUs, as that will only happen when offloading to CPU. But we should be getting lower VRAM use per GPU as more GPUs are added right? That doesn't seem to be happening with Zero2 or Zero3 with Axolotl right now.

I believe you're right. CPU offloading with zero3 is just another tool to reduce VRAM pressure, but it's not free: you pay with latency. Zero3 without CPU offloading should work well too, and it turns out it does, but not when loading in 8bit.

Nero10578 commented 7 months ago

I've found the solution to the issue I was referring to originally.

Loading the model in 8bit breaks deepspeed initialisation and model parameter sharding.

Setting

load_in_8bit: false
load_in_4bit: false

and explicitly setting

  "bf16": {
    "enabled": true
  },

in zero3.json does what I wanted: it shards the model during loading. It allowed me to fine-tune a 13b model in bfloat16 on 24GB cards, which wasn't possible before. I think everything works as expected as long as you don't load in 4/8bit. It might be worth printing a warning in these cases.

CPU offloading just further reduces the memory pressure from the training, but I wasn't even able to load the model when load_in_8bit was set to true.

Wow, I didn't know that. So we're supposed to not use load-in-4-bit or 8-bit for sharding to work in axolotl? Seems like something is broken in the codebase then, because afaik it should still work when loading in 4-bit or 8-bit.

Not loading in 4-bit or 8-bit will cause much longer training times and also negates the benefit of running zero3, imo, since I can already train 34b models using load-in-4bit on 24GB cards. I was hoping for a faster way to train 70b models with 2x 24GB cards.

mariokostelac commented 7 months ago

Don't take me as an expert on what's possible and what's not, but I'm fairly confident that loading in 8bit doesn't load the model in a sharded and partitioned fashion as you'd expect, at least not the way it's implemented in axolotl at the moment. I've managed to reproduce it in a small script using transformers, accelerate, and deepspeed too, and haven't managed to get it working with load_in_8bit.

Is the code broken? Maybe. An issue in the peft repository suggests deepspeed and int8 didn't work in April last year. That might still be the case.

As an axolotl user, I'd like to get a warning at least. @winglian what do you say about printing a warning when zero3 and load_in_8bit are used together?
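
Something along these lines, as a purely illustrative sketch (the function and config field names are mine, not axolotl's actual validation code):

# Hypothetical guard: warn when a ZeRO-3 deepspeed config is combined with
# 8-bit/4-bit loading, since the model then isn't sharded at load time.
import json
import logging

LOG = logging.getLogger(__name__)

def warn_if_zero3_with_quantized_load(cfg: dict) -> None:
    ds_path = cfg.get("deepspeed")
    if not ds_path:
        return
    with open(ds_path) as f:
        ds_config = json.load(f)
    stage = ds_config.get("zero_optimization", {}).get("stage")
    if stage == 3 and (cfg.get("load_in_8bit") or cfg.get("load_in_4bit")):
        LOG.warning(
            "DeepSpeed ZeRO-3 is configured together with load_in_8bit/load_in_4bit; "
            "the model will not be partitioned across GPUs at load time."
        )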

Nero10578 commented 7 months ago

Don't take me as an expert on what's possible and what's not, but I'm fairly confident that loading in 8bit doesn't load the model in a sharded and partitioned fashion as you'd expect, at least not the way it's implemented in axolotl at the moment. I've managed to reproduce it in a small script using transformers, accelerate, and deepspeed too, and haven't managed to get it working with load_in_8bit.

Is the code broken? Maybe. An issue in the peft repository suggests deepspeed and int8 didn't work in April last year. That might still be the case.

As an axolotl user, I'd like to get a warning at least. @winglian what do you say about printing a warning when zero3 and load_in_8bit are used together?

Ohh, it doesn't work even with more "bare-metal" python code? I personally haven't tried it, but from reading the hype about QLoRA and how we can train using 24GB GPUs now, I assumed it meant we could load in 4-bit and shard with zero3. Guess not yet, then.

Thanks for testing this out and letting us all know. Also I agree a warning would be nice that deepspeed only works without load in 4-bit or 8-bit.

mariokostelac commented 7 months ago

Ohh, it doesn't work even with more "bare-metal" python code?

I can post the code that helped me figure out that switching to int8 doesn't work; it's not long. It doesn't prove that it can't work, of course, but I'm getting more confident in that, because searching for int8 in the deepspeed docs only turns up inference.

Nondzu commented 7 months ago

@mariokostelac what about zero1 and zero2? Should I also set load_in_8bit and load_in_4bit to false?

qeternity commented 5 months ago

@mariokostelac what about zero1 and zero2? Should I also set load_in_8bit and load_in_4bit to false?

zero1 and zero2 work fine in 8bit and 4bit

winglian commented 5 months ago

@mariokostelac There is an upstream fix in transformers that addresses this for DeepSpeed Zero 3 now (part of the qlora+FSDP fixes that went out a couple of weeks ago)

mariokostelac commented 3 months ago

@winglian are these fixes used in the main branch?

Nero10578 commented 3 months ago

@winglian are these fixes used in the main branch?

Would be great if we can use it with axolotl. I have a bunch of 24GB GPUs.