axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0
7.58k stars 822 forks

Mixtral 8x7B full finetune with DS zero3: Assertion error #954

Open casper-hansen opened 9 months ago

casper-hansen commented 9 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

The model should start training after the DeepSpeed fix on main.

Current behaviour

The model loads and does not OOM, but DeepSpeed raises an AssertionError when checking that all tensors have the same dtype:

assert len(set(t.dtype for t in tensors)) == 1
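To see which parameters trip this assertion, one option is to group parameter names by dtype before handing the model to DeepSpeed. The helper below is a minimal sketch, not part of axolotl or DeepSpeed: `group_param_names_by_dtype` is a hypothetical name, and it only assumes that each parameter object exposes a `.dtype` attribute, as PyTorch parameters do. The stand-in objects in the demo are placeholders so the snippet runs without a GPU.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_param_names_by_dtype(named_params):
    """Map each dtype to the names of the parameters that use it.

    `named_params` is any iterable of (name, param) pairs where `param`
    has a `.dtype` attribute (hypothetical helper, for illustration only).
    """
    groups = defaultdict(list)
    for name, param in named_params:
        groups[param.dtype].append(name)
    return dict(groups)


# Stand-in parameters; with a real model you would pass
# model.named_parameters() instead.
fake_params = [
    ("layers.0.weight", SimpleNamespace(dtype="bfloat16")),
    ("layers.0.gate.weight", SimpleNamespace(dtype="float32")),
    ("layers.1.weight", SimpleNamespace(dtype="bfloat16")),
]

by_dtype = group_param_names_by_dtype(fake_params)
for dtype, names in by_dtype.items():
    print(dtype, len(names), names[:3])
```

If the result has more than one key for the trainable parameters, that suggests the mixed-dtype condition the assertion guards against.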

Traceback

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/axolotl/src/axolotl/train.py", line 129, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1540, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1678, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1666, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1225, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1552, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 314, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 687, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 522, in defragment
    assert len(set(t.dtype for t in tensors)) == 1
AssertionError

Steps to reproduce

Reuse the config that I have provided and load the model on 8x A100.

Config yaml

base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

# loss is high without this
model_config:
  output_router_logits: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: <your_data>
dataset_prepared_path: 
val_set_size: 0.1
output_dir: /workspace

adapter: 
lora_model_dir: 

sequence_len: 32768
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0005

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_ratio: 0.1
evals_per_epoch: 4
eval_table_size:
eval_table_max_new_tokens: 
saves_per_epoch: 1
debug:
deepspeed: zero3.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.8

axolotl branch-commit

main

Acknowledgements

dumpmemory commented 9 months ago

I have faced hang issues after 1:30 hours of training time with ft and zero3

codybum commented 9 months ago

With the same config I get OOM while training on 5 x nodes with 8 x H100 each.

Every config I have tried other than the example 4-bit qlora results in an OOM or some other error.

[2023-12-18 00:52:30,840] [ERROR] [axolotl.load_model:453] [PID:99] [RANK:7] CUDA out of memory. Tried to allocate 112.00 MiB (GPU 7; 79.11 GiB total capacity; 78.12 GiB already allocated; 40.62 MiB free; 78.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Cody
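The OOM message above points at `max_split_size_mb` and `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of setting that variable from Python follows; `set_allocator_split_size` is a hypothetical helper and the value 128 is an arbitrary assumption, not a recommendation from this thread. Note that in the log above reserved memory (78.23 GiB) is almost equal to allocated memory (78.12 GiB), so fragmentation may not be the cause here and this setting may not help.

```python
import os


def set_allocator_split_size(max_split_size_mb: int) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF so PyTorch's caching allocator limits block splitting.

    Hypothetical helper for illustration. It overwrites any existing value,
    and it must run before PyTorch initializes CUDA to have any effect.
    """
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# 128 is an assumed example value; tune it for your workload.
configured = set_allocator_split_size(128)
print(configured)
```

The same effect can be had by exporting the variable in the shell before launching training.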

mynewstart commented 9 months ago

I have faced hang issues after 1:30 hours of training time with ft and zero3

same question

dumpmemory commented 9 months ago

I have faced hang issues after 1:30 hours of training time with ft and zero3

same question

You can try updating NCCL to 2.19.3.

DhruvaBansal00 commented 6 months ago

Any updates on this error? I am seeing the same thing with Llama-v2 full finetune using zero3.

casper-hansen commented 6 months ago

I think this was solved by changing bf16 from auto to true in your DeepSpeed config.
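A minimal sketch of that change, assuming the DeepSpeed config follows the common `"bf16": {"enabled": "auto"}` layout. The contents of the `zero3.json` used in this issue are not shown in the thread, so the example dict and the helper name `force_bf16` are hypothetical.

```python
import copy
import json


def force_bf16(config: dict) -> dict:
    """Return a copy of a DeepSpeed config dict with bf16.enabled set to True.

    Hypothetical helper: it assumes the config keeps bf16 settings under a
    top-level "bf16" key, and creates that key if it is missing.
    """
    patched = copy.deepcopy(config)
    patched.setdefault("bf16", {})["enabled"] = True
    return patched


# Assumed example config; replace with json.load() on your own file.
example = {"bf16": {"enabled": "auto"}, "zero_optimization": {"stage": 3}}
patched = force_bf16(example)
print(json.dumps(patched, indent=2))
```

To apply it to a file, load the JSON, pass the dict through the helper, and write the result back with `json.dump`.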

NanoCode012 commented 5 months ago

Does anyone still have this issue after trying casper's suggestion?