axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Cannot copy out of meta tensor; no data!, Dolphin Mixtral 2.7, DeepSpeed Zero3 #1249

Open luijait opened 7 months ago

luijait commented 7 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

ZeRO-2 loads the weights fine, so ZeRO-3 should load them as well. I'm limited to using ZeRO-3 instead of ZeRO-2 because of GPU (VRAM) constraints.

Current behaviour

Running with torchrun --standalone --master_port 37229 --nproc_per_node=9 axolotl/cli/train.py ../../../config.yml (and likewise when launching through accelerate) fails with NotImplementedError: Cannot copy out of meta tensor; no data!:

Traceback (most recent call last):
  File "/home/omegarig30/axolotl/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/omegarig30/axolotl/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/omegarig30/axolotl/src/axolotl/train.py", line 80, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/home/omegarig30/axolotl/src/axolotl/utils/models.py", line 624, in load_model
    raise err
  File "/home/omegarig30/axolotl/src/axolotl/utils/models.py", line 616, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3850, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4284, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 839, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
  File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
[2024-02-02 18:19:16,538] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 22965 closing signal SIGTERM
[2024-02-02 18:19:17,053] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 22966) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
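Editor's note on the error itself (an assumption about the mechanism, not a confirmed diagnosis): the failing frame is a .to(device) call on a tensor that lives on PyTorch's meta device, i.e. a tensor with shape/dtype metadata but no backing storage. That situation typically arises when DeepSpeed ZeRO-3's deferred parameter materialisation meets the bitsandbytes 4-bit loading path (load_in_4bit: true in the config below). The sketch below only reproduces the low-level error, not axolotl's or transformers' loading code:

```python
import torch

# A meta tensor carries shape and dtype but has no data behind it.
w = torch.empty(4096, 4096, device="meta")

try:
    # Moving it to a real device would require copying data that does not exist,
    # which is the same NotImplementedError raised inside
    # transformers/integrations/bitsandbytes.py in the traceback above.
    w.to("cuda" if torch.cuda.is_available() else "cpu")
except NotImplementedError as err:
    print(err)  # -> Cannot copy out of meta tensor; no data! ...
```

If that reading is right, the conflict is between ZeRO-3's meta-device parameter initialisation and 4-bit quantised loading in general, rather than anything specific to Dolphin Mixtral 2.7.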

Steps to reproduce

Just run the command above.

Config yaml

base_model: cognitivecomputations/dolphin-2.7-mixtral-8x7b
model_type: MixtralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: false

load_in_8bit: false
load_in_4bit: true
strict: false
device_map: null
model_config:
  output_router_logits: false

datasets:

dataset_prepared_path:
val_set_size: 0.05
eval_sample_packing: false
output_dir: /home/omegarig30/models/0dai_mixtral
resume_from_checkpoint:
hf_use_auth_token:

adapter: qlora
lora_model_dir:

sequence_len: 16384
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:

wandb_project: 0dai_mixtral1
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002
torch_compile: false
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0.1
save_steps: 0.1
save_total_limit: 2
eval_sample_packing: true
debug:
deepspeed: /home/omegarig30/axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.001
special_tokens:
  eos_token: "<|im_end|>"
tokens:
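Editor's note: not a fix, just a quick way to confirm that this setup combines ZeRO-3 sharding with 4-bit loading, the combination the traceback points at. The file paths are the ones from this report, and zero_optimization.stage is a standard DeepSpeed config key; everything else is an assumption for illustration:

```python
import json
import yaml  # pip install pyyaml

# The axolotl config shown above.
with open("config.yml") as f:
    cfg = yaml.safe_load(f)

# The DeepSpeed config it references, e.g. deepspeed_configs/zero3_bf16.json.
with open(cfg["deepspeed"]) as f:
    ds = json.load(f)

stage = ds.get("zero_optimization", {}).get("stage")
if stage == 3 and cfg.get("load_in_4bit"):
    print("ZeRO-3 + load_in_4bit detected: weights start out on the meta device, "
          "which matches the 'Cannot copy out of meta tensor' failure above.")
```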

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

Latest version

Acknowledgements

komninoschatzipapas commented 1 month ago

Getting the same error with https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B.