Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model (a low-resource Chinese llama + lora approach, with a structure based on alpaca)
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

WSL2 docker: NotImplementedError: Cannot copy out of meta tensor; no data! #191

Status: Closed (thusinh1969 closed this issue 1 year ago)

thusinh1969 commented 1 year ago

Ubuntu 20.04 WSL2 Docker container with an RTX 3090, CUDA 11.8 / cuDNN 8.6, Python 3.10, dependencies installed with pip install -r requirements.txt. The same Docker environment runs oobabooga and automatic1111 fine; this container was cloned exclusively for this repo, nothing shared.

python finetune.py \
    --model_path /data/oobabooga/text-generation-webui/models/pyllama_data/7B/hf/7B \
    --data_path /data/oobabooga/text-generation-webui/training/datasets/AUG_3_thohay_VI_EN.json \
    --output_path ./training_output/NGUYEN-vicuna-7B-lora

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

/data/oobabooga/text-generation-webui/models/pyllama_data/7B/hf/7B
Loading checkpoint shards: 100%|██████████| 33/33 [00:27<00:00, 1.21it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Using custom data configuration default-696b202e8f88a4af
Found cached dataset json (/home/steve/.cache/huggingface/datasets/json/default-696b202e8f88a4af/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 1/1 [00:00<00:00, 381.47it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /home/steve/.cache/huggingface/datasets/json/default-696b202e8f88a4af/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1b21efe29a5f1a62.arrow and /home/steve/.cache/huggingface/datasets/json/default-696b202e8f88a4af/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-45c440eb1a004493.arrow
100%|██████████| 17121/17121 [00:55<00:00, 307.11ex/s]
100%|██████████| 200/200 [00:00<00:00, 341.20ex/s]

Traceback (most recent call last):
  File "/data/oobabooga/Chinese-Vicuna/finetune.py", line 240, in <module>
    trainer = transformers.Trainer(
  File "/data/oobabooga/Chinese-Vicuna/env/lib/python3.10/site-packages/transformers/trainer.py", line 478, in __init__
    self._move_model_to_device(model, args.device)
  File "/data/oobabooga/Chinese-Vicuna/env/lib/python3.10/site-packages/transformers/trainer.py", line 717, in _move_model_to_device
    model = model.to(device)
  File "/data/oobabooga/Chinese-Vicuna/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/data/oobabooga/Chinese-Vicuna/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [the frame above repeats as _apply recurses through the submodules]
  File "/data/oobabooga/Chinese-Vicuna/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/data/oobabooga/Chinese-Vicuna/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
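From the last frames, the failure happens while transformers.Trainer moves the model with model.to(device): at least one parameter is still a meta tensor (a placeholder with no allocated storage), which usually means accelerate loaded or offloaded that weight without ever materializing it. A small helper to confirm this before constructing the Trainer (just a sketch; model is whatever finetune.py loaded):

def report_meta_params(model):
    # List parameters that still live on the "meta" device, i.e. have no real storage.
    # Anything returned here will make model.to(device) fail exactly as in the trace above.
    return [name for name, param in model.named_parameters() if param.device.type == "meta"]

meta_params = report_meta_params(model)
print(f"{len(meta_params)} parameters still on the meta device:", meta_params[:5])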

Any help is appreciated. Thanks, Steve

thusinh1969 commented 1 year ago

It does NOT seem to run in any Docker container! I hit the same issue in another Docker setup: CUDA 11.8, Python 3.10, a 2080 Ti, Ubuntu 20.04 on both the host and inside the container itself. Any idea why? I can run it on native Windows without problems, by the way.

I tried this to avoid offloading onto the CPU, but the issue remained:

model = LlamaForCausalLM.from_pretrained(
    args.model_path,
    return_dict=True,
    load_in_8bit=False,        # use the GPU directly; 8-bit offloading triggers the Torch error
    torch_dtype=torch.float16,
    device_map=device_map,     # "auto"
)

Thanks, Steve

NotImplementedError: Cannot copy out of meta tensor; no data!
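One thing that may be worth checking in this configuration is where accelerate actually placed the weights: with device_map="auto", layers that get offloaded can be left as meta tensors. A minimal sketch that instead pins every module to a single GPU and prints the resulting placement (it assumes the full fp16 model fits in VRAM; the path is the one from the original command):

import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "/data/oobabooga/text-generation-webui/models/pyllama_data/7B/hf/7B",
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map={"": 0},  # put every module on cuda:0 instead of letting "auto" offload
)
print(model.hf_device_map)  # should contain only GPU entries, no "cpu" or "disk"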

thusinh1969 commented 1 year ago

Update: I can NOT even run it on the native Ubuntu 20.04 host! Help.

thusinh1969 commented 1 year ago

Fixed:

pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560
pip install transformers==4.28.1

pip uninstall peft
pip install git+https://github.com/huggingface/peft@e536616888d51b453ed354a6f1e243fecb02ea08
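A quick check (not part of the fix itself) to confirm the environment actually picked up the pinned versions before re-running finetune.py:

import peft
import transformers

print("transformers:", transformers.__version__)
print("peft:", peft.__version__)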

Also in finetune.py:

model = LlamaForCausalLM.from_pretrained(
    args.model_path,
    load_in_8bit=False,        # fastest
    torch_dtype=torch.float16,
    device_map=device_map,
)
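For context, the LoRA wrapping that finetune.py applies after this load looks roughly like the sketch below; the hyperparameters are illustrative placeholders (r=8 on q_proj/v_proj is consistent with the 4194304 trainable parameters reported in the log above, but the authoritative values are in finetune.py):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # placeholder rank, consistent with the logged count
    lora_alpha=16,                        # placeholder scaling factor
    target_modules=["q_proj", "v_proj"],  # common LLaMA attention projections for LoRA
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the run above reported 4194304 trainable params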