acon96 / home-llm

A Home Assistant integration & Model to control your smart home using a Local LLM

Training fails #37

Closed RobertLukan closed 8 months ago

RobertLukan commented 8 months ago

First I would like to say thank you very much for sharing this project with us. I managed to get Home model running and I can control some lights at my home. I am still exploring voice setup in Home Assistant(wyoming, Rhasspy, TTS, STT).

I have Nvidia 4070 GPU with "12GB" of VRAM that runs on another server. I have access to some High End GPUs, but I need to learn how to train models on my hardware before I use precious time of High End GPUs.

For now I just tried to re-do your work but I am getting an error(as shown below). Does anyone have any idea what could be the problem ?

(.venv) root@AI-NVIDIA-VM:~/AI/home-llm# python3 train.py \
    --run_name home-llm-rev11_1 \
    --base_model microsoft/phi-2 \
    --add_pad_token \
    --add_chatml_tokens \
    --bf16 \
    --train_dataset data/home_assistant_alpaca_merged_train.json \
    --test_dataset data/home_assistant_alpaca_merged_test.json \
    --learning_rate 1e-5 \
    --save_steps 1000 \
    --micro_batch_size 2 --gradient_checkpointing \
    --ctx_size 2048 \
    --use_lora --lora_rank 32 --lora_alpha 64 \
    --lora_modules fc1,fc2,Wqkv,out_proj \
    --lora_modules_to_save wte,lm_head.linear --lora_merge

Loading model 'microsoft/phi-2'...
Model will target using 10997.6875MiB of VRAM
config.json: 100%|████████████████| 866/866 [00:00<00:00, 7.95MB/s]
configuration_phi.py: 100%|████████████████| 9.26k/9.26k [00:00<00:00, 39.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-2:

acon96 commented 8 months ago

Microsoft recently made changes to the base model on Hugging Face that broke bf16 training, so you should use a previous revision. On top of that, they renamed the internal modules that you need to target with LoRA; I still need to update the README to reflect that.

So 2 fixes:

  1. Add model_kwargs["revision"] = "accfee56d8988cae60915486310362db5831b1bd" at line 120 of train.py.
  2. Change the lora_modules and lora_modules_to_save arguments to --lora_modules fc1,fc2,q_proj,v_proj,dense --lora_modules_to_save embed_tokens,lm_head
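Applied together, the two fixes look roughly like this. This is a sketch, not the actual train.py: it assumes train.py builds a model_kwargs dict that is eventually passed to AutoModelForCausalLM.from_pretrained, and it only collects the values rather than running any training.

```python
# Sketch of the two fixes (assumed structure, not the real train.py).

# Fix 1: pin the pre-breakage revision of microsoft/phi-2 so bf16
# training still works. This would go around line 120 of train.py,
# before the call to AutoModelForCausalLM.from_pretrained(...).
model_kwargs = {}
model_kwargs["revision"] = "accfee56d8988cae60915486310362db5831b1bd"

# Fix 2: the renamed LoRA target modules in the updated phi-2 code,
# matching the corrected command-line arguments above.
lora_modules = ["fc1", "fc2", "q_proj", "v_proj", "dense"]
lora_modules_to_save = ["embed_tokens", "lm_head"]

print(model_kwargs["revision"])
print(",".join(lora_modules))
```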
Anto79-ops commented 8 months ago

hey all, thanks for sharing this. I can appreciate the fact that you want to start off with the existing model, and see if it works.

@lunamidori5 tried here with the new dolphin7b model, but it failed at the end for other reasons.

Just curious what plans you have for other models, and whether you're willing to host/share the model. Thanks!

RobertLukan commented 8 months ago

Thank you guys for your quick help. I can train based on @Anto79-ops's example. It will take about 40 hours, though I'm not really sure, as the estimated time to finish fluctuates.

I will try to apply the patches provided from @acon96

My goal is kind of simple: to try different models that can effectively drive Home Assistant while not being too chatty. I know, a lot to ask :) But this is my side project now. I have installed facial recognition at my home so that I am greeted when I come home. Now I want to connect a chat bot to those automations so I can have a small chat when I arrive :)

colaborat0r commented 8 months ago

Nice. What did you use for facial recognition? And did you manage to get a custom voice?

RobertLukan commented 8 months ago

An RPi CM4 with my own custom carrier board and 2 cameras (the project is in my GitHub repositories), using mediamtx to send 2 x h264 streams to Frigate (with a Google Coral). Double-Take then hooks into CompreFace (with another Google Coral), so recognised "objects/people" come back to Home Assistant. This part is working quite well; I'm just making some minor adjustments.

From this point onwards I am improvising and testing options. Right now I am using Node-RED, which listens to Home Assistant events and makes an HTTP call to Rhasspy to do TTS, and this is working quite well. Unfortunately, Rhasspy is not really being developed anymore, and its "replacement" is wyoming-satellite, which is not working great yet. So I think I will wait a bit until (hopefully) wyoming-satellite becomes a mature product. In the meantime I will play a bit with chat bots.
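For reference, the HTTP call the Node-RED flow makes can be sketched in Python against Rhasspy's REST API (POST /api/text-to-speech with the sentence as the plain-text body). The host and port below are assumptions; adjust them for your Rhasspy instance.

```python
# Hedged sketch: asking Rhasspy to speak a sentence via its REST API.
# The base URL is an assumption; Rhasspy's default HTTP port is 12101.
from urllib import request

RHASSPY_URL = "http://rhasspy.local:12101"  # assumed address

def build_tts_request(text: str, base_url: str = RHASSPY_URL) -> request.Request:
    """Build the POST request that asks Rhasspy to speak `text` aloud."""
    return request.Request(
        f"{base_url}/api/text-to-speech",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

# Actually sending it (commented out so the sketch does not need a live server):
# with request.urlopen(build_tts_request("Welcome home!")) as resp:
#     print(resp.status)

req = build_tts_request("Welcome home!")
print(req.method, req.full_url)
```

A Home Assistant automation (or Node-RED node) triggered by the facial-recognition event would fire this request with a personalised greeting.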

lunamidori5 commented 8 months ago

> Thank you guys for your quick help. I can train based on @Anto79-ops example. It will take about 40 hours, not sure really, as time to finish fluctuates.
>
> I will try to apply the patches provided from @acon96
>
> My goal is kind of a simple, to try different models that can effectively drive home assistant and they are somehow not to chatty. I know a lot to ask :) But this is my side project now. I have installed facial recognition at my home, so that I am greeted when I come home. Now I want to connect chat bot with those automations so I can have a small chat when I come home :)

There are a lot of things wrong with that command example; it will fail at the end. I recommend doing a code review before training.

RobertLukan commented 8 months ago

Ok understood. I will find another way. Thank you for your help.

acon96 commented 8 months ago

> Thank you guys for your quick help. I can train based on @Anto79-ops example. It will take about 40 hours, not sure really, as time to finish fluctuates. I will try to apply the patches provided from @acon96 My goal is kind of a simple, to try different models that can effectively drive home assistant and they are somehow not to chatty. I know a lot to ask :) But this is my side project now. I have installed facial recognition at my home, so that I am greeted when I come home. Now I want to connect chat bot with those automations so I can have a small chat when I come home :)

> There are alot of things wrong with that command example, it will fail at the end. I recommend doing a code review before training.

If you are attempting to train a model that is not Phi-1.5 or Phi-2, then you will need very different settings for the training run. Also, I was only using the custom script because Phi was not supported by any of the existing fine-tuning scripts when I started this project.

You are probably better off taking the dataset and using a script such as OpenAccess-AI-Collective/axolotl for fine tuning more popular model architectures.

acon96 commented 8 months ago

The README is updated with the fixed training invocation.