acon96 / home-llm

A Home Assistant integration & Model to control your smart home using a Local LLM

Training fails #37

Closed RobertLukan closed 8 months ago

RobertLukan commented 8 months ago

First I would like to say thank you very much for sharing this project with us. I managed to get Home model running and I can control some lights at my home. I am still exploring voice setup in Home Assistant(wyoming, Rhasspy, TTS, STT).

I have Nvidia 4070 GPU with "12GB" of VRAM that runs on another server. I have access to some High End GPUs, but I need to learn how to train models on my hardware before I use precious time of High End GPUs.

For now I just tried to re-do your work but I am getting an error(as shown below). Does anyone have any idea what could be the problem ?

(.venv) root@AI-NVIDIA-VM:~/AI/home-llm# python3 train.py \
    --run_name home-llm-rev11_1 \
    --base_model microsoft/phi-2 \
    --add_pad_token \
    --add_chatml_tokens \
    --bf16 \
    --train_dataset data/home_assistant_alpaca_merged_train.json \
    --test_dataset data/home_assistant_alpaca_merged_test.json \
    --learning_rate 1e-5 \
    --save_steps 1000 \
    --micro_batch_size 2 --gradient_checkpointing \
    --ctx_size 2048 \
    --use_lora --lora_rank 32 --lora_alpha 64 \
    --lora_modules fc1,fc2,Wqkv,out_proj \
    --lora_modules_to_save wte,lm_head.linear --lora_merge

Loading model 'microsoft/phi-2'...
Model will target using 10997.6875MiB of VRAM
config.json: 100%|████████████████| 866/866 [00:00<00:00, 7.95MB/s]
configuration_phi.py: 100%|████████████████| 9.26k/9.26k [00:00<00:00, 39.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-2:

acon96 commented 8 months ago

Microsoft recently made changes to the base model on Hugging Face that broke bf16 training, so you should use a previous revision. On top of that, they renamed the internal modules that you need to target with LoRA; I still need to update the README to reflect that.

So 2 fixes:

  1. Add model_kwargs["revision"] = "accfee56d8988cae60915486310362db5831b1bd" at line 120 of train.py.
  2. Change the lora_modules and lora_modules_to_save arguments to --lora_modules fc1,fc2,q_proj,v_proj,dense --lora_modules_to_save embed_tokens,lm_head
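Applied together, the two fixes look roughly like this. This is a sketch, not the actual train.py: it assumes train.py builds a model_kwargs dict that is eventually passed to AutoModelForCausalLM.from_pretrained, and it only collects the values rather than running any training.

```python
# Sketch of the two fixes (assumed structure, not the real train.py).

# Fix 1: pin the pre-breakage revision of microsoft/phi-2 so bf16
# training still works. This would go around line 120 of train.py,
# before the call to AutoModelForCausalLM.from_pretrained(...).
model_kwargs = {}
model_kwargs["revision"] = "accfee56d8988cae60915486310362db5831b1bd"

# Fix 2: the renamed LoRA target modules in the updated phi-2 code,
# matching the corrected command-line arguments above.
lora_modules = ["fc1", "fc2", "q_proj", "v_proj", "dense"]
lora_modules_to_save = ["embed_tokens", "lm_head"]

print(model_kwargs["revision"])
print(",".join(lora_modules))
```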
Anto79-ops commented 8 months ago

hey all, thanks for sharing this. I can appreciate the fact that you want to start off with the existing model, and see if it works.

@lunamidori5 tried here with the new dolphin7b model, but it failed at the end for other reasons.

Just curious what plans you have for other models, and whether you're willing to host/share the model. Thanks!

RobertLukan commented 8 months ago

Thank you guys for your quick help. I can train based on @Anto79-ops's example. It will take about 40 hours, though I'm not really sure, as the estimated time to finish fluctuates.

I will try to apply the patches provided from @acon96

My goal is kind of simple: to try different models that can effectively drive Home Assistant while not being too chatty. I know, a lot to ask :) But this is my side project now. I have installed facial recognition at my home so that I am greeted when I come home. Now I want to connect a chat bot to those automations so I can have a small chat when I arrive :)

colaborat0r commented 8 months ago

Nice. What did you use for facial recognition? And did you manage to get a custom voice?

RobertLukan commented 8 months ago

An RPi CM4 with my own custom carrier board and 2 cameras (the project is in my GitHub repositories), using mediamtx to send 2 x h264 streams to Frigate (with a Google Coral). Double-Take then hooks into CompreFace (with another Google Coral), so recognised "objects/people" come back to Home Assistant. This part is working quite well; I'm just making some minor adjustments.

From this point onwards I am improvising and testing options. Right now I am using Node-RED, which listens to Home Assistant events and makes an HTTP call to Rhasspy to do TTS, and this is working quite well. Unfortunately, Rhasspy is not really being developed anymore, and its "replacement" is wyoming-satellite, which is not working great yet. So I think I will wait a bit until (hopefully) wyoming-satellite becomes a mature product. In the meantime I will play a bit with chat bots.
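For reference, the HTTP call the Node-RED flow makes can be sketched in Python against Rhasspy's REST API (POST /api/text-to-speech with the sentence as the plain-text body). The host and port below are assumptions; adjust them for your Rhasspy instance.

```python
# Hedged sketch: asking Rhasspy to speak a sentence via its REST API.
# The base URL is an assumption; Rhasspy's default HTTP port is 12101.
from urllib import request

RHASSPY_URL = "http://rhasspy.local:12101"  # assumed address

def build_tts_request(text: str, base_url: str = RHASSPY_URL) -> request.Request:
    """Build the POST request that asks Rhasspy to speak `text` aloud."""
    return request.Request(
        f"{base_url}/api/text-to-speech",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

# Actually sending it (commented out so the sketch does not need a live server):
# with request.urlopen(build_tts_request("Welcome home!")) as resp:
#     print(resp.status)

req = build_tts_request("Welcome home!")
print(req.method, req.full_url)
```

A Home Assistant automation (or Node-RED node) triggered by the facial-recognition event would fire this request with a personalised greeting.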

lunamidori5 commented 8 months ago

> Thank you guys for your quick help. I can train based on @Anto79-ops example. It will take about 40 hours, not sure really, as time to finish fluctuates.
>
> I will try to apply the patches provided from @acon96
>
> My goal is kind of a simple, to try different models that can effectively drive home assistant and they are somehow not to chatty. I know a lot to ask :) But this is my side project now. I have installed facial recognition at my home, so that I am greeted when I come home. Now I want to connect chat bot with those automations so I can have a small chat when I come home :)

There are a lot of things wrong with that command example; it will fail at the end. I recommend doing a code review before training.

RobertLukan commented 8 months ago

Ok understood. I will find another way. Thank you for your help.

acon96 commented 8 months ago

> Thank you guys for your quick help. I can train based on @Anto79-ops example. It will take about 40 hours, not sure really, as time to finish fluctuates. I will try to apply the patches provided from @acon96 My goal is kind of a simple, to try different models that can effectively drive home assistant and they are somehow not to chatty. I know a lot to ask :) But this is my side project now. I have installed facial recognition at my home, so that I am greeted when I come home. Now I want to connect chat bot with those automations so I can have a small chat when I come home :)

> There are alot of things wrong with that command example, it will fail at the end. I recommend doing a code review before training.

If you are attempting to train a model that is not Phi-1.5 or Phi-2, then you will need very different settings for the training run. Also, I was only using the custom script because Phi was not supported by any of the existing fine-tuning scripts when I started this project.

You are probably better off taking the dataset and using a script such as OpenAccess-AI-Collective/axolotl for fine tuning more popular model architectures.

acon96 commented 8 months ago

The README is updated with the fixed training invocation.