bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!
Apache License 2.0

How to save and load custom finetune #119

Open LazerJesus opened 1 year ago

LazerJesus commented 1 year ago

I am trying to further finetune Starchat-Beta, save my progress, load my progress, and continue training. But whatever I do, it doesn't come together: whenever I load my progress and continue training, my loss resets to its initial value (3.xx in my case). I'll run you through my code and then the problem.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(BASEPATH)
model = AutoModelForCausalLM.from_pretrained(
    "/notebooks/starbaseplus",
    ...
)
# I get both the tokenizer and the foundation model from the starbaseplus repo (which I have locally).

from peft import LoraConfig, get_peft_model

peftconfig = LoraConfig(
    "/notebooks/starchat-beta",
    base_model_name_or_path="/notebooks/starbaseplus",
    ...
)
model = get_peft_model(model, peftconfig)
# All Gucci so far, the model and the LoRA fine-tune are loaded from the starchat-beta repo (also local).

# important for later:
print_trainable_parameters(model)
# trainable params: 306 Million || all params: 15 Billion || trainable: 1.971%
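(For reference, `print_trainable_parameters` is the usual helper from the PEFT examples; a minimal sketch of how it's commonly defined, not necessarily the exact code here:)

```python
def print_trainable_parameters(model):
    # Count parameters that will receive gradients (the LoRA adapters)
    # versus all parameters in the wrapped model.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(
        f"trainable params: {trainable} || all params: {total} || "
        f"trainable: {100 * trainable / total:.3f}%"
    )
```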

trainer = Trainer(
    model=model,
    ...
)
trainer.train()
# I train and the loss drops from 3.xx to 1.xx.

# Now, either I follow the Hugging Face docs:
model.save_pretrained("./huggingface_model") 
# -> saves /notebooks/huggingface_model/adapter_model.bin (16 MB).

# or an alternative I found on SO:
trainer.save_model("./torch_model") 
# -> saves /notebooks/torch_model/pytorch_model.bin (60 GB).
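The size difference between the two saves is expected: `save_pretrained` on a PEFT model writes only the LoRA adapter factors, while `trainer.save_model` here serialized the entire state dict of the 15B-parameter model. A rough back-of-envelope sketch for a single weight matrix (the dimensions below are illustrative, not StarCoder's actual config):

```python
# One full d_out x d_in weight matrix vs. its rank-r LoRA factors
# A (r x d_in) and B (d_out x r). Dimensions are illustrative only.
d_in, d_out, r = 6144, 6144, 16

full_params = d_out * d_in        # parameters in the frozen base matrix
lora_params = r * (d_in + d_out)  # parameters actually trained and saved

print(f"full: {full_params}, lora: {lora_params}, "
      f"ratio: {lora_params / full_params:.4%}")
```

At these dimensions the adapters are 1/192 of the full matrix, which is why the adapter file is megabytes while the full dump is tens of gigabytes.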

I have two alternatives saved to disk. Let's restart and try each of these approaches.

First, the Hugging Face docs approach. I now have three sets of weights:

  1. the foundation model - starbaseplus
  2. the chat finetune - starchat-beta
  3. the 16mb saved bin - adapter_model.bin

But I only have two opportunities to load weights.

  1. AutoModelForCausalLM.from_pretrained
  2. either get_peft_model or PeftModel.from_pretrained

Neither works; training restarts at a loss of 3.x.

Second approach: load the 60 GB model instead of the old starchat-beta repo model: get_peft_model("/notebooks/torch_model/pytorch_model.bin", peftconfig)

This doesn't work either: print_trainable_parameters(model) drops to trainable: 0.02%, and training restarts at a loss of 3.x.

ArmelRandy commented 1 year ago

Hi. I understand that you want to resume your training from your PEFT checkpoint. You may want to look at this issue for something similar. However, recent versions of transformers (>= 4.31.1) support resuming training from a PEFT checkpoint (check this).
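For reference, a Trainer checkpoint directory for a PEFT-wrapped model (with save_strategy='steps') looks roughly like this — the adapter weights plus the optimizer/scheduler/trainer state needed for a true resume (file names may vary across versions; sketch only):

```
checkpoint-500/
├── adapter_config.json
├── adapter_model.bin
├── optimizer.pt
├── scheduler.pt
├── trainer_state.json
└── training_args.bin
```

You would then pass that directory to trainer.train(resume_from_checkpoint=...) rather than reloading the weights by hand.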

LazerJesus commented 1 year ago

Hi @ArmelRandy This also isn't clear.

There are four different ways to save a model:

  1. model.save_pretrained(PATH)
  2. torch.save({'model_state_dict': model.state_dict()})
  3. trainer.save_model(PATH)
  4. TrainingArguments(save_strategy='steps')

Which one can I use to store the PeftModelForCausalLM(AutoModelForCausalLM()), and how do I load it again?

(I like the TrainingArguments route least because I can't call it directly.) (And just running python finetune.py isn't an option.)

yiping-jia commented 1 year ago

Try replacing model = get_peft_model(model, lora_config) with

from peft import PeftModel
model = PeftModel.from_pretrained(model, <YOUR CHECKPOINT PATH>, is_trainable=True)

Does it work?