bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124
MIT License

Fine-tuning configuration of Octocoder #19

Closed · pshlego closed this issue 1 year ago

pshlego commented 1 year ago

Hello @Muennighoff,

I have a question about the fine-tuning configuration for StarCoder with LoRA that you shared. When I fine-tune StarCoder with LoRA using that configuration, the loss doesn't seem to converge.

For your information, my training dataset consists of roughly 6,300 text-to-SQL pairs, and the fine-tuning was done on 8 GPUs. Here's the configuration I used:

#transformers.TrainingArguments
{
  "max_steps": 1000,
  "learning_rate": 5e-4,
  "warmup_steps": 5,
  "gradient_accumulation_steps": 4,
  "per_device_train_batch_size": 1,
  "lr_scheduler_type": "cosine",
  "weight_decay": 0.05,
}

#peft.LoraConfig
{
  "task_type": "CAUSAL_LM",
  "target_modules": ["c_proj", "c_attn", "q_attn"],
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "bias": "none",
}
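
For completeness, here is a minimal sketch of how I plug these two configs into the training loop (the base checkpoint, output_dir, and train_dataset are placeholders for my actual setup):

# Sketch: wiring the two configs above together with peft + transformers.
# "bigcode/starcoderbase", output_dir, and train_dataset are placeholders.
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase")

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["c_proj", "c_attn", "q_attn"],
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./checkpoints",  # placeholder
    max_steps=1000,
    learning_rate=5e-4,
    warmup_steps=5,
    gradient_accumulation_steps=4,
    per_device_train_batch_size=1,
    lr_scheduler_type="cosine",
    weight_decay=0.05,
)

# train_dataset: my tokenized/packed text-to-SQL dataset (placeholder)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()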

Is there any difference between the configuration you used when fine-tuning StarCoder for OctoCoder and the configuration above?

I'd greatly appreciate your insights or suggestions on what might be causing this issue.

Thank you!

ArmelRandy commented 1 year ago

Hi @pshlego. Your configuration is very similar to what we used for OctoCoder. The LoRA arguments are exactly the same; however, we used fewer training steps. I'd advise you to try different hyperparameters, since the best values depend on the dataset you fine-tune on. For quick experiments, you can use a smaller model (starcoderbase-1b) to search for working parameters. You should also review your pipeline: check that the dataset is formatted correctly and verify its size after preprocessing (packing tends to reduce the dataset size quite a lot), as in the sketch below.
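
For example, a quick way to estimate how much data survives packing is to count the tokens in your dataset and divide by the sequence length. A rough sketch (the tokenizer checkpoint, the dataset variable, and the "text" column are assumptions about your setup):

# Rough estimate of the packed dataset size: total tokens / sequence length.
# "bigcode/starcoderbase", `dataset`, and the "text" column are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase")
seq_length = 2048

total_tokens = sum(len(tokenizer(sample["text"])["input_ids"]) for sample in dataset)
n_packed_sequences = total_tokens // seq_length
print(f"{total_tokens} tokens -> ~{n_packed_sequences} packed sequences of length {seq_length}")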

pshlego commented 1 year ago

Thank you for your answer!

Can you please let me know the number of training data samples and training steps you used for fine-tuning?

ArmelRandy commented 1 year ago

For the training, we set max_steps=1000 but evaluated intermediate checkpoints to see how they performed. As far as I know, you don't need that many steps: we already had strong results after one epoch, which was around 50 steps for us (8 GPUs, gradient_accumulation_steps=4, per_device_train_batch_size=1, sequence_length=2048). You can set your number of training steps to match one epoch: compute the number of tokens in your processed dataset and divide it by the sequence length times the effective batch size to get the number of steps needed to see your whole dataset once.
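
To make the calculation concrete, here is a small sketch (the total token count is a placeholder; plug in whatever you measure for your own processed dataset):

# Steps needed to see the whole (packed) dataset once.
n_gpus = 8
per_device_batch_size = 1
gradient_accumulation_steps = 4
seq_length = 2048

# Tokens consumed per optimizer step = effective batch size * sequence length.
tokens_per_step = n_gpus * per_device_batch_size * gradient_accumulation_steps * seq_length  # 65,536

total_tokens = 3_300_000  # placeholder: token count of your processed dataset
steps_per_epoch = total_tokens // tokens_per_step  # ~50 with these illustrative numbers
print(steps_per_epoch)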

pshlego commented 1 year ago

Thank you for the detailed answer! It was very helpful.