amosproj / amos2024ss08-cloud-native-llm

MIT License

Study and Research on LLM Hyper-parameters #92

Closed grayJiaaoLi closed 2 days ago

grayJiaaoLi commented 2 weeks ago

User story

  1. As an ML engineer
  2. I want/need to study and research LLM hyperparameters
  3. So that we can optimise the LLM during fine-tuning

Acceptance criteria

Definition of done (DoD)

DoD general criteria

anosh-ar commented 1 week ago

Hyperparameters:

source: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10

Techniques to avoid overfitting:

Early stopping:

source: https://towardsdatascience.com/the-million-dollar-question-when-to-stop-training-deep-learning-models-fa9b488ac04d
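The patience-based stopping rule from the article above can be sketched in a few lines of plain Python (the validation-loss values below are made-up numbers for illustration):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch after
    which the validation loss has not improved for `patience` epochs in
    a row (or the last epoch if that never happens)."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop here; best checkpoint was saved earlier
    return len(val_losses) - 1

# Hypothetical curve: the loss improves, then plateaus and starts to overfit.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]
print(early_stop_epoch(losses, patience=3))  # stops at epoch 5
```

In Hugging Face `transformers` the same behaviour is available out of the box via `EarlyStoppingCallback`.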

Weight decay (WD): A regularization technique that penalizes large weights to avoid overfitting. The larger the WD, the stronger the penalty: the model learns more slowly and overfits less.
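As a one-step sketch of what weight decay does to an update (simplified to plain SGD, where the decay term `wd * w` simply shrinks each weight toward zero; all numbers here are made up):

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.1, wd=0.01):
    """One SGD step with weight decay: each weight takes a gradient step
    and is additionally shrunk toward zero by lr * wd * w."""
    return [w - lr * (g + wd * w) for w, g in zip(weights, grads)]

w = [1.0, -2.0]
g = [0.5, 0.0]
# The second weight has zero gradient but still shrinks toward zero.
print(sgd_step_with_weight_decay(w, g, lr=0.1, wd=0.1))
```

In AdamW (the usual optimizer for LLM fine-tuning) the decay is applied decoupled from the adaptive gradient step, but the shrinking effect is the same idea.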

Gradient Clipping:

This technique is used to prevent the gradients from becoming too large during the training process, which can lead to unstable training and the "exploding gradient" problem. By capping the gradients to a maximum norm, gradient clipping ensures that the updates to the model's weights are within a reasonable range. Range: [0.1, 0.5]
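A minimal sketch of clipping by global norm (the same idea as `torch.nn.utils.clip_grad_norm_`; the gradient values are made up):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale the gradient vector down so its global L2 norm is at most
    max_norm; gradients already within the limit are left unchanged."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads

# A gradient of norm 5.0 gets rescaled to norm 0.5; direction is preserved.
print(clip_grad_norm([3.0, 4.0], max_norm=0.5))
```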

Learning rate scheduler:

The learning rate should not be constant during training. It should be higher at the beginning, so the optimizer can take large steps toward the global minimum, and lower in the final epochs, so it can settle into the minimum with small steps.

Recommended choice: Cosine scheduler

source: Prof. A. Maier lecture slides "Deep learning" in FAU
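The schedule above can be sketched as a small function: linear warmup followed by cosine decay, the same shape as Hugging Face's `get_cosine_schedule_with_warmup` (the step counts and base rate below are made-up example values):

```python
import math

def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# LR ramps up during warmup, peaks at base_lr, then decays smoothly to 0.
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr(step, total_steps=1000, warmup_steps=100, base_lr=2e-4))
```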

How to find the best-fitting hyperparameters for our model?

With Optuna!

There are libraries that systematically search for hyperparameters, such as Optuna and Ray Tune, so we don't need to try them manually. Optuna does this in a systematic manner using methods like grid search, Bayesian optimization, and the Tree-structured Parzen Estimator (TPE).

Recommended hyperparameter searcher library: Optuna

Recommended optimization method: TPE

A link to how to implement optuna in Huggingface: https://huggingface.co/docs/transformers/hpo_train

anosh-ar commented 1 week ago

LoRA Hyper parameters

Rank (r): The parameter r controls the dimensionality of the low-rank matrices used to update the weights. A larger r allows more precise updates, but recent papers report that when LoRA is applied to all layers (as is usually the case), any r greater than 8 gives similar results. Higher r also means a larger computational burden. Range: [8, 16, 32, 64]

Alpha: When the weight changes are added back into the original model weights, they are multiplied by a scaling factor calculated as alpha divided by rank. The influence of fine-tuning can therefore be controlled via this ratio: to increase the effect of fine-tuning, increase alpha. $$\text{influence} \propto \frac{\text{alpha}}{\text{rank}}$$ Range: [16, 32, 64, 128, 256]

Dropout: Nothing new, just like dropout in ordinary neural networks: it randomly zeroes a fraction of the LoRA activations during training, which acts as regularization and helps the model escape local minima. Range: [0, 0.1] Source: https://www.entrypointai.com/blog/lora-fine-tuning/
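The role of r and alpha can be sketched numerically with numpy (the hidden size and values below are made up; in the `peft` library these knobs correspond to the `r`, `lora_alpha`, and `lora_dropout` fields of `LoraConfig`):

```python
import numpy as np

d, r, alpha = 512, 8, 16           # hidden size, LoRA rank, scaling alpha
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))        # frozen pretrained weight
A = rng.normal(size=(r, d))        # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, initialized to 0

scaling = alpha / r                # the influence ∝ alpha / rank factor
W_adapted = W + scaling * (B @ A)  # effective weight during/after fine-tuning

# LoRA trains only 2*d*r parameters instead of the full d*d matrix.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Because B starts at zero, the adapted weight equals the pretrained weight at the beginning of training; the alpha/r factor then controls how strongly the learned low-rank update is applied.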