amosproj / amos2024ss08-cloud-native-llm

MIT License

Study and Research on LLM Hyper-parameters #92

Closed grayJiaaoLi closed 2 days ago

grayJiaaoLi commented 2 weeks ago

User story

  1. As an ML engineer
  2. I want/need to study and research LLM hyperparameters
  3. So that we can optimise the LLM during fine-tuning

Acceptance criteria

Definition of done (DoD)

DoD general criteria

anosh-ar commented 1 week ago

Hyperparameters:

source: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10

Techniques to avoid overfitting:

Early stopping:

source: https://towardsdatascience.com/the-million-dollar-question-when-to-stop-training-deep-learning-models-fa9b488ac04d
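The patience-based stopping rule from the article above can be sketched in a few lines of plain Python (the validation-loss values below are made-up numbers for illustration):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch after
    which the validation loss has not improved for `patience` epochs in
    a row (or the last epoch if that never happens)."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop here; best checkpoint was saved earlier
    return len(val_losses) - 1

# Hypothetical curve: the loss improves, then plateaus and starts to overfit.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]
print(early_stop_epoch(losses, patience=3))  # stops at epoch 5
```

In Hugging Face `transformers` the same behaviour is available out of the box via `EarlyStoppingCallback`.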

Weight decay (WD): A regularization technique that penalizes large weights to avoid overfitting. The larger the WD, the stronger the penalty: the model learns more slowly and overfits less.
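As a one-step sketch of what weight decay does to an update (simplified to plain SGD, where the decay term `wd * w` simply shrinks each weight toward zero; all numbers here are made up):

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.1, wd=0.01):
    """One SGD step with weight decay: each weight takes a gradient step
    and is additionally shrunk toward zero by lr * wd * w."""
    return [w - lr * (g + wd * w) for w, g in zip(weights, grads)]

w = [1.0, -2.0]
g = [0.5, 0.0]
# The second weight has zero gradient but still shrinks toward zero.
print(sgd_step_with_weight_decay(w, g, lr=0.1, wd=0.1))
```

In AdamW (the usual optimizer for LLM fine-tuning) the decay is applied decoupled from the adaptive gradient step, but the shrinking effect is the same idea.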

Gradient Clipping:

This technique is used to prevent the gradients from becoming too large during the training process, which can lead to unstable training and the "exploding gradient" problem. By capping the gradients to a maximum norm, gradient clipping ensures that the updates to the model's weights are within a reasonable range. Range: [0.1, 0.5]
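A minimal sketch of clipping by global norm (the same idea as `torch.nn.utils.clip_grad_norm_`; the gradient values are made up):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale the gradient vector down so its global L2 norm is at most
    max_norm; gradients already within the limit are left unchanged."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads

# A gradient of norm 5.0 gets rescaled to norm 0.5; direction is preserved.
print(clip_grad_norm([3.0, 4.0], max_norm=0.5))
```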

Learning rate scheduler:

The learning rate should not be constant during training. It should be higher at the beginning, so the optimizer can take large steps toward the global minimum, and lower in the final epochs, so it can settle into the minimum with small steps.

Recommended choice: Cosine scheduler

source: Prof. A. Maier lecture slides "Deep learning" in FAU
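The schedule above can be sketched as a small function: linear warmup followed by cosine decay, the same shape as Hugging Face's `get_cosine_schedule_with_warmup` (the step counts and base rate below are made-up example values):

```python
import math

def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# LR ramps up during warmup, peaks at base_lr, then decays smoothly to 0.
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr(step, total_steps=1000, warmup_steps=100, base_lr=2e-4))
```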

How to find the best-fitting hyperparameters for our model?

With Optuna!

There are libraries that systematically search for hyperparameters, such as Optuna and Ray Tune, so we don't need to try them manually. Optuna does this in a systematic manner using methods like grid search, Bayesian optimization, and the Tree-structured Parzen Estimator (TPE).

Recommended hyperparameter searcher library: Optuna

Recommended optimization method: TPE

A link to how to implement optuna in Huggingface: https://huggingface.co/docs/transformers/hpo_train

anosh-ar commented 1 week ago

LoRA Hyper parameters

Rank (r): The parameter r controls the dimensionality of the low-rank matrices used to update the weights. A larger r allows more precise updates, but recent papers report that when LoRA is applied to all layers (as is usually the case), any r greater than 8 gives similar results. Higher r also means a larger computational burden. Range: [8, 16, 32, 64]

Alpha: When the weight changes are added back into the original model weights, they are multiplied by a scaling factor calculated as alpha divided by rank. The influence of fine-tuning can therefore be controlled via this ratio: to increase the effect of fine-tuning, increase alpha. $$\text{influence} \propto \frac{\text{alpha}}{\text{rank}}$$ Range: [16, 32, 64, 128, 256]

Dropout: Nothing new, just like dropout in ordinary neural networks: it randomly zeroes a fraction of the LoRA activations during training, which acts as regularization and helps the model escape local minima. Range: [0, 0.1] Source: https://www.entrypointai.com/blog/lora-fine-tuning/
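The role of r and alpha can be sketched numerically with numpy (the hidden size and values below are made up; in the `peft` library these knobs correspond to the `r`, `lora_alpha`, and `lora_dropout` fields of `LoraConfig`):

```python
import numpy as np

d, r, alpha = 512, 8, 16           # hidden size, LoRA rank, scaling alpha
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))        # frozen pretrained weight
A = rng.normal(size=(r, d))        # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, initialized to 0

scaling = alpha / r                # the influence ∝ alpha / rank factor
W_adapted = W + scaling * (B @ A)  # effective weight during/after fine-tuning

# LoRA trains only 2*d*r parameters instead of the full d*d matrix.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Because B starts at zero, the adapted weight equals the pretrained weight at the beginning of training; the alpha/r factor then controls how strongly the learned low-rank update is applied.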