Closed: Saf9933 closed this pull request 3 weeks ago.
Can we have a design doc for this PR? In the meantime, why do we need to change so much configuration in the training model?
I've added a design doc to the PR description to outline the changes. The configuration changes were necessary to enable hyperparameter tuning with Optuna: key training parameters such as learning rate, rank, alpha, and dropout are now optimized dynamically rather than fixed. The goal is to improve model performance by letting the tuning process find the optimal configuration for each training trial automatically, which adds flexibility to the training pipeline.
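To make that shift concrete, here is a minimal sketch of moving from fixed values to per-trial values; the config keys and search ranges are illustrative assumptions, not the repository's actual schema:

```python
import optuna

# Before: LoRA hyperparameters hard-coded in the training configuration (illustrative keys).
FIXED_CONFIG = {"rank": 8, "alpha": 16, "dropout": 0.05, "learning_rate": 3e-4}


def config_for_trial(trial: optuna.Trial) -> dict:
    # After: Optuna proposes the values for each trial instead of using constants.
    return {
        "rank": trial.suggest_categorical("rank", [4, 8, 16, 32]),
        "alpha": trial.suggest_categorical("alpha", [8, 16, 32, 64]),
        "dropout": trial.suggest_float("dropout", 0.0, 0.3),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
    }
```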
Can you just add a single new file, `mlora_train_optuna.py`, instead?
Hyperparameter Tuning for LoRA Training Model

Overview

This document describes the implementation of the `mlora_train_optuna.py` script for automated hyperparameter tuning using Optuna, which was previously missing. The goal is to optimize key hyperparameters such as rank, learning rate, alpha, and dropout for the LoRA model to improve performance.

Key Changes
- Implementation of `mlora_train_optuna.py`
- Training Configuration Updates
- Training Pipeline Adaptation
- Loss Tracking Implementation
- Flake8 Compliance
- Task Configuration Changes: added the `type` field in `TrainTaskConfig` for compatibility with existing assertions (see the sketch after this list).
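As a rough illustration of the task-configuration change, here is a minimal sketch; `TrainTaskConfig`'s real fields live in the mLoRA codebase, so everything below other than the `type` field mentioned above is an assumption:

```python
from dataclasses import dataclass


# Hypothetical stand-in for the project's TrainTaskConfig; only the `type`
# field reflects the change described above, the other fields are illustrative.
@dataclass
class TrainTaskConfig:
    type: str  # added so existing assertions on the task type still pass
    name: str
    rank: int
    alpha: int
    dropout: float
    learning_rate: float


def task_config_for_trial(trial_id: int, params: dict) -> TrainTaskConfig:
    # Build a per-trial training task whose hyperparameters come from Optuna.
    return TrainTaskConfig(
        type="train",
        name=f"lora_trial_{trial_id}",
        rank=params["rank"],
        alpha=params["alpha"],
        dropout=params["dropout"],
        learning_rate=params["learning_rate"],
    )
```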
Optuna Integration

The `objective()` function defines the hyperparameter space for rank, learning rate, alpha, and dropout, as sketched below.
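A minimal sketch of what `objective()` could look like, assuming the training entry point returns a final loss to minimize; `run_lora_training` is a placeholder rather than the actual function in `mlora_train_optuna.py`, and the search ranges are illustrative:

```python
import optuna


def run_lora_training(rank: int, alpha: int, dropout: float, learning_rate: float) -> float:
    """Placeholder for the real mLoRA training run.

    In the actual script this would launch training with the suggested
    hyperparameters and return the tracked final (or best) loss. Here it
    returns a dummy value so the sketch runs end to end.
    """
    return dropout + abs(learning_rate - 3e-4) + 1.0 / rank + 1.0 / alpha


def objective(trial: optuna.Trial) -> float:
    # Hyperparameter space for the four parameters named above.
    rank = trial.suggest_categorical("rank", [4, 8, 16, 32])
    alpha = trial.suggest_categorical("alpha", [8, 16, 32, 64])
    dropout = trial.suggest_float("dropout", 0.0, 0.3)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)

    # Optuna minimizes the value returned here, i.e. the training loss.
    return run_lora_training(rank, alpha, dropout, learning_rate)


if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)
    print("Best hyperparameters:", study.best_params)
```

In the full script, the loss tracking listed under Key Changes would supply the value that each trial returns to Optuna here.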
Potential Issues

Conclusion
Integrating Optuna for hyperparameter tuning aims to automate the optimization process, improve training efficiency, and enhance model performance, reducing manual tuning efforts.