In one of the Studios I added help strings to the configs because it was requested. Users might explore the configs before running --help in the CLI, so it might make sense to add them.
I also think the order in which the arguments are presented in the config is important: the most relevant ones should be at the top, less frequently used ones at the bottom. We don't have such an ordering at the moment afaik.
Sharing these here in case we want to reuse some of these descriptions.
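If we want to try out the ordering idea, a small helper along these lines could do it. This is only a sketch: `PRIORITY` and `order_config` are hypothetical names I made up for illustration, not anything that exists in LitGPT, and the priority list itself is just a guess at what users edit most often.

```python
# Hypothetical priority list: keys users are most likely to edit come first.
PRIORITY = ["model_name", "checkpoint_dir", "out_dir", "data", "train", "eval"]


def order_config(config: dict, priority: list) -> dict:
    """Return a copy of `config` with priority keys first, the rest in their original order."""
    # Dicts preserve insertion order (Python 3.7+), so building the dict in
    # the desired order is enough to control how it is later dumped.
    head = {key: config[key] for key in priority if key in config}
    tail = {key: value for key, value in config.items() if key not in head}
    return {**head, **tail}


example = {"seed": 42, "devices": "auto", "out_dir": "out/", "model_name": "tiny-llama-1.1b"}
ordered = order_config(example, PRIORITY)
print(list(ordered))
```

Keys that are not in the priority list keep their relative order, so rarely used options naturally end up at the bottom.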
Pretraining
# The name of the model to pretrain
# Choose from names in litgpt/config.py
model_name: tiny-llama-1.1b
# Where to save checkpoints and logs
# If run in an MMT job, look for it in /teamspace/jobs/<job-name>/share
out_dir: out/pretrain/tiny-llama
# Path to a checkpoint dir to resume from in case training got interrupted
resume: false
# The name of the logger to send metrics to. Choose from 'tensorboard', 'csv', 'wandb'
logger_name: tensorboard
# Dataset arguments
data:
  class_path: TinyLlama
  init_args:
    data_path: /teamspace/s3_connections/tinyllama-template
train:
  # The length of the input sequences to train on, also known as "context size"
  max_seq_length: 2048
  # After how many optimization steps to save a checkpoint
  save_interval: 1000
  # After how many optimization steps to log metrics
  log_interval: 1
  # The batch size across all GPUs in a machine
  global_batch_size: 512
  # The per-device batch size for each forward/backward pass; gradients are
  # accumulated across micro-batches until global_batch_size is reached
  # Maximize this value based on the available GPU VRAM
  micro_batch_size: 1
  # How many epochs to train for. Mutually exclusive with max_tokens and max_steps
  epochs: null
  # How many tokens to train for (total across all GPUs). Mutually exclusive with epochs and max_steps
  max_tokens: 3000000000000
  # How many optimization steps to train for. Mutually exclusive with epochs and max_tokens
  max_steps: null
  # For how many optimization steps to warm up the learning rate
  lr_warmup_steps: 2000
  # The max learning rate after linear warmup
  learning_rate: 4e-4
  # The minimum learning rate after cosine decay
  min_lr: 4.0e-05
  # How much weight decay to use in AdamW
  weight_decay: 1e-1
  # Beta parameters for AdamW
  beta1: 0.9
  beta2: 0.95
  # Clip gradients to this norm
  max_norm: 1.0
  # Whether to tie embeddings (depends on the model)
  tie_embeddings: null
eval:
  # After how many optimization steps to run validation
  interval: 1000
  # How many tokens to generate during validation
  max_new_tokens: null
  # How many batches to run during validation
  max_iters: 100
# Path to the tokenizer dir that was used for preprocessing the dataset
tokenizer_dir: tokenizer/Llama-2-7b-hf
# How many devices/GPUs to use
devices: auto
# The random seed to initialize the weights of the model
seed: 42
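To make the lr_warmup_steps / learning_rate / min_lr comments above more concrete, here is a rough sketch of a linear-warmup plus cosine-decay schedule. It is my own illustration, not a copy of the LitGPT implementation: `lr_at_step` is a hypothetical helper, and the `max_steps` value of 100,000 is an assumption picked only to get readable numbers.

```python
import math


def lr_at_step(step: int, learning_rate: float, min_lr: float,
               warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to `learning_rate`, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return learning_rate * step / warmup_steps
    if step >= max_steps:
        # After the schedule ends, stay at the floor
        return min_lr
    # Fraction of the decay phase completed, in [0, 1)
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (learning_rate - min_lr)


# Values from the pretraining config above; max_steps is an assumed placeholder.
for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, lr_at_step(step, 4e-4, 4e-5, warmup_steps=2000, max_steps=100_000))
```

The learning rate reaches its peak exactly when warmup ends and then glides down to min_lr, which is why min_lr should be set below learning_rate.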
Finetuning
# The path to the base model checkpoint dir to load for finetuning
checkpoint_dir: checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
# Where to save checkpoints and logs
out_dir: out/
# The precision to use for finetuning. Possible choices: bf16-true, bf16-mixed, 32-true
precision: bf16-true
# If set, quantize the model with this algorithm.
# Possible choices: bnb.nf4, bnb.nf4-dq, bnb.fp4, bnb.fp4-dq, bnb.int8-training
quantize: null
# How many devices/GPUs to use
devices: 1
# The LoRA hyperparameters
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_query: true
lora_key: false
lora_value: true
lora_projection: false
lora_mlp: false
lora_head: false
# The name of the logger to send metrics to. Choose from 'tensorboard', 'csv', 'wandb'
logger_name: tensorboard
# Dataset arguments
data:
  class_path: litgpt.data.Alpaca2k
  init_args:
    # Whether to mask the prompt tokens so that they are excluded from the loss
    mask_prompt: false
    # The prompt style to use. See litgpt/prompts.py for possible choices.
    prompt_style: alpaca
    # The seed to use for creating the train/val splits and shuffling the data.
    seed: 42
    # The number of workers to use per GPU for dataloading
    num_workers: 4
    # Where to download the data
    download_dir: data/alpaca2k
train:
  # The length of the input sequences to train on, also known as "context size"
  # This depends on how long the sequences in your finetuning dataset are, and
  # whether you want to truncate to save memory
  max_seq_length: 512
  # After how many optimization steps to save a checkpoint
  save_interval: 800
  # After how many optimization steps to log metrics
  log_interval: 1
  # The batch size across all GPUs in a machine
  global_batch_size: 8
  # The per-device batch size for each forward/backward pass; gradients are
  # accumulated across micro-batches until global_batch_size is reached
  # Maximize this value based on the available GPU VRAM
  micro_batch_size: 8
  # How many epochs to train for. Mutually exclusive with max_tokens and max_steps
  epochs: 4
  # How many tokens to train for (total across all GPUs). Mutually exclusive with epochs and max_steps
  max_tokens: null
  # How many optimization steps to train for. Mutually exclusive with epochs and max_tokens
  max_steps: null
  # For how many optimization steps to warm up the learning rate
  lr_warmup_steps: 10
  # The max learning rate after linear warmup
  learning_rate: 0.0002
  # The minimum learning rate after cosine decay
  min_lr: 6.0e-05
  # How much weight decay to use in AdamW
  weight_decay: 0.0
  # Beta parameters for AdamW
  beta1: 0.9
  beta2: 0.95
  # Clip gradients to this norm
  max_norm: null
  # Whether to tie embeddings (depends on the model)
  tie_embeddings: null
eval:
  # After how many optimization steps to run validation
  interval: 100
  # How many tokens to generate during validation
  max_new_tokens: 100
  # How many batches to run during validation
  max_iters: 100
# The random seed to initialize the weights of the model
seed: 1337
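Since the relationship between global_batch_size, micro_batch_size and devices tends to confuse people, here is a small sketch of the arithmetic as I understand it. `accumulation_steps` is a hypothetical helper written for this post, and the 8-GPU machine in the second example is an assumption (the pretraining config uses devices: auto).

```python
def accumulation_steps(global_batch_size: int, micro_batch_size: int, devices: int) -> int:
    """Number of micro-batches each device processes per optimizer step."""
    if global_batch_size % devices != 0:
        raise ValueError("global_batch_size must be divisible by the number of devices")
    # Share of the global batch that each device is responsible for
    per_device_batch = global_batch_size // devices
    if per_device_batch % micro_batch_size != 0:
        raise ValueError("per-device batch size must be a multiple of micro_batch_size")
    return per_device_batch // micro_batch_size


# Finetuning config above: the whole batch fits in one pass, so no accumulation.
print(accumulation_steps(global_batch_size=8, micro_batch_size=8, devices=1))

# Pretraining config, assuming a machine with 8 GPUs.
print(accumulation_steps(global_batch_size=512, micro_batch_size=1, devices=8))
```

A larger micro_batch_size means fewer accumulation steps per optimizer step, which is why it pays to maximize it as far as GPU memory allows.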