Add help strings for configs and CLI

In one of the Studios, I added help strings to the configs because it was requested. Users might explore the configs before running --help in the CLI, so it might make sense to add them there as well. I also think the order in which the arguments appear in the config matters: the most relevant ones should be at the top and the less frequently used ones at the bottom. As far as I know, we don't have such an ordering at the moment.

Sharing these here in case we want to reuse some of these descriptions.
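
One way the comment-style descriptions below could be reused (for example to feed CLI help text) is to pull them out of a config programmatically. This is only a minimal sketch: `extract_help` is a hypothetical helper, not part of litgpt, and it assumes simple `key: value` lines with space indentation rather than full YAML.

```python
def extract_help(config_text: str) -> dict[str, str]:
    """Map each dotted config key to the '#' comment lines directly above it."""
    help_by_key: dict[str, str] = {}
    pending: list[str] = []  # comment lines seen since the last key
    parents: list[tuple[int, str]] = []  # (indent, key) stack for nested sections

    for raw_line in config_text.splitlines():
        stripped = raw_line.strip()
        if not stripped:
            pending = []  # a blank line detaches comments from the next key
            continue
        if stripped.startswith("#"):
            pending.append(stripped.lstrip("#").strip())
            continue
        key, sep, _value = stripped.partition(":")
        if not sep:
            continue  # not a 'key: value' line; ignored in this sketch
        key = key.strip()
        indent = len(raw_line) - len(raw_line.lstrip(" "))
        # Drop parents that are not shallower than the current key.
        while parents and parents[-1][0] >= indent:
            parents.pop()
        dotted = ".".join([name for _, name in parents] + [key])
        if pending:
            help_by_key[dotted] = " ".join(pending)
            pending = []
        parents.append((indent, key))
    return help_by_key


# Small excerpt in the same style as the configs in this issue.
SAMPLE = """\
# The name of the model to pretrain
# Choose from names in litgpt/config.py
model_name: tiny-llama-1.1b
train:
  # After how many optimization steps to save a checkpoint
  save_interval: 1000
"""

result = extract_help(SAMPLE)
for dotted_key, help_text in result.items():
    print(f"{dotted_key}: {help_text}")
```

Keys without a comment directly above them (like `train` here) are simply left out of the mapping, which would also make it easy to spot undocumented arguments.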

Pretraining

# The name of the model to pretrain
# Choose from names in litgpt/config.py
model_name: tiny-llama-1.1b
# Where to save checkpoints and logs
# If run in an MMT job, look for it in /teamspace/jobs/<job-name>/share
out_dir: out/pretrain/tiny-llama
# Path to a checkpoint dir to resume from in case training got interrupted,
# or true to resume from the latest checkpoint in out_dir
resume: false
# The name of the logger to send metrics to. Choose from 'tensorboard', 'csv', 'wandb'
logger_name: tensorboard
# Dataset arguments
data:
  class_path: TinyLlama
  init_args:
    data_path: /teamspace/s3_connections/tinyllama-template
train:
  # The length of the input sequences to train on, also known as "context size"
  max_seq_length: 2048
  # After how many optimization steps to save a checkpoint
  save_interval: 1000
  # After how many optimization steps to log metrics
  log_interval: 1
  # The batch size across all GPUs in a machine
  global_batch_size: 512
  # The batch size per GPU for each forward/backward pass; gradients are
  # accumulated over micro-batches until global_batch_size is reached
  # Maximize this value based on the available GPU VRAM
  micro_batch_size: 1
  # How many epochs to train for. Mutually exclusive with max_tokens and max_steps
  epochs: null
  # How many tokens to train for (total across all GPUs). Mutually exclusive with epochs and max_steps
  max_tokens: 3000000000000
  # How many optimization steps to train for. Mutually exclusive with epochs and max_tokens
  max_steps: null
  # For how many optimization steps to warm up the learning rate
  lr_warmup_steps: 2000
  # The max learning rate after linear warmup
  learning_rate: 4e-4
  # The minimum learning rate after cosine decay
  min_lr: 4.0e-05
  # How much weight decay to use in AdamW
  weight_decay: 1e-1
  # Beta parameters for AdamW
  beta1: 0.9
  beta2: 0.95
  # Clip gradients to this norm
  max_norm: 1.0
  # Whether to tie embeddings (depends on the model)
  tie_embeddings: null
eval:
  # After how many optimization steps to run validation
  interval: 1000
  # How many tokens to generate during validation
  max_new_tokens: null
  # How many batches to run during validation
  max_iters: 100
# Path to the tokenizer dir that was used for preprocessing the dataset
tokenizer_dir: tokenizer/Llama-2-7b-hf
# How many devices/GPUs to use
devices: auto
# The random seed to initialize the weights of the model
seed: 42
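
To make the relationship between the batch-size and token-budget settings above concrete, here is a rough worked example. It assumes a single machine with 8 GPUs (the config says `devices: auto`, so the GPU count is my assumption) and `pretraining_budget` is a hypothetical helper for illustration, not a litgpt function.

```python
def pretraining_budget(
    global_batch_size: int,
    micro_batch_size: int,
    max_seq_length: int,
    max_tokens: int,
    num_devices: int,
) -> dict[str, int]:
    """Derive gradient accumulation steps and the number of optimizer steps."""
    if global_batch_size % num_devices != 0:
        raise ValueError("global_batch_size must be divisible by num_devices")
    per_device_batch = global_batch_size // num_devices
    if per_device_batch % micro_batch_size != 0:
        raise ValueError("per-device batch must be divisible by micro_batch_size")

    # Micro-batches each GPU processes before one optimizer step.
    accumulation_steps = per_device_batch // micro_batch_size
    # Tokens consumed across all GPUs per optimizer step.
    tokens_per_step = global_batch_size * max_seq_length
    # Whole optimizer steps that fit into the token budget.
    optimizer_steps = max_tokens // tokens_per_step
    return {
        "per_device_batch": per_device_batch,
        "accumulation_steps": accumulation_steps,
        "tokens_per_step": tokens_per_step,
        "optimizer_steps": optimizer_steps,
    }


# Values from the pretraining config above; num_devices=8 is an assumption.
budget = pretraining_budget(
    global_batch_size=512,
    micro_batch_size=1,
    max_seq_length=2048,
    max_tokens=3_000_000_000_000,
    num_devices=8,
)
print(budget)
```

Under these assumptions each GPU accumulates 64 micro-batches per optimizer step, one step consumes 1,048,576 tokens, and the 3T token budget corresponds to roughly 2.86M optimizer steps. Raising `micro_batch_size` lowers the number of accumulation steps without changing the effective batch size.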

Finetuning

# The path to the base model checkpoint dir to load for finetuning
checkpoint_dir: checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
# Where to save checkpoints and logs
out_dir: out/
# The precision to use for finetuning. Possible choices: bf16-true, bf16-mixed, 32-true
precision: bf16-true
# If set, quantize the model with this algorithm. 
# Possible choices: bnb.nf4, bnb.nf4-dq, bnb.fp4, bnb.fp4-dq, bnb.int8-training
quantize: null
# How many devices/GPUs to use
devices: 1
# The LoRA hyperparameters
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_query: true
lora_key: false
lora_value: true
lora_projection: false
lora_mlp: false
lora_head: false
# The name of the logger to send metrics to. Choose from 'tensorboard', 'csv', 'wandb'
logger_name: tensorboard
# Dataset arguments
data:
  class_path: litgpt.data.Alpaca2k
  init_args:
    # Whether to mask the prompt so it is excluded from the loss
    # (false means the prompt part is included in the optimization)
    mask_prompt: false
    # The prompt style to use. See litgpt/prompts.py for possible choices.
    prompt_style: alpaca
    # The seed to use for creating the train/val splits and shuffling the data.
    seed: 42
    # The number of workers to use per GPU for dataloading
    num_workers: 4
    # Where to download the data
    download_dir: data/alpaca2k
train:
  # The length of the input sequences to train on, also known as "context size"
  # This depends on how long the sequences in your finetuning dataset are, and
  # whether you want to truncate to save memory
  max_seq_length: 512
  # After how many optimization steps to save a checkpoint
  save_interval: 800
  # After how many optimization steps to log metrics
  log_interval: 1
  # The batch size across all GPUs in a machine
  global_batch_size: 8
  # The batch size per GPU for each forward/backward pass; gradients are
  # accumulated over micro-batches until global_batch_size is reached
  # Maximize this value based on the available GPU VRAM
  micro_batch_size: 8
  # How many epochs to train for. Mutually exclusive with max_tokens and max_steps
  epochs: 4
  # How many tokens to train for (total across all GPUs). Mutually exclusive with epochs and max_steps
  max_tokens: null
  # How many optimization steps to train for. Mutually exclusive with epochs and max_tokens
  max_steps: null
  # For how many optimization steps to warm up the learning rate
  lr_warmup_steps: 10
  # The max learning rate after linear warmup
  learning_rate: 0.0002
  # The minimum learning rate after cosine decay
  min_lr: 6.0e-05
  # How much weight decay to use in AdamW
  weight_decay: 0.0
  # Beta parameters for AdamW
  beta1: 0.9
  beta2: 0.95
  # Clip gradients to this norm
  max_norm: null
  # Whether to tie embeddings (depends on the model)
  tie_embeddings: null
eval:
  # After how many optimization steps to run validation
  interval: 100
  # How many tokens to generate during validation
  max_new_tokens: 100
  # How many batches to run during validation
  max_iters: 100
# The random seed to initialize the weights of the model
seed: 1337
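
The `lr_warmup_steps`, `learning_rate` and `min_lr` comments describe a linear warmup followed by a cosine decay. The sketch below shows what such a schedule could look like with the finetuning values above. `lr_at_step` is a hypothetical function written for illustration, not litgpt's scheduler, and the total of 1000 steps is an assumption since the config only sets `epochs`.

```python
import math


def lr_at_step(
    step: int,
    learning_rate: float,
    min_lr: float,
    warmup_steps: int,
    max_steps: int,
) -> float:
    """Linear warmup to learning_rate, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the max learning rate.
        return learning_rate * step / warmup_steps
    if step >= max_steps:
        # After the schedule ends, stay at the minimum learning rate.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (learning_rate - min_lr)


TOTAL_STEPS = 1000  # assumption: not given in the config

for step in (0, 5, 10, 505, 1000):
    lr = lr_at_step(
        step,
        learning_rate=2e-4,
        min_lr=6e-5,
        warmup_steps=10,
        max_steps=TOTAL_STEPS,
    )
    print(f"step {step:4d}: lr = {lr:.6f}")
```

With these numbers the learning rate starts at 0, reaches 2e-4 at step 10, passes the midpoint between 2e-4 and 6e-5 halfway through the decay, and ends at 6e-5.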

Lightning-AI / litgpt #1087