formermagic / formerbox


DataModule and Functional Trainer #9

Closed. mozharovsky closed this 4 years ago

mozharovsky commented 4 years ago

Quick Summary

This PR mainly improves the training pipeline.

Training Examples

Training Script

To run training with the script, you need to specify the following files:

A config file that contains the model and config class names, the config init parameters, and the tokenizer class with its init parameters.

model:
  name: transformers.RobertaForMaskedLM
  config: transformers.RobertaConfig
  params:
    hidden_size: 768
    num_hidden_layers: 2
    num_attention_heads: 8
    intermediate_size: 3072
    hidden_act: gelu
    max_position_embeddings: 514
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    type_vocab_size: 2
    initializer_range: 0.02
    layer_norm_eps: 1.0e-12
    gradient_checkpointing: false

tokenizer:
  name: gitnetic.tasks.codebert.CodeBertTokenizerFast
  params:
    add_prefix_space: false
    trim_offsets: true
    lowercase: true

Then export the wandb API key and launch the trainer with the required arguments:

export WANDB_API_KEY=<KEY>

python -m gitnetic.tasks.base_transformers.base_trainer \
    --gpus 1 \
    --num_nodes 1 \
    --distributed_backend ddp \
    --max_steps 100000 \
    --config_path $model_config \
    --tokenizer_path $tokenizer_path \
    --train_data_prefix $train_prefix_path \
    --val_data_prefix $val_prefix_path \
    --num_workers 16 \
    --max_tokens 2048 \
    --warmup_steps 1000 \
    --learning_rate 5e-5 \
    --power 1.0 \
    --save_step_frequency 1000 \
    --save_dir $save_dir \
    --val_check_interval 5000 \
    --precision 16 \
    --progress_bar_refresh_rate 20 \
    --row_log_interval 20 \
    --wandb_project test_proj --wandb_name exp-aug-26-gpu \
    --seed 17 \
    --resume_from_checkpoint $save_dir/checkpoint_last.ckpt
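
For reference, here is a minimal sketch of how a config like the one above could be resolved, assuming the dotted class names are imported dynamically and instantiated with the listed params. This is not necessarily how base_trainer parses the file, and the config.yaml path is illustrative only.

import importlib

import yaml

def resolve_class(dotted_name: str) -> type:
    # "transformers.RobertaConfig" -> module "transformers", attribute "RobertaConfig"
    module_name, _, class_name = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

with open("config.yaml") as stream:  # hypothetical path to the config above
    config = yaml.safe_load(stream)

model_cls = resolve_class(config["model"]["name"])
config_cls = resolve_class(config["model"]["config"])
model = model_cls(config_cls(**config["model"]["params"]))

# the tokenizer section resolves the same way; real tokenizers usually also
# need vocab/merges files, so this part is illustrative only
tokenizer_cls = resolve_class(config["tokenizer"]["name"])
tokenizer = tokenizer_cls(**config["tokenizer"]["params"])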

Training with Code

Note that the code below is likely to be outdated.

You can run the training pipeline in code. Follow these steps:

  1. Load a tokenizer
  2. Create and configure a model (one of the available transformers models, e.g. RobertaForMaskedLM)
  3. Create a transformer datamodule
  4. Create a transformer module
  5. Create a trainer (TransformerTrainer) with the datamodule and module from steps 3 and 4
  6. Call the train method with args for the pytorch-lightning trainer

See the full list of args in the pytorch-lightning docs

from transformers import RobertaForMaskedLM, RobertaConfig, RobertaTokenizer
from gitnetic.tasks.base_transformers import (
    TrainingParams, 
    TransformerDataModule, 
    TransformerModule,
    TransformerTrainer
)

# step 1
tokenizer = RobertaTokenizer.from_pretrained(<PATH>)

# step 2
model_config = RobertaConfig(
        vocab_size=tokenizer.vocab_size,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
)

model = RobertaForMaskedLM(model_config)

# step 3
datamodule = TransformerDataModule(
    tokenizer=tokenizer,
    train_data_prefix=<PATH>,
    val_data_prefix=<PATH>,
    max_tokens=<VAL>,
    batch_size=<VAL>,
    num_workers=<VAL>,
)

# step 4
training_params = TrainingParams(
    weight_decay=0.01,
    warmup_steps=4_000,
    learning_rate=5e-4,
    power=1.0,
)

module = TransformerModule(model, tokenizer, training_params)

# step 5
trainer = TransformerTrainer(module, datamodule)
trainer.train({
    "gpus": "1",
    "num_nodes": 1,
    "distributed_backend": "ddp",
    "max_steps": 100_000,
    "save_step_frequency": 1_000,
    "save_dir": <PATH>,
    "val_check_interval": 5_000,
    "precision": 16,
    "progress_bar_refresh_rate": 20,
    "row_log_interval": 20,
    "seed": 17,
})

Further Improvements

Dataset Iterators

We should consider adding new dataset iterators for handling large datasets.

Consider looking at fairseq iterators and infinibatch iterators.
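
As a starting point, here is a rough sketch (the function name and signature are hypothetical, not fairseq's actual API) of the kind of iterator we might add: it groups sample indices into batches whose total token count stays under a max_tokens budget, in the spirit of fairseq's length-based batching.

from typing import Iterator, List, Sequence

def batch_by_tokens(lengths: Sequence[int], max_tokens: int) -> Iterator[List[int]]:
    # yields lists of sample indices whose summed lengths fit the token budget
    batch: List[int] = []
    batch_tokens = 0
    for index, length in enumerate(lengths):
        if batch and batch_tokens + length > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(index)
        batch_tokens += length
    if batch:
        yield batch

# e.g. lengths of tokenized samples -> [[0, 1], [2, 3], [4]] under a 2048-token budget
batches = list(batch_by_tokens([512, 640, 900, 1024, 256], max_tokens=2048))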

Save Checkpoint Callback

Our current checkpoint callback is not designed to support tensorflow I/O, nor does it support pytorch-lightning debugging. We should consider improving it. The callback also doesn't generalize the monitored metrics or their comparison operators (e.g. accuracy should be maximized whereas loss should be minimized).
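
A minimal sketch of the generalized direction (not the current callback): the monitored metric and the comparison mode are configurable, so "min" fits loss and "max" fits accuracy. It relies on pytorch-lightning's Callback hooks; the class name and checkpoint filename are assumptions.

import operator

from pytorch_lightning.callbacks import Callback

class SaveBestCheckpoint(Callback):
    def __init__(self, monitor: str = "val_loss", mode: str = "min") -> None:
        self.monitor = monitor
        self.is_better = operator.lt if mode == "min" else operator.gt
        self.best = float("inf") if mode == "min" else float("-inf")

    def on_validation_end(self, trainer, pl_module) -> None:
        # compare the latest value of the monitored metric against the best one so far
        current = trainer.callback_metrics.get(self.monitor)
        if current is not None and self.is_better(current, self.best):
            self.best = current
            trainer.save_checkpoint(f"checkpoint_best_{self.monitor}.ckpt")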

Optimization

We don't provide any configuration options for optimization. Letting users choose an optimizer and scheduler along with their hyperparameters would be a great feature.
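
For illustration, a hedged sketch of what that configuration could look like: the optimizer and scheduler classes plus their hyperparameters come from a config dict instead of being hard-coded. The dict layout and the build_optimization helper are assumptions, not an existing API.

import torch

def build_optimization(model: torch.nn.Module, config: dict):
    # resolve classes from torch.optim / torch.optim.lr_scheduler by name
    optimizer_cls = getattr(torch.optim, config["optimizer"]["name"])
    optimizer = optimizer_cls(model.parameters(), **config["optimizer"]["params"])
    scheduler_cls = getattr(torch.optim.lr_scheduler, config["scheduler"]["name"])
    scheduler = scheduler_cls(optimizer, **config["scheduler"]["params"])
    return optimizer, scheduler

# e.g. AdamW with a 1000-step linear warmup via LambdaLR
optim_config = {
    "optimizer": {"name": "AdamW", "params": {"lr": 5e-5, "weight_decay": 0.01}},
    "scheduler": {"name": "LambdaLR", "params": {"lr_lambda": lambda step: min(1.0, step / 1000)}},
}
optimizer, scheduler = build_optimization(torch.nn.Linear(4, 4), optim_config)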

Modular Trainer

The current TransformerTrainer takes already built components (a module and a datamodule). This is flexible to some extent but can still be improved. Building other components is not unified, and every new setting requires copy-pasting make_ functions to set up and parse arguments and then build the functional components.

With this said, I would propose a unified component builder with dependency injection to provide building flexibility and type substitution across the system.
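
To make the proposal concrete, here is a hypothetical sketch (the Registry name and decorator are illustrative, not an existing API): components register themselves under a name, and a trainer can build them from config, so new modules and datamodules can be substituted without copy-pasting make_ functions.

from typing import Any, Callable, Dict, Type

class Registry:
    _components: Dict[str, Type[Any]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type[Any]], Type[Any]]:
        # class decorator that records the component under the given name
        def wrapper(component_cls: Type[Any]) -> Type[Any]:
            cls._components[name] = component_cls
            return component_cls
        return wrapper

    @classmethod
    def build(cls, name: str, **kwargs: Any) -> Any:
        # instantiate a registered component with config-provided kwargs
        return cls._components[name](**kwargs)

@Registry.register("masked_lm_module")
class MaskedLMModule:
    def __init__(self, learning_rate: float) -> None:
        self.learning_rate = learning_rate

module = Registry.build("masked_lm_module", learning_rate=5e-5)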

mozharovsky commented 4 years ago

Merge plan: