formermagic / formerbox


DataModule and Functional Trainer #9

Closed. mozharovsky closed this 4 years ago

mozharovsky commented 4 years ago

Quick Summary

This PR mainly improves the training pipeline.

Training Examples

Training Script

To run training with the script, you need to specify the following files:

A config file that contains the model and config class names, the config init parameters, and the tokenizer class with its init parameters.

model:
  name: transformers.RobertaForMaskedLM
  config: transformers.RobertaConfig
  params:
    hidden_size: 768
    num_hidden_layers: 2
    num_attention_heads: 8
    intermediate_size: 3072
    hidden_act: gelu
    max_position_embeddings: 514
    hidden_dropout_prob: 0.1
    attention_probs_dropout_prob: 0.1
    type_vocab_size: 2
    initializer_range: 0.02
    layer_norm_eps: 1.0e-12
    gradient_checkpointing: false

tokenizer:
  name: gitnetic.tasks.codebert.CodeBertTokenizerFast
  params:
    add_prefix_space: false
    trim_offsets: true
    lowercase: true

Then export the wandb API key and launch the trainer with the required arguments:

export WANDB_API_KEY=<KEY>

python -m gitnetic.tasks.base_transformers.base_trainer \
    --gpus 1 \
    --num_nodes 1 \
    --distributed_backend ddp \
    --max_steps 100000 \
    --config_path $model_config \
    --tokenizer_path $tokenizer_path \
    --train_data_prefix $train_prefix_path \
    --val_data_prefix $val_prefix_path \
    --num_workers 16 \
    --max_tokens 2048 \
    --warmup_steps 1000 \
    --learning_rate 5e-5 \
    --power 1.0 \
    --save_step_frequency 1000 \
    --save_dir $save_dir \
    --val_check_interval 5000 \
    --precision 16 \
    --progress_bar_refresh_rate 20 \
    --row_log_interval 20 \
    --wandb_project test_proj --wandb_name exp-aug-26-gpu \
    --seed 17 \
    --resume_from_checkpoint $save_dir/checkpoint_last.ckpt
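
For reference, here is a minimal sketch of how a config like the one above could be resolved, assuming the dotted class names are imported dynamically and instantiated with the listed params. This is not necessarily how base_trainer parses the file, and the config.yaml path is illustrative only.

import importlib

import yaml

def resolve_class(dotted_name: str) -> type:
    # "transformers.RobertaConfig" -> module "transformers", attribute "RobertaConfig"
    module_name, _, class_name = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

with open("config.yaml") as stream:  # hypothetical path to the config above
    config = yaml.safe_load(stream)

model_cls = resolve_class(config["model"]["name"])
config_cls = resolve_class(config["model"]["config"])
model = model_cls(config_cls(**config["model"]["params"]))

# the tokenizer section resolves the same way; real tokenizers usually also
# need vocab/merges files, so this part is illustrative only
tokenizer_cls = resolve_class(config["tokenizer"]["name"])
tokenizer = tokenizer_cls(**config["tokenizer"]["params"])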

Training with Code

Note that the code below is likely to be outdated.

You can run the training pipeline in code. Follow these steps:

  1. Load a tokenizer
  2. Create and configure a model (one of the available transformers models, e.g. RobertaForMaskedLM)
  3. Create a transformer datamodule
  4. Create a transformer module
  5. Create a trainer (TransformerTrainer) with the datamodule and module from steps 3 and 4
  6. Call the train method with args for the pytorch-lightning trainer

See the full list of args in the pytorch-lightning docs

from transformers import RobertaForMaskedLM, RobertaConfig, RobertaTokenizer
from gitnetic.tasks.base_transformers import (
    TrainingParams, 
    TransformerDataModule, 
    TransformerModule,
    TransformerTrainer
)

# step 1
tokenizer = RobertaTokenizer.from_pretrained(<PATH>)

# step 2
model_config = RobertaConfig(
        vocab_size=tokenizer.vocab_size,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
)

model = RobertaForMaskedLM(model_config)

# step 3
datamodule = TransformerDataModule(
    tokenizer=tokenizer,
    train_data_prefix=<PATH>,
    val_data_prefix=<PATH>,
    max_tokens=<VAL>,
    batch_size=<VAL>,
    num_workers=<VAL>,
)

# step 4
training_params = TrainingParams(
    weight_decay=0.01,
    warmup_steps=4_000,
    learning_rate=5e-4,
    power=1.0,
)

module = TransformerModule(model, tokenizer, training_params)

# step 5
trainer = TransformerTrainer(module, datamodule)
trainer.train({
    "gpus": "1",
    "num_nodes": 1,
    "distributed_backend": "ddp",
    "max_steps": 100_000,
    "save_step_frequency": 1_000,
    "save_dir": <PATH>,
    "val_check_interval": 5_000,
    "precision": 16,
    "progress_bar_refresh_rate": 20,
    "row_log_interval": 20,
    "seed": 17,
})

Further Improvements

Dataset Iterators

We should consider adding new dataset iterators for handling large datasets.

Consider looking at fairseq iterators and infinibatch iterators.
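
As a starting point, here is a rough sketch (the function name and signature are hypothetical, not fairseq's actual API) of the kind of iterator we might add: it groups sample indices into batches whose total token count stays under a max_tokens budget, in the spirit of fairseq's length-based batching.

from typing import Iterator, List, Sequence

def batch_by_tokens(lengths: Sequence[int], max_tokens: int) -> Iterator[List[int]]:
    # yields lists of sample indices whose summed lengths fit the token budget
    batch: List[int] = []
    batch_tokens = 0
    for index, length in enumerate(lengths):
        if batch and batch_tokens + length > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(index)
        batch_tokens += length
    if batch:
        yield batch

# e.g. lengths of tokenized samples -> [[0, 1], [2, 3], [4]] under a 2048-token budget
batches = list(batch_by_tokens([512, 640, 900, 1024, 256], max_tokens=2048))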

Save Checkpoint Callback

Our current checkpoint callback is not designed to support tensorflow I/O, nor does it support pytorch-lightning debugging. We should consider improving it. The callback also doesn't generalize the monitored metrics or their comparison operators (e.g. accuracy should be maximized whereas loss should be minimized).
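
A minimal sketch of the generalized direction (not the current callback): the monitored metric and the comparison mode are configurable, so "min" fits loss and "max" fits accuracy. It relies on pytorch-lightning's Callback hooks; the class name and checkpoint filename are assumptions.

import operator

from pytorch_lightning.callbacks import Callback

class SaveBestCheckpoint(Callback):
    def __init__(self, monitor: str = "val_loss", mode: str = "min") -> None:
        self.monitor = monitor
        self.is_better = operator.lt if mode == "min" else operator.gt
        self.best = float("inf") if mode == "min" else float("-inf")

    def on_validation_end(self, trainer, pl_module) -> None:
        # compare the latest value of the monitored metric against the best one so far
        current = trainer.callback_metrics.get(self.monitor)
        if current is not None and self.is_better(current, self.best):
            self.best = current
            trainer.save_checkpoint(f"checkpoint_best_{self.monitor}.ckpt")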

Optimization

We don't provide any configuration options for optimization. Letting users choose an optimizer and scheduler along with their hyperparameters would be a great feature.
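
For illustration, a hedged sketch of what that configuration could look like: the optimizer and scheduler classes plus their hyperparameters come from a config dict instead of being hard-coded. The dict layout and the build_optimization helper are assumptions, not an existing API.

import torch

def build_optimization(model: torch.nn.Module, config: dict):
    # resolve classes from torch.optim / torch.optim.lr_scheduler by name
    optimizer_cls = getattr(torch.optim, config["optimizer"]["name"])
    optimizer = optimizer_cls(model.parameters(), **config["optimizer"]["params"])
    scheduler_cls = getattr(torch.optim.lr_scheduler, config["scheduler"]["name"])
    scheduler = scheduler_cls(optimizer, **config["scheduler"]["params"])
    return optimizer, scheduler

# e.g. AdamW with a 1000-step linear warmup via LambdaLR
optim_config = {
    "optimizer": {"name": "AdamW", "params": {"lr": 5e-5, "weight_decay": 0.01}},
    "scheduler": {"name": "LambdaLR", "params": {"lr_lambda": lambda step: min(1.0, step / 1000)}},
}
optimizer, scheduler = build_optimization(torch.nn.Linear(4, 4), optim_config)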

Modular Trainer

The current TransformerTrainer takes already built components (a module and a datamodule). This is flexible to some extent but can still be improved. Building other components is not unified, and every new setting requires copy-pasting make_ functions to set up and parse arguments and then build the functional components.

With this said, I would propose a unified component builder with dependency injection to provide building flexibility and type substitution across the system.
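
To make the proposal concrete, here is a hypothetical sketch (the Registry name and decorator are illustrative, not an existing API): components register themselves under a name, and a trainer can build them from config, so new modules and datamodules can be substituted without copy-pasting make_ functions.

from typing import Any, Callable, Dict, Type

class Registry:
    _components: Dict[str, Type[Any]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type[Any]], Type[Any]]:
        # class decorator that records the component under the given name
        def wrapper(component_cls: Type[Any]) -> Type[Any]:
            cls._components[name] = component_cls
            return component_cls
        return wrapper

    @classmethod
    def build(cls, name: str, **kwargs: Any) -> Any:
        # instantiate a registered component with config-provided kwargs
        return cls._components[name](**kwargs)

@Registry.register("masked_lm_module")
class MaskedLMModule:
    def __init__(self, learning_rate: float) -> None:
        self.learning_rate = learning_rate

module = Registry.build("masked_lm_module", learning_rate=5e-5)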

mozharovsky commented 4 years ago

Merge plan: