This PR contains a lot of changes. Normally, we'd split release development into smaller PRs, which makes new features, changes, and enhancements friendlier to review. That didn't happen this time; we'll improve on that in the next one. ;]
We focused on making the data processing and training pipelines modular to support a broader variety of tasks. We also added a CLI that supports user packages and plugins for simpler interaction with built-in and custom tasks.
Patch Notes
Common
Registrable components
From now on, you can create classes with a registry that supports name-based retrieval of registered types.
class Animal(Registrable):
    ...

# Registering Dog under the name "dog" makes it retrievable by that name.
@Animal.register("dog")
class Dog(Animal):
    ...
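Registered types can then be retrieved by name. A minimal sketch, assuming an AllenNLP-style by_name lookup (the actual method name in this library may differ):

dog_cls = Animal.by_name("dog")  # hypothetical lookup; returns the Dog class
dog = dog_cls()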
This pattern is used widely across the library to keep components modular.
Registrable subcommands
Subcommands are also registrable types and work with the brand-new DataclassArgumentParser.
Dataclass-based argument parsing
The brand-new DataclassArgumentParser class supports DataclassBase types (basically, plain dataclass types with some built-in helpers) for building CLI applications from their statically typed attributes. This helps unify the parameters that components need to instantiate new instances, which can then be passed through the terminal.
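As an illustration, declaring parameters and parsing them from the terminal might look roughly like this (a sketch; the field metadata convention and the parse_args return value are assumptions, not confirmed APIs):

from dataclasses import dataclass, field

@dataclass
class TrainParams(DataclassBase):
    batch_size: int = field(default=32, metadata={"help": "Samples per batch."})
    lr: float = field(default=3e-4, metadata={"help": "Learning rate."})

parser = DataclassArgumentParser(TrainParams)
params = parser.parse_args()  # e.g. `train --batch_size 16 --lr 1e-4`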
Data Preprocessing Pipeline
Modular dataset converter
DatasetConverter is a base class for all dataset converters, used to map raw datasets into a format that is ready for binarization.
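For example, a converter for line-delimited JSON might look roughly like this (a sketch; the convert method name and signature are assumptions):

import json

class JsonlConverter(DatasetConverter):
    def convert(self, raw_path: str, output_path: str) -> None:
        # Map each raw record into the plain-text form the binarizer expects.
        with open(raw_path) as src, open(output_path, "w") as dst:
            for line in src:
                record = json.loads(line)
                dst.write(record["text"] + "\n")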
Modular dataset binarizer
Binarizer is now an abstract type that can be swapped for a concrete implementation. We recommend using 🤗Datasets, which provides a very smooth and efficient way to preprocess datasets.
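A concrete implementation built on 🤗Datasets might look roughly like this (a sketch; the binarize hook is an assumed interface, while load_dataset and map are real 🤗Datasets calls):

from datasets import load_dataset

class DatasetsBinarizer(Binarizer):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def binarize(self, path: str):
        dataset = load_dataset("text", data_files=path)["train"]
        # Tokenize in batches; 🤗Datasets caches and parallelizes this step.
        return dataset.map(lambda batch: self.tokenizer(batch["text"]), batched=True)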
Tokenization Pipeline
Tokenizers with special tokens
We added default special tokens to our TransformerTokenizerFast. Note that SpecialTokens is a CPP (Code Processing Pipelines) unit and is likely to be moved out of TransformerTokenizerFast.
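For reference, the defaults conceptually form a mapping like the one below (illustrative values only; the actual tokens shipped with TransformerTokenizerFast may differ):

# Illustrative defaults, not the library's exact token set.
SPECIAL_TOKENS = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
}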
Tokenizer trainable module
We added a special class TokenizerModule for defining the behavior of arbitrary tokenizers – namely, training and processing.
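A subclass wrapping a 🤗Tokenizers BPE model might define both behaviors roughly like this (a sketch; the train and process method names are assumptions based on the description above):

from tokenizers import Tokenizer, models, trainers

class BPETokenizerModule(TokenizerModule):
    def __init__(self):
        self.tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))

    def train(self, files):
        # Fit the BPE model on raw text files.
        trainer = trainers.BpeTrainer(special_tokens=["<unk>", "<pad>"])
        self.tokenizer.train(files, trainer)

    def process(self, text):
        # Encode text into token ids with the trained tokenizer.
        return self.tokenizer.encode(text).ids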
Training Pipeline
Task-based modules that define the model, data and tokenizer
We added the TaskModule class that represents your custom tasks. It expects a Tokenizer, a ModuleType, and a DataModuleType. See the TaskModule class documentation to learn more about the underlying types.
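Wiring a task together might then look like this (a sketch; the keyword argument names are assumptions derived from the description, not the actual signature):

task = TaskModule(
    tokenizer=my_tokenizer,        # a Tokenizer instance
    module_type=MyModule,          # ModuleType: the model class
    datamodule_type=MyDataModule,  # DataModuleType: the data module class
)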
Task-oriented trainer on top of PyTorch Lightning
Our trainer class TransformerTrainer is built on top of pytorch_lightning.Trainer to support our custom TaskModule objects and to provide smooth Lightning Trainer setup and monitoring.
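Since it builds on pytorch_lightning.Trainer, a run might look like this (a sketch; whether TransformerTrainer forwards Lightning's keyword arguments is an assumption):

trainer = TransformerTrainer(max_steps=10_000)
trainer.fit(task)  # task is the TaskModule sketched above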
Lightning Metrics instead of functional metrics
We introduced the Perplexity metric based on the Lightning Metrics module. This ensures that metric calculation is correct during DDP training.
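The idea behind a DDP-safe perplexity is to accumulate the summed negative log-likelihood and the token count as metric states, which are reduced across processes before compute. A rough sketch against the torchmetrics API (the standalone successor of Lightning Metrics), not the actual implementation:

import torch
from torchmetrics import Metric

class PerplexitySketch(Metric):
    def __init__(self):
        super().__init__()
        # Summed states are reduced across DDP processes before compute().
        self.add_state("nll_sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("token_count", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, nll: torch.Tensor, num_tokens: torch.Tensor) -> None:
        self.nll_sum += nll.sum()
        self.token_count += num_tokens.sum()

    def compute(self) -> torch.Tensor:
        return torch.exp(self.nll_sum / self.token_count)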
Step-based checkpoint saving callback
We added the SaveCheckpointAtStep callback, which saves your PyTorch checkpoint along with Lightning states at steps you define. It also lets you customize the monitor used to save the best checkpoint whenever the monitored metric improves.
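Attaching it might look like this (a sketch; the constructor argument names are assumptions based on the description):

callback = SaveCheckpointAtStep(
    save_step_frequency=1_000,  # assumed name: save every 1,000 steps
    monitor="val_loss",         # metric tracked to keep the best checkpoint
)
trainer = TransformerTrainer(callbacks=[callback])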
CLI Support
CLI subcommands with dataclass arg parsing
All CLI subcommands support dataclass-based parameter types for parsing user arguments.
Registering third-party packages & user directories with plugins
You can either pass a package name via the --package-name <name> argument, or pass a directory with your custom components via the --userdir <dir> argument. Note that once a user directory with plugins is received, we register all of its sub-components at the beginning of the runtime, so every one of them is visible (including in the CLI help output).
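For example, a module inside the user directory could register a custom component so the CLI picks it up at startup (a sketch reusing the Registrable example from above; the file layout is illustrative):

# <userdir>/my_plugins.py
@Animal.register("cat")
class Cat(Animal):
    ...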
Documentation
Basic CLI subcommands docs
We also started looking into building the docs page. Some early drafts of our CLI docs can be found under docs/cli.
Testing
CLI subcommands smoke testing
The CLI application ships with basic smoke tests to make sure the pipelines are runnable and deterministic. We plan to add validation tests in the coming releases.
Additional Info
It's also worth noting that we moved to Notion to track our internal vision and milestones for this project. We'll likely keep supporting GitHub Issues and Projects to mirror the top-level items to the community. :]