This PR contains a lot of changes. Normally, we'd split release development into smaller PRs, which makes new features, changes, and enhancements friendlier to review. That didn't happen this time; we'll improve on that in the next one. ;]
We focused on making the data processing and training pipelines modular to support a broader variety of tasks. We also added a CLI that supports user packages and plugins for simpler interaction with built-in and custom tasks.
Patch Notes
Common
Registrable components
From now on, you can create classes with a registry that supports name-based retrieval of registered types.
class Animal(Registrable):
    ...

# Registering Dog under the name "dog" makes it retrievable by that name.
@Animal.register("dog")
class Dog(Animal):
    ...
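Registered types can then be retrieved by name. A minimal sketch, assuming an AllenNLP-style by_name lookup (the actual method name in this library may differ):

dog_cls = Animal.by_name("dog")  # hypothetical lookup; returns the Dog class
dog = dog_cls()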
This pattern is used widely across the library to keep components modular.
Registrable subcommands
Subcommands are also registrable types and work with the brand-new DataclassArgumentParser.
Dataclass-based argument parsing
The brand-new DataclassArgumentParser class supports DataclassBase types (basically, plain dataclass types with some built-in helpers) for building CLI applications from their statically typed attributes. This helps unify the parameters that components need to instantiate new instances, which can then be passed through the terminal.
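As an illustration, declaring parameters and parsing them from the terminal might look roughly like this (a sketch; the field metadata convention and the parse_args return value are assumptions, not confirmed APIs):

from dataclasses import dataclass, field

@dataclass
class TrainParams(DataclassBase):
    batch_size: int = field(default=32, metadata={"help": "Samples per batch."})
    lr: float = field(default=3e-4, metadata={"help": "Learning rate."})

parser = DataclassArgumentParser(TrainParams)
params = parser.parse_args()  # e.g. `train --batch_size 16 --lr 1e-4`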
Data Preprocessing Pipeline
Modular dataset converter
DatasetConverter is a base class for all dataset converters, used to map raw datasets into a format that is ready for binarization.
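For example, a converter for line-delimited JSON might look roughly like this (a sketch; the convert method name and signature are assumptions):

import json

class JsonlConverter(DatasetConverter):
    def convert(self, raw_path: str, output_path: str) -> None:
        # Map each raw record into the plain-text form the binarizer expects.
        with open(raw_path) as src, open(output_path, "w") as dst:
            for line in src:
                record = json.loads(line)
                dst.write(record["text"] + "\n")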
Modular dataset binarizer
Binarizer is now an abstract type that can be swapped for a concrete implementation. We recommend using 🤗Datasets, which provides a very smooth and efficient way to preprocess datasets.
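A concrete implementation built on 🤗Datasets might look roughly like this (a sketch; the binarize hook is an assumed interface, while load_dataset and map are real 🤗Datasets calls):

from datasets import load_dataset

class DatasetsBinarizer(Binarizer):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def binarize(self, path: str):
        dataset = load_dataset("text", data_files=path)["train"]
        # Tokenize in batches; 🤗Datasets caches and parallelizes this step.
        return dataset.map(lambda batch: self.tokenizer(batch["text"]), batched=True)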
Tokenization Pipeline
Tokenizers with special tokens
We added default special tokens to our TransformerTokenizerFast. Note that SpecialTokens is a CPP (Code Processing Pipelines) unit and is likely to be moved out of TransformerTokenizerFast.
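For reference, the defaults conceptually form a mapping like the one below (illustrative values only; the actual tokens shipped with TransformerTokenizerFast may differ):

# Illustrative defaults, not the library's exact token set.
SPECIAL_TOKENS = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
}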
Tokenizer trainable module
We added a special class TokenizerModule for defining the behavior of arbitrary tokenizers – namely, training and processing.
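A subclass wrapping a 🤗Tokenizers BPE model might define both behaviors roughly like this (a sketch; the train and process method names are assumptions based on the description above):

from tokenizers import Tokenizer, models, trainers

class BPETokenizerModule(TokenizerModule):
    def __init__(self):
        self.tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))

    def train(self, files):
        # Fit the BPE model on raw text files.
        trainer = trainers.BpeTrainer(special_tokens=["<unk>", "<pad>"])
        self.tokenizer.train(files, trainer)

    def process(self, text):
        # Encode text into token ids with the trained tokenizer.
        return self.tokenizer.encode(text).ids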
Training Pipeline
Task-based modules that define the model, data and tokenizer
We added the TaskModule class that represents your custom tasks. It expects a Tokenizer, a ModuleType, and a DataModuleType. See the TaskModule class documentation to learn more about the underlying types.
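Wiring a task together might then look like this (a sketch; the keyword argument names are assumptions derived from the description, not the actual signature):

task = TaskModule(
    tokenizer=my_tokenizer,        # a Tokenizer instance
    module_type=MyModule,          # ModuleType: the model class
    datamodule_type=MyDataModule,  # DataModuleType: the data module class
)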
Task-oriented trainer on top of PyTorch Lightning
Our trainer class TransformerTrainer is built on top of pytorch_lightning.Trainer to support our custom TaskModule objects and to provide smooth Lightning Trainer setup and monitoring.
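Since it builds on pytorch_lightning.Trainer, a run might look like this (a sketch; whether TransformerTrainer forwards Lightning's keyword arguments is an assumption):

trainer = TransformerTrainer(max_steps=10_000)
trainer.fit(task)  # task is the TaskModule sketched above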
Lightning Metrics instead of functional metrics
We introduced the Perplexity metric based on the Lightning Metrics module. This ensures that metric calculation is correct during DDP training.
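The idea behind a DDP-safe perplexity is to accumulate the summed negative log-likelihood and the token count as metric states, which are reduced across processes before compute. A rough sketch against the torchmetrics API (the standalone successor of Lightning Metrics), not the actual implementation:

import torch
from torchmetrics import Metric

class PerplexitySketch(Metric):
    def __init__(self):
        super().__init__()
        # Summed states are reduced across DDP processes before compute().
        self.add_state("nll_sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("token_count", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, nll: torch.Tensor, num_tokens: torch.Tensor) -> None:
        self.nll_sum += nll.sum()
        self.token_count += num_tokens.sum()

    def compute(self) -> torch.Tensor:
        return torch.exp(self.nll_sum / self.token_count)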
Step-based checkpoint saving callback
We added the SaveCheckpointAtStep callback, which saves your PyTorch checkpoint along with Lightning states at steps you define. It also lets you customize the monitor used to save the best checkpoint whenever the monitored metric improves.
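Attaching it might look like this (a sketch; the constructor argument names are assumptions based on the description):

callback = SaveCheckpointAtStep(
    save_step_frequency=1_000,  # assumed name: save every 1,000 steps
    monitor="val_loss",         # metric tracked to keep the best checkpoint
)
trainer = TransformerTrainer(callbacks=[callback])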
CLI Support
CLI subcommands with dataclass arg parsing
All CLI subcommands support dataclass-based parameter types for parsing user arguments.
Registering third-party packages & user directories with plugins
You can either pass a package name via the --package-name <name> argument, or pass a directory with your custom components via the --userdir <dir> argument. Note that once a user directory with plugins is received, we register all of its sub-components at the beginning of the runtime, so every one of them is visible (including in the CLI help output).
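For example, a module inside the user directory could register a custom component so the CLI picks it up at startup (a sketch reusing the Registrable example from above; the file layout is illustrative):

# <userdir>/my_plugins.py
@Animal.register("cat")
class Cat(Animal):
    ...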
Documentation
Basic CLI subcommands docs
We also started looking into building the docs page. Some early drafts of our CLI docs can be found under docs/cli.
Testing
CLI subcommands smoke testing
The CLI application ships with basic smoke tests to make sure the pipelines are runnable and deterministic. We plan to add validation tests in the coming releases.
Additional Info
It's also worth noting that we moved to Notion to track our internal vision and milestones for this project. We'll likely keep supporting GitHub Issues and Projects to mirror the top-level items to the community. :]