IBM / aihwkit

IBM Analog Hardware Acceleration Kit
https://aihwkit.readthedocs.io
MIT License

Added Example 31 for GPT-2 model #663

Closed: gyulab closed this pull request 4 months ago

gyulab commented 4 months ago

Related issues

New example added for the GPT-2 demonstration

Description

This pull request adds a new example that runs the GPT-2 based transformer distilgpt2 on the wikitext-2-raw-v1 dataset using AIHWKit. The example demonstrates how to convert the model to analog, run training and inference, and visualize the performance metrics with TensorBoard.
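In outline, the conversion step looks like the sketch below. This is a minimal illustration rather than the exact code in this pull request; the PCM-like noise model, the drift compensation, and the w_noise value are assumptions chosen to mirror the noise options described here.

    # Minimal sketch: convert distilgpt2 to analog with AIHWKit.
    from transformers import AutoModelForCausalLM
    from aihwkit.nn.conversion import convert_to_analog
    from aihwkit.simulator.configs import InferenceRPUConfig
    from aihwkit.simulator.configs.utils import WeightNoiseType  # import path may vary by aihwkit version
    from aihwkit.inference import PCMLikeNoiseModel, GlobalDriftCompensation

    model = AutoModelForCausalLM.from_pretrained("distilgpt2")

    rpu_config = InferenceRPUConfig()
    rpu_config.noise_model = PCMLikeNoiseModel(g_max=25.0)      # programming/drift noise (assumed)
    rpu_config.drift_compensation = GlobalDriftCompensation()
    rpu_config.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
    rpu_config.forward.w_noise = 0.1                            # illustrative weight noise level

    # Replace every supported torch layer with its analog counterpart.
    analog_model = convert_to_analog(model, rpu_config)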

Details

Key Changes and Additions

  1. Model and Dataset:

    • Implemented an example using the smallest GPT-2 variant (distilgpt2).
    • Utilized the wikitext-2-raw-v1 dataset for training and validation, which is smaller and faster to process compared to openwebtext.
  2. Training and Inference Setup:

    • Configured the model to use analog inference with specified noise levels.
    • Added support for digital inference as an option.
    • Implemented preprocessing functions to handle dataset tokenization.
    • Provided functionality to train the model and save/load checkpoints (see the training sketch after this list).
  3. Logging and Monitoring:

    • Integrated TensorBoard for logging training and validation metrics.
    • Added TensorBoardCallback to the Trainer for seamless logging.
    • Configured the script to save logs in a specific directory and visualize them using TensorBoard.
  4. Performance Metrics:

    • Calculated validation loss and perplexity as the primary performance metrics.
    • Achieved a validation loss of 4.059, which corresponds to a perplexity of exp(4.059) ≈ 57.9.
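The training and evaluation flow described in items 2 through 4 can be sketched as follows, assuming standard Hugging Face datasets/transformers APIs. analog_model refers to the converted model from the earlier sketch; block_size, paths, and hyperparameters are illustrative rather than the PR's actual values.

    import math
    from datasets import load_dataset
    from transformers import AutoTokenizer, Trainer, TrainingArguments
    from transformers.integrations import TensorBoardCallback

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    raw = load_dataset("wikitext", "wikitext-2-raw-v1")

    def tokenize(examples):
        # Preprocessing: tokenize the raw text (re-chunked into fixed blocks below).
        return tokenizer(examples["text"])

    tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

    block_size = 128  # assumed context length

    def group_texts(examples):
        # Concatenate all tokens and split into fixed-size blocks;
        # for causal LM the labels are simply a copy of the inputs.
        concatenated = {k: sum(examples[k], []) for k in examples}
        total = (len(concatenated["input_ids"]) // block_size) * block_size
        result = {k: [t[i:i + block_size] for i in range(0, total, block_size)]
                  for k, t in concatenated.items()}
        result["labels"] = result["input_ids"].copy()
        return result

    lm_dataset = tokenized.map(group_texts, batched=True)

    training_args = TrainingArguments(
        output_dir="./checkpoints",  # checkpoints are saved here and can be reloaded
        learning_rate=5e-4,
        num_train_epochs=1,
        logging_dir="./logs/run1",   # TensorBoard reads logs from this directory
        report_to="none",            # TensorBoardCallback is attached explicitly below
    )
    trainer = Trainer(
        model=analog_model,
        args=training_args,
        train_dataset=lm_dataset["train"],
        eval_dataset=lm_dataset["validation"],
        callbacks=[TensorBoardCallback()],  # logs training/validation metrics
    )
    trainer.train()

    # Perplexity is the exponential of the mean cross-entropy loss.
    eval_loss = trainer.evaluate()["eval_loss"]
    print(f"validation perplexity: {math.exp(eval_loss):.2f}")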

README

Example 31 (31_gpt2_on_wikitext.py): This example is adapted from https://github.com/huggingface/notebooks/blob/main/examples/language_modeling.ipynb

The example loads a pre-trained GPT-2 model and trains it on the wikitext dataset. It then applies convert_to_analog() to examine the effects of drift_analog_weights() on inference performance at different weight noise levels. TensorBoard is used to display the perplexity metrics evaluated at various times after training has completed.
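A sketch of that drift study follows, assuming analog_model and trainer from the sketches above, and assuming the converted model exposes aihwkit's program_analog_weights()/drift_analog_weights() inference methods; the time points and tag names are illustrative.

    import math
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="./logs/run1")

    analog_model.eval()
    analog_model.program_analog_weights()  # program the conductances once after training

    # Re-evaluate at increasing times after programming: 1 s, 1 h, 1 day, 1 year.
    for t_inference in (1.0, 3600.0, 86400.0, 31536000.0):
        analog_model.drift_analog_weights(t_inference)  # apply conductance drift at time t
        perplexity = math.exp(trainer.evaluate()["eval_loss"])
        writer.add_scalar("perplexity_vs_drift", perplexity, global_step=int(t_inference))

    writer.close()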

Command-line arguments can be used to control certain options. For example, python /path/to/aihwkit/examples/31_gpt2_on_wikitext.py -n 0.1 -r "run 1" -l 0.0005 -t sets the weight noise to 0.1, names the run "run 1" in TensorBoard, sets the learning rate to 0.0005, and enables hardware-aware training.
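A hypothetical reconstruction of that interface with argparse (the short flag names follow the README; the long option names and defaults are assumptions):

    import argparse

    parser = argparse.ArgumentParser(description="Example 31: GPT-2 on wikitext with AIHWKit")
    parser.add_argument("-n", "--noise", type=float, default=0.0,
                        help="weight noise level used for analog inference")
    parser.add_argument("-r", "--run_name", type=str, default="run 0",
                        help="name of the run as displayed in TensorBoard")
    parser.add_argument("-l", "--learning_rate", type=float, default=5e-4,
                        help="learning rate for training")
    parser.add_argument("-t", "--hw_training", action="store_true",
                        help="perform hardware-aware training")
    args = parser.parse_args()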

gyulab commented 4 months ago

Revisions are needed for the pull request.