DeepWok / mase

Machine-Learning Accelerator System Exploration Tools

Team 16: RL Search: #87

Closed by miguelbraganca 3 months ago

miguelbraganca commented 3 months ago

Reinforcement learning experiment configuration and execution documentation

This comment is intended to guide developers on how to configure and execute reinforcement learning experiments for mixed-precision optimization tasks.

Algorithm and Environment Configuration (e.g. mase/machop/configs/examples/search_vgg_rl.toml):

In this parameter section, specify the reinforcement learning algorithm (such as 'ppo'), the target environment (such as 'mixed_precision_paper'), and the training device (usually 'cuda' or 'cpu'). Other algorithm-related hyperparameters also need to be set here, such as the learning rate and the training step size.

Vectorized Environment Creation:

We implemented a vectorized environment to improve training efficiency. Specify the number of parallel environments (n_envs) and the maximum length of each episode (episode_max_len). episode_max_len does not affect the mixed_precision_paper environment as its episode length depends directly on the network.
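The idea behind vectorization can be sketched as below. This is a toy stand-in, not the actual mase environment or wrapper: several environment copies are stepped in lockstep so the policy sees a batch of observations per step, and finished episodes are auto-reset.

```python
import random

class ToyEnv:
    """Toy stand-in for an RL environment (not the mase one)."""
    def __init__(self, episode_max_len):
        self.episode_max_len = episode_max_len
        self.t = 0
    def reset(self):
        self.t = 0
        return 0.0  # initial observation
    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_max_len   # truncate at max length
        return random.random(), 1.0, done       # obs, reward, done

class VecEnv:
    """Runs n_envs copies in lockstep so the policy sees a batch of obs."""
    def __init__(self, n_envs, episode_max_len):
        self.envs = [ToyEnv(episode_max_len) for _ in range(n_envs)]
    def reset(self):
        return [e.reset() for e in self.envs]
    def step(self, actions):
        results = [e.step(a) for e, a in zip(self.envs, actions)]
        obs, rewards, dones = map(list, zip(*results))
        # Auto-reset finished environments, as vectorized wrappers usually do
        for i, d in enumerate(dones):
            if d:
                obs[i] = self.envs[i].reset()
        return obs, rewards, dones

venv = VecEnv(n_envs=4, episode_max_len=8)
obs = venv.reset()
obs, rewards, dones = venv.step([0] * 4)
```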

Model Training and Evaluation:

Define key parameters in the model training process, such as total time steps (total_timesteps), evaluation frequency (eval_freq) and model saving frequency (save_freq). This section also involves the initialization of callback functions for training monitoring, model saving, and performance evaluation.
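The interaction between these three parameters can be sketched with a toy callback (the class and method names are illustrative, not the actual mase callback API): evaluation fires every eval_freq steps and checkpointing every save_freq steps, over total_timesteps steps of training.

```python
class PeriodicCallback:
    """Toy callback: evaluate every eval_freq steps, save every save_freq steps."""
    def __init__(self, eval_freq, save_freq):
        self.eval_freq = eval_freq
        self.save_freq = save_freq
        self.evals = 0
        self.saves = 0
    def on_step(self, step):
        if step % self.eval_freq == 0:
            self.evals += 1   # would run an evaluation episode here
        if step % self.save_freq == 0:
            self.saves += 1   # would checkpoint the model here

cb = PeriodicCallback(eval_freq=100, save_freq=500)
total_timesteps = 1000
for step in range(1, total_timesteps + 1):
    # ... collect rollout, update policy ...
    cb.on_step(step)
```

With these values the callback evaluates 10 times and checkpoints twice over the run.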

For RL training

Command line: ./ch search --config configs/examples/search_vgg_rl.toml

Use Wandb for monitoring:

If Weights & Biases (Wandb) is integrated into the experiment (by setting wandb_callback = true and adding your wandb entity name in wandb_entity), you can monitor the experiment progress and performance indicators in real time on the Wandb platform.
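A hedged sketch of how such a toggle might be read is shown below. The flag names match the ones above, but the build_callbacks helper is hypothetical; in a real run the branch would call wandb.init and attach a Wandb callback rather than just recording the intent.

```python
def build_callbacks(config):
    """Attach a Wandb callback only when the config enables it (sketch)."""
    callbacks = []
    if config.get("wandb_callback", False):
        entity = config["wandb_entity"]
        # A real implementation would call wandb.init(entity=entity, ...)
        # and append a Wandb callback; here we just record the intent.
        callbacks.append(("wandb", entity))
    return callbacks

callbacks = build_callbacks({"wandb_callback": True, "wandb_entity": "my-team"})
```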

Evaluate model performance:

The evaluation logic is set in the callback function, which regularly evaluates model performance and saves the best model. By analyzing the saved models and the performance data recorded by Wandb, an in-depth analysis of the algorithm's effect can be performed.
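The "save the best model" part of that logic can be sketched as below (the class is a hypothetical stand-in, not the mase implementation): a tracker keeps only checkpoints whose evaluation score improves on the best seen so far.

```python
class BestModelTracker:
    """Keeps the best evaluation score seen so far (sketch)."""
    def __init__(self):
        self.best_score = float("-inf")
        self.best_params = None
    def update(self, score, params):
        # Only keep the checkpoint if it improves on the best score so far;
        # a real implementation would write the model to disk here.
        if score > self.best_score:
            self.best_score = score
            self.best_params = dict(params)
            return True
        return False

tracker = BestModelTracker()
tracker.update(0.71, {"step": 100})
tracker.update(0.68, {"step": 200})  # not an improvement; ignored
```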

Save models and data visualizations:

After the experiment completes, the final model is saved to the specified path. You can use Wandb and/or TensorBoard for data visualization, to facilitate analysis of model performance and the various metrics recorded during training.

This PR also includes: