In running experiments on IMDB, I found very high variance in the validation and test set results, and I don't fully understand it, so I'm looking for some advice.
Here, I've run PPO for 10 seeds using the default hyperparameters.
First of all, two things are clear:
- There is a large variance in performance at epoch 0, which could be explained by randomness in the eval sampling during decoding.
- There is a large variance in performance at epoch 50, which could be explained by randomness in RL training.
Taken together, though, we see that runs that perform best at epoch 0 generally also perform best on perplexity at epoch 50, which I can't explain. Here are the top 5 and bottom 5 runs by initial perplexity, plotted against each other.
Given that all models should be initialized from the same pretrained checkpoint, there should be no randomness in initialization, so I'm confused about how this is possible. Getting a lucky random seed for the initial validation should not affect the random seed for RL training, so why does the model that performs best at epoch 0 generally also perform best at epoch 50?
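One hypothesis worth checking (I haven't confirmed this against the actual RL4LMs code, so treat it as a sketch): if the seed also determines which val/test examples are drawn, then each run's epoch-0 and epoch-50 scores are both measured on the same per-seed subset. A seed that happens to draw an "easy" subset would then rank well at both epochs, producing exactly this correlation even though training randomness is independent of the initial eval. The toy simulation below (all numbers are made up for illustration) shows the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n_seeds = 10

# Hypothetical: each seed draws its own val subset, whose intrinsic
# difficulty shifts the measured perplexity by a fixed per-seed offset.
subset_difficulty = rng.normal(0.0, 0.5, n_seeds)

# Independent noise at each evaluation (decoding sampling at epoch 0,
# decoding + RL training randomness at epoch 50).
epoch0 = 35.0 + subset_difficulty + rng.normal(0.0, 0.1, n_seeds)
epoch50 = 33.0 + subset_difficulty + rng.normal(0.0, 0.3, n_seeds)

# Because the per-seed subset term is shared between the two evals,
# epoch-0 and epoch-50 scores correlate even though the training noise
# is independent of the initial evaluation.
r = np.corrcoef(epoch0, epoch50)[0, 1]
print(f"correlation across seeds: {r:.2f}")
```

If this is what's happening, evaluating every seed on one fixed, shared val/test split should make the correlation disappear.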
Finally, I think the variance in results is high enough that I would recommend using 10 seeds for RL4LMs experiments.
So each run has several sources of randomness:
- Dataset creation: we randomly select the val and test samples from the large original dataset.
- Decoding: tokens are sampled stochastically, both at epoch 0 and at epoch 50.
- PPO: episode generation is random.
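The three sources above can be sketched as follows. This is an illustrative mock-up, not the RL4LMs implementation; the function names, pool size, and split sizes are all placeholders:

```python
import random

import numpy as np


def split_val_test(pool, n_val, n_test, seed):
    # Randomness source 1: which examples land in val/test depends on
    # the seed, so each seeded run may be scored on a different subset.
    rng = random.Random(seed)
    sampled = rng.sample(pool, n_val + n_test)
    return sampled[:n_val], sampled[n_val:]


def sample_token(logits, rng):
    # Randomness source 2: stochastic token sampling during decoding
    # (this happens at both the epoch-0 and epoch-50 evaluations).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


# Randomness source 3: PPO episode generation (prompt order, rollout
# sampling) would draw from the same seeded generators during training.

pool = list(range(25_000))  # stand-in for the IMDB example pool
val, test = split_val_test(pool, n_val=5, n_test=5, seed=42)

gen = np.random.default_rng(42)
tok = sample_token(np.array([1.0, 2.0, 0.5]), gen)
print(val, test, tok)
```

Note that if all three draws come from one shared seeded generator (rather than independently seeded ones), the dataset split and the training randomness become coupled, which matters for interpreting the per-seed results.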
Also, could you share the exact mean and SD of these runs, along with the corresponding config? We can double-check from our side too.
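For reporting, something like the snippet below would do; the score values here are placeholders, not the actual run results:

```python
import statistics

# Placeholder final-perplexity values for the 10 seeds (NOT real data;
# substitute the actual per-seed results here).
scores = [32.1, 33.4, 31.8, 34.0, 32.7, 33.1, 31.5, 34.6, 32.9, 33.3]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample SD (n - 1 denominator)
print(f"mean={mean:.2f}, sd={sd:.2f}, n={len(scores)}")
```

Reporting the sample SD (rather than the population SD) is the usual convention when the seeds are treated as a sample of possible runs.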