In running experiments on IMDB, I found very high variance in the validation and test set results, and I don't fully understand it, so I'm looking for some advice.
Here, I've run PPO for 10 seeds using the default hyperparameters.
First of all, two things are clear:
- There is a large variance in performance at epoch 0, which could be explained by randomness in the eval sampling during decoding.
- There is a large variance in performance at epoch 50, which could be explained by randomness in RL training.
Taken together, though, we see that runs that perform best at epoch 0 generally also perform best on perplexity at epoch 50, which I can't explain. Here are the top 5 and bottom 5 runs by initial perplexity, plotted against each other.
Given that all models should be initialized from the same pretrained checkpoint, there should be no randomness in initialization, so I'm confused about how this is possible. Getting a lucky random seed for the initial validation should not affect the random seed for RL training, so why does the model that performs best at epoch 0 generally also perform best at epoch 50?
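One hypothesis worth checking (I haven't confirmed this against the actual RL4LMs code, so treat it as a sketch): if the seed also determines which val/test examples are drawn, then each run's epoch-0 and epoch-50 scores are both measured on the same per-seed subset. A seed that happens to draw an "easy" subset would then rank well at both epochs, producing exactly this correlation even though training randomness is independent of the initial eval. The toy simulation below (all numbers are made up for illustration) shows the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n_seeds = 10

# Hypothetical: each seed draws its own val subset, whose intrinsic
# difficulty shifts the measured perplexity by a fixed per-seed offset.
subset_difficulty = rng.normal(0.0, 0.5, n_seeds)

# Independent noise at each evaluation (decoding sampling at epoch 0,
# decoding + RL training randomness at epoch 50).
epoch0 = 35.0 + subset_difficulty + rng.normal(0.0, 0.1, n_seeds)
epoch50 = 33.0 + subset_difficulty + rng.normal(0.0, 0.3, n_seeds)

# Because the per-seed subset term is shared between the two evals,
# epoch-0 and epoch-50 scores correlate even though the training noise
# is independent of the initial evaluation.
r = np.corrcoef(epoch0, epoch50)[0, 1]
print(f"correlation across seeds: {r:.2f}")
```

If this is what's happening, evaluating every seed on one fixed, shared val/test split should make the correlation disappear.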
Finally, I think the variance in results is high enough that I would recommend using 10 seeds for RL4LMs experiments.
So each run has several sources of randomness:
- Dataset creation: we randomly select the val and test samples from the large original dataset.
- Decoding: tokens are sampled stochastically, both at epoch 0 and at epoch 50.
- PPO: episode generation is random.
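The three sources above can be sketched as follows. This is an illustrative mock-up, not the RL4LMs implementation; the function names, pool size, and split sizes are all placeholders:

```python
import random

import numpy as np


def split_val_test(pool, n_val, n_test, seed):
    # Randomness source 1: which examples land in val/test depends on
    # the seed, so each seeded run may be scored on a different subset.
    rng = random.Random(seed)
    sampled = rng.sample(pool, n_val + n_test)
    return sampled[:n_val], sampled[n_val:]


def sample_token(logits, rng):
    # Randomness source 2: stochastic token sampling during decoding
    # (this happens at both the epoch-0 and epoch-50 evaluations).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


# Randomness source 3: PPO episode generation (prompt order, rollout
# sampling) would draw from the same seeded generators during training.

pool = list(range(25_000))  # stand-in for the IMDB example pool
val, test = split_val_test(pool, n_val=5, n_test=5, seed=42)

gen = np.random.default_rng(42)
tok = sample_token(np.array([1.0, 2.0, 0.5]), gen)
print(val, test, tok)
```

Note that if all three draws come from one shared seeded generator (rather than independently seeded ones), the dataset split and the training randomness become coupled, which matters for interpreting the per-seed results.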
Also, could you share the exact mean and SD of these runs, along with the corresponding config? We can double-check from our side too.
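For reporting, something like the snippet below would do; the score values here are placeholders, not the actual run results:

```python
import statistics

# Placeholder final-perplexity values for the 10 seeds (NOT real data;
# substitute the actual per-seed results here).
scores = [32.1, 33.4, 31.8, 34.0, 32.7, 33.1, 31.5, 34.6, 32.9, 33.3]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample SD (n - 1 denominator)
print(f"mean={mean:.2f}, sd={sd:.2f}, n={len(scores)}")
```

Reporting the sample SD (rather than the population SD) is the usual convention when the seeds are treated as a sample of possible runs.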