allenai / RL4LMs

A modular RL library to fine-tune language models to human preferences
https://rl4lms.apps.allenai.org/

Persistent Variance in IMDB #37

Open · mnoukhov opened this issue 1 year ago

mnoukhov commented 1 year ago

In running experiments on IMDB, I found very high variance in the validation and test set results that I don't fully understand, so I'm looking for some advice.

Here, I've run PPO with 10 seeds using the default hyperparameters:

[figure: validation/test metrics over training epochs for the 10 PPO seeds]
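For reference, this is roughly how I launched the runs. This is a sketch only: the script and config paths below are from the RL4LMs repo as I remember them and may need adjusting, and I'm not aware of a dedicated seed flag, so varying the seed per run via the config is an assumption here.

```python
# Sketch of launching 10 seeded PPO runs on IMDB (paths/flags may need
# checking against the repo; per-run seed handling is assumed, not a
# documented CLI option).
import subprocess

CONFIG = "scripts/training/task_configs/imdb_text_continuation/gpt2_ppo.yml"

for seed in range(10):
    subprocess.run(
        [
            "python", "scripts/training/train_text_generation.py",
            "--config_path", CONFIG,
            "--experiment_name", f"imdb_ppo_seed_{seed}",
        ],
        check=True,
    )
```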

First of all, it's clear that:

  1. there is a large variance in performance at epoch 0, which could be explained by randomness in the eval sampling during decoding (see the sketch after this list)
  2. there is a large variance in performance at epoch 50, which could be explained by randomness in RL training
But taken together, the runs that perform best at epoch 0 also generally perform best on perplexity at epoch 50, which I can't explain. Here are the top 5 and bottom 5 runs by initial perplexity score, plotted against each other:

[figure: epoch-0 vs. epoch-50 perplexity for the top 5 and bottom 5 runs by initial perplexity]

Given that all models should be initialized from the same pretrained checkpoint, there should be no randomness in initialization, so I'm confused as to how this is possible. Getting a lucky random seed for the initial validation should not affect the random seed for RL training, so why does the model that performs best at epoch 0 generally also perform best at epoch 50?
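To quantify this coupling, here's how I'd check the rank correlation between epoch-0 and epoch-50 perplexity across the 10 seeds. The values below are placeholders for illustration, not my actual results:

```python
from scipy.stats import spearmanr

# PLACEHOLDER per-seed validation perplexities -- substitute real values.
ppl_epoch0 = [35.1, 34.2, 36.0, 33.8, 35.5, 34.9, 36.3, 33.5, 35.0, 34.6]
ppl_epoch50 = [38.0, 36.9, 39.2, 36.1, 38.4, 37.5, 39.6, 35.8, 37.9, 37.2]

rho, p = spearmanr(ppl_epoch0, ppl_epoch50)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# If the eval and training phases were truly independent across seeds,
# rho should be near 0; a large positive rho would confirm the
# epoch-0/epoch-50 coupling visible in the plot above.
```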

Finally, I think the variance in results is high enough that I would recommend using 10 seeds for RL4LMs experiments.

rajcscw commented 1 year ago

Also, can you share the exact mean and standard deviation of these runs, along with the corresponding config? We can double-check from our side too.
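For example, something like this (a sketch; adjust to however your metrics are stored) would give us directly comparable numbers:

```python
import statistics

# PLACEHOLDER per-seed final test perplexities -- substitute real values.
final_ppl = [38.0, 36.9, 39.2, 36.1, 38.4, 37.5, 39.6, 35.8, 37.9, 37.2]

mean = statistics.mean(final_ppl)
sd = statistics.stdev(final_ppl)  # sample standard deviation over the 10 seeds
print(f"perplexity: {mean:.2f} +/- {sd:.2f} (n={len(final_ppl)})")
```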