RLHF to train reward model?

PWhiddy / PokemonRedExperiments

Playing Pokemon Red with Reinforcement Learning

MIT License


RLHF to train reward model? #101

Open Iron-Bound opened 1 year ago

Iron-Bound commented 1 year ago

For the reward function, would there be interest in using RLHF to train a dedicated reward model? From my research, we can do this in either of two ways:

Have a human rank short clips of gameplay and select the preferred one.

Use video from a speedrun or from a human playing live.

Given that my training got stuck in Oak's lab for 50 iterations, I've been thinking about how to reward things without hard-coding them: running away when low on health, avoiding trainers, one-way paths, avoiding buying that magic carpet, etc.
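A minimal sketch of the first option above (a human picks the better of two short clips), assuming a linear reward over hand-picked per-frame features and a Bradley-Terry style preference loss. The feature names, the helper functions, and the synthetic "labeler" are all hypothetical stand-ins for illustration; none of them come from the repository, and a real setup would likely use a neural network over game frames or RAM state.

```python
import numpy as np

# Hypothetical per-frame features, e.g. [new_tiles_explored, hp_lost].
# These names are illustrative and do not come from the repository.
FEATURE_DIM = 2


def clip_return(weights, clip):
    """Predicted reward of a clip: sum over frames of a linear per-frame reward."""
    return float(clip.sum(axis=0) @ weights)


def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit weights so preferred clips score higher (Bradley-Terry / logistic loss).

    pairs: list of (preferred_clip, rejected_clip), each an array of shape (T, dim).
    """
    weights = np.zeros(dim)
    # Difference of summed features, preferred minus rejected, one row per pair.
    diffs = np.stack([win.sum(axis=0) - lose.sum(axis=0) for win, lose in pairs])
    for _ in range(epochs):
        # P(preferred beats rejected) under the current reward model.
        p_win = 1.0 / (1.0 + np.exp(-(diffs @ weights)))
        # Gradient of the mean negative log-likelihood -log(p_win).
        grad = ((p_win - 1.0)[:, None] * diffs).mean(axis=0)
        weights -= lr * grad
    return weights


def make_synthetic_pairs(n_pairs, frames=8, seed=0):
    """Stand-in for human labels: a hidden preference picks the better clip."""
    rng = np.random.default_rng(seed)
    hidden = np.array([1.0, -2.0])  # assumed: likes exploration, dislikes losing HP
    pairs = []
    for _ in range(n_pairs):
        a = rng.random((frames, FEATURE_DIM))
        b = rng.random((frames, FEATURE_DIM))
        if clip_return(hidden, a) >= clip_return(hidden, b):
            pairs.append((a, b))
        else:
            pairs.append((b, a))
    return pairs


pairs = make_synthetic_pairs(200)
weights = train_reward_model(pairs, FEATURE_DIM)
# Fraction of labeled pairs where the learned model ranks the preferred clip higher.
agreement = np.mean(
    [clip_return(weights, win) > clip_return(weights, lose) for win, lose in pairs]
)
print("learned weights:", weights, "agreement with labels:", agreement)
```

In a setup like this, `clip_return` (or a per-frame version of it) could supplement the hand-written reward during RL training, with real human rankings replacing `make_synthetic_pairs`.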

PWhiddy commented 1 year ago

Do you still get stuck in the lab with the new fast training script? It should get out of there much more quickly.

But yes, I have been thinking a bit about reward modeling / RLHF, and that would be really cool! It would certainly be a very serious amount of work to set up and get working, and it would require a ton of labeling, but it could potentially address a lot of challenges. It also opens up the chance to involve a lot more non-technical folks who are interested in contributing to the project. Brings back more of the "twitch plays pokemon" elements.

Iron-Bound commented 1 year ago

> Do you still get stuck in the lab with the new fast training script?

It's much better now and a welcome surprise 😁

> Brings back more of the "twitch plays pokemon"

Sentdex did a GTA 5 bot that also had a reset function.

At the moment I'm trying to find existing frameworks for the human-feedback (HF) part of this, and the closest I've found are in robotics.

I'm thinking maybe the interactive mode could be modified as well, or we could build a sandbox to train on Mt. Moon?

trantrikien239 commented 1 year ago

I think it does not necessarily require a ton of labeling, but it will need the game to have long-term memory.