Iron-Bound opened 1 year ago
Do you still get stuck in the lab with the new fast training script? It should get out of there much more quickly.
But yes, I have been thinking a bit about reward modeling / RLHF, and that would be really cool! It would certainly be a serious amount of work to set up and get working, but it could potentially address a lot of challenges. It would require a ton of labeling, but it opens up the chance to involve a lot more non-technical folks who are interested in contributing to the project. It brings back more of the "Twitch Plays Pokemon" elements.
> Do you still get stuck in the lab with the new fast training script?
It's much better now and a welcome surprise 😁
> Brings back more of the "twitch plays pokemon"
Sentdex did a GTA 5 bot that also had a reset function.
ATM I'm trying to find existing frameworks for the human-feedback part of this, and the closest I've found has been in robotics.
I'm thinking maybe the interactive mode could be modified as well, or we could set up a sandbox to train on Mt. Moon?
I think it doesn't necessarily require a ton of labeling, but it will need the game to have long-term memory.
In terms of the reward function, would we be interested in using RLHF to train a dedicated reward model? From my research, we could do this by either:
- Having a human rank small clips of gameplay and select the preferred one.
- Using video from a speedrun or a human playing live.
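The first option is the standard RLHF recipe: fit a reward model so preferred clips score higher than rejected ones (a Bradley-Terry-style pairwise loss). Here's a minimal NumPy sketch of that idea; all names are hypothetical, the reward model is a toy linear scorer over 4-D "clip features" rather than real game observations, and the human labels are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(features, w):
    """Toy linear reward model: scalar score for a clip's feature vector."""
    return features @ w

def preference_loss(w, preferred, rejected):
    """Mean of -log sigmoid(r(preferred) - r(rejected)) over labeled pairs."""
    diff = reward(preferred, w) - reward(rejected, w)
    return np.mean(np.log1p(np.exp(-diff)))

def grad(w, preferred, rejected):
    """Gradient of the pairwise loss w.r.t. the reward weights."""
    diff = reward(preferred, w) - reward(rejected, w)
    s = 1.0 / (1.0 + np.exp(diff))  # sigmoid(-diff)
    return -(s[:, None] * (preferred - rejected)).mean(axis=0)

# Simulated labels: the "human" always prefers the clip with the
# larger first feature (standing in for e.g. exploration progress).
true_w = np.array([2.0, 0.0, 0.0, 0.0])
a = rng.normal(size=(256, 4))
b = rng.normal(size=(256, 4))
pref_a = (a @ true_w) > (b @ true_w)
preferred = np.where(pref_a[:, None], a, b)
rejected = np.where(pref_a[:, None], b, a)

# Plain gradient descent; the loss starts at log(2) ~= 0.693 and drops
# as the model learns to rank preferred clips above rejected ones.
w = np.zeros(4)
for _ in range(500):
    w -= 0.5 * grad(w, preferred, rejected)

print(preference_loss(w, preferred, rejected))
```

In practice the linear scorer would be a small network over frame stacks or embeddings, but the labeling interface and loss are the same, which is what makes the "rank two clips" workflow friendly to non-technical contributors.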
Given that my training got stuck in Oak's lab for 50 iterations, I've been thinking about how to reward behaviors without hard coding them: running away when low on health, avoiding trainers, one-way paths, avoiding buying that Magikarp, etc.
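One way to avoid hard coding each of those cases is to swap the environment's scripted reward for the learned model's score at `step()` time. A minimal sketch, assuming a gym-style `step()` API; `LearnedRewardWrapper`, `DummyEnv`, and the `reward_model` callable are all hypothetical stand-ins, not part of the existing project:

```python
class LearnedRewardWrapper:
    """Replaces the env's scripted reward with reward_model(obs)."""

    def __init__(self, env, reward_model):
        self.env = env
        self.reward_model = reward_model

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, _scripted_reward, done, info = self.env.step(action)
        # The learned score stands in for hand-coded terms like
        # "run away when low on health" or "avoid trainers".
        return obs, self.reward_model(obs), done, info

class DummyEnv:
    """Toy env whose observation is just a step counter."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 0.0, self.t >= 3, {}

env = LearnedRewardWrapper(DummyEnv(), reward_model=lambda obs: float(obs) * 0.1)
env.reset()
total, done = 0.0, False
while not done:
    obs, r, done, info = env.step(0)
    total += r
print(total)  # 0.1 + 0.2 + 0.3, about 0.6
```

The nice property is that the training loop doesn't change at all; whether the reward comes from a human-preference model or the current scripted terms is hidden behind the wrapper.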