I would like to implement self-play dialogue training.
For that, I guess I need to modify the episode rollout process to add formatting, such as a speaker id at the start of each line. I'd also like to try keeping a buffer of previous model checkpoints and using one of them as one of the conversants, to keep the model from overfitting to itself.
The obvious place for this seems to be a new policy that returns formatted generation results and holds the previous checkpoints in a buffer.
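To make the idea concrete, here is a rough, library-agnostic sketch of what I have in mind. All names in it (`CheckpointBuffer`, `format_turn`, `rollout_dialogue`, the `Generator` type) are hypothetical and not part of any library; a "generator" here is just any callable that maps the dialogue so far to the next utterance, standing in for a real model's generate call.

```python
import random
from collections import deque
from typing import Callable, Deque, List, Tuple

# Hypothetical stand-in for a model: dialogue-so-far in, next utterance out.
Generator = Callable[[str], str]


class CheckpointBuffer:
    """Fixed-size pool of past opponents; the oldest entry is evicted first."""

    def __init__(self, max_size: int = 5, seed: int = 0) -> None:
        self._pool: Deque[Generator] = deque(maxlen=max_size)
        self._rng = random.Random(seed)

    def add(self, generator: Generator) -> None:
        self._pool.append(generator)

    def sample(self, fallback: Generator) -> Generator:
        # With an empty buffer, fall back to plain self-play.
        if not self._pool:
            return fallback
        return self._rng.choice(list(self._pool))

    def __len__(self) -> int:
        return len(self._pool)


def format_turn(speaker_id: str, text: str) -> str:
    """Prefix an utterance with its speaker id."""
    return f"{speaker_id}: {text.strip()}"


def rollout_dialogue(
    current: Generator,
    opponent: Generator,
    prompt: str = "",
    num_turns: int = 4,
    speakers: Tuple[str, str] = ("A", "B"),
) -> List[str]:
    """Alternate turns between the current model and a sampled opponent."""
    lines: List[str] = [prompt] if prompt else []
    agents = (current, opponent)
    for turn in range(num_turns):
        idx = turn % 2  # even turns: current model, odd turns: opponent
        context = "\n".join(lines)
        lines.append(format_turn(speakers[idx], agents[idx](context)))
    return lines
```

In a real setup, the callables would wrap the current policy and frozen copies of earlier checkpoints, and only the current model's turns would be scored and trained on.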
Is there any better place to implement this? Anything I should consider library-wise while implementing it?
Any advice would be appreciated, thanks in advance!