OpenAdaptAI / OpenAdapt

Open Source Generative Process Automation (i.e. Generative RPA). AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models
https://www.OpenAdapt.AI
MIT License

Implement Reinforcement Learning with Inhuman Feedback #393

Open abrichr opened 1 year ago

abrichr commented 1 year ago

Feature request

https://github.com/CarperAI/trlx

Motivation

https://twitter.com/i/web/status/1668337702440165376


https://www.youtube.com/watch?v=DxInMGIvkp0

LaPetiteSouris commented 1 year ago

Step 1: Build a sample dataset

Step 2: Define a metric to decide if there is actually an improvement in the model.

Anything would do; I suggest picking the simplest one. However, my gut feeling is that we need to base this metric on our product. Our dataset consists of system events: window events, actions, and the active window. If we choose a generic LLM metric, it may be way off, since those metrics are measured on a different type of data.

A package like lm-eval can be used to evaluate the model. However, since our task is very specific, with a specific dataset, the key point is still to have a pre-training dataset as a baseline for reference.
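To make this concrete, here is a minimal sketch of what a product-specific metric could look like; the `ActionEvent` fields and the `next_action_accuracy` helper are illustrative only, not existing OpenAdapt code:

```python
# Hypothetical sketch of a product-specific metric: exact-match accuracy of the
# predicted next action against the recorded one. Field names are illustrative.
from dataclasses import dataclass


@dataclass
class ActionEvent:
    name: str          # e.g. "click", "press", "type"
    window_title: str  # active window at the time of the action
    target: str        # element or key that was acted on


def next_action_accuracy(predicted: list[ActionEvent], recorded: list[ActionEvent]) -> float:
    """Fraction of steps where the model predicted the recorded action exactly."""
    hits = sum(
        p.name == r.name and p.window_title == r.window_title and p.target == r.target
        for p, r in zip(predicted, recorded)
    )
    return hits / len(recorded) if recorded else 0.0
```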

Step 3: Design a reward function.

This is the key part of this task: in RL without human feedback, it is the reward function that plays the role of a human trainer to improve the outcome.

This part involves a lot of trial and error until we see improvement in the metrics defined in step 2.
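As a placeholder for that reward function, here is one very simple possibility (just my own illustration, not a decision): give partial credit for matching the recorded action's window, action type, and target.

```python
# Minimal reward sketch (the "inhuman feedback"): partial credit for matching the
# recorded action's window, action type, and target. The dict keys are assumed,
# not the actual OpenAdapt event schema.
def action_reward(predicted: dict, recorded: dict) -> float:
    """Reward in [0, 1] built from three equally weighted checks."""
    checks = [
        predicted.get("window") == recorded.get("window"),
        predicted.get("action") == recorded.get("action"),
        predicted.get("target") == recorded.get("target"),
    ]
    return sum(checks) / len(checks)


# Example: correct window and action, wrong target -> reward of 2/3.
print(action_reward(
    {"window": "Excel", "action": "click", "target": "A2"},
    {"window": "Excel", "action": "click", "target": "A1"},
))
```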

Document the whole process so that anyone on the team can reproduce and continue it.

This task looks open-ended, so maybe the best approach is to time-box it to avoid a tunnel effect.

abrichr commented 1 year ago

Thank you @LaPetiteSouris ! Sounds like we're on the right track.

Step 1: Build a sample dataset

I think we can implement everything before worrying about building the dataset per se. What we need is code that converts one or more recordings into a trainable dataset, e.g. in openadapt.ml.data. Once we've built out the rest of the pipeline we can think about increasing the size of the dataset, but keeping it small and focused can be helpful to start. You can create one yourself just by creating a recording of yourself doing a short, repetitive sequence of steps, e.g. 5 times.
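Something like the following is what I have in mind for the conversion step; the module path openadapt.ml.data and the serialization format are assumptions for illustration:

```python
# Sketch of a conversion helper (proposed for openadapt.ml.data): turn one
# recording, i.e. a time-ordered list of serialized events, into prompt/completion
# pairs where the prompt is the recent history and the completion is the next event.
def recording_to_pairs(events: list[str], history: int = 8) -> list[tuple[str, str]]:
    """Each pair is (last `history` events joined as the prompt, next event as the completion)."""
    pairs = []
    for i in range(1, len(events)):
        prompt = "\n".join(events[max(0, i - history):i])
        completion = events[i]
        pairs.append((prompt, completion))
    return pairs


# Toy example with hand-written serialized events:
events = [
    '{"window": "Excel", "action": "click", "target": "A1"}',
    '{"window": "Excel", "action": "type", "target": "42"}',
    '{"window": "Excel", "action": "press", "target": "Enter"}',
]
print(recording_to_pairs(events))
```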

Step 2: Define a metric to decide if there is actually an improvement in the model.

Using an off-the-shelf package sounds good. To clarify, are you referring to https://github.com/EleutherAI/lm-evaluation-harness ? My understanding is that we want to evaluate autoregressively. I believe this means that, given a recording from time t_i to t_k, we can train the model on t_i to t_j (where t_i < t_j < t_k), evaluate its ability to predict the event at t_{j+1}, and repeat for every t_j < t_{k-1}. Please confirm your understanding, and let me know if anything is unclear 🙏
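In other words, something along these lines; `train_model` and `predict_next` are placeholders for whatever training and inference code we end up with:

```python
# Rolling autoregressive evaluation sketch: train on events up to t_j, check the
# prediction for t_{j+1}, and slide j forward. The training/inference callables
# are placeholders, not existing code.
from typing import Callable, Sequence


def rolling_evaluation(
    events: Sequence[str],
    train_model: Callable[[Sequence[str]], object],
    predict_next: Callable[[object, Sequence[str]], str],
    start: int = 1,
) -> float:
    hits, total = 0, 0
    for j in range(start, len(events) - 1):
        model = train_model(events[: j + 1])          # train on t_i .. t_j
        prediction = predict_next(model, events[: j + 1])
        hits += int(prediction == events[j + 1])      # exact match against t_{j+1}
        total += 1
    return hits / total if total else 0.0
```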

Step 3: Design a reward function.

I think we don't need this if we use a prompt-completion dataset, as per https://github.com/CarperAI/trlx#using-a-prompt-completion-dataset , but I may be wrong. Can you please confirm?
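If that README section is still accurate, the call would look roughly like this; the samples format and keyword arguments are assumptions to verify against the installed trlx version:

```python
# Rough sketch of the prompt-completion route, modelled on the trlx README section
# linked above. Treat the exact keyword arguments as assumptions to double-check.
import trlx

# Prompt/completion pairs produced from a recording (see the dataset sketch above).
samples = [
    ['{"window": "Excel", "action": "click", "target": "A1"}\n', '{"window": "Excel", "action": "type", "target": "42"}'],
    ['{"window": "Excel", "action": "type", "target": "42"}\n', '{"window": "Excel", "action": "press", "target": "Enter"}'],
]

# Fine-tune a small model on the pairs; no reward function is needed in this mode.
trainer = trlx.train("gpt2", samples=samples)
```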

Excited to see what you come up with!

LaPetiteSouris commented 1 year ago

Step 1: All good. If we are OK with using a localized/miniature recording as the dataset, I am fine with that. This limits the scope of the task, which is in general a good thing.

Step 2

I guess what you describe would be done directly in step 1 (or thereabouts), to prepare a pre-training dataset by separating the records into sections of 70-15-15 (training set / validation set / test set). I don't know yet whether the library itself already handles the split, but that is a detail. If it does not, I will split the dataset myself; it should not be a big deal.

Let's see the performance output given by lm-evaluation-harness, and then we will have material for discussion.

Step 3

If we use a prompt-completion dataset, we don't need a reward function. But if I understand the idea correctly, we want something like "AI training AI", which is different from RL with human feedback. In the article by Yam Peleg, he uses a reward function in place of a prompt-answer dataset.

If we go with a prompt dataset, we can of course pass a few records (after processing) as prompt data for training. Otherwise, we have to come up with a reasonable reward function, which is of course a bit harder.
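For reference, the reward-function route would look roughly like this with trlx, mirroring the "Using a reward function" example in its README; the similarity-based reward is just my own stand-in:

```python
# Sketch of RL with a programmatic reward instead of a prompt-completion dataset.
# The reward (similarity to a recorded next action) is an assumption of ours, not
# something trlx provides; verify the reward_fn signature against the trlx version.
from difflib import SequenceMatcher

import trlx

RECORDED_NEXT_ACTION = '{"window": "Excel", "action": "press", "target": "Enter"}'


def reward_fn(samples, **kwargs):
    # Higher reward for generations that look more like the recorded next action.
    return [SequenceMatcher(None, s, RECORDED_NEXT_ACTION).ratio() for s in samples]


trainer = trlx.train("gpt2", reward_fn=reward_fn)
```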

@abrichr Thoughts ?

abrichr commented 1 year ago

separating the records into sections of 70-15-15 (training set / validation set / test set)

This is typical of supervised learning tasks. Since this is semi-supervised / autoregressive, I believe we will train on the full set. If you disagree, can you please point me to some documentation or literature that supports this type of training split when training transformers?

In the article by Yam Peleg, he uses a reward function in place of a prompt-answer dataset.

Can you please clarify here what exactly he says about this and where? 🙏

Edit: moved to https://github.com/OpenAdaptAI/OpenAdapt/issues/414

abrichr commented 1 year ago

RLAIF: https://www.anthropic.com/index/constitutional-ai-harmlessness-from-ai-feedback

LaPetiteSouris commented 1 year ago

@abrichr

You are correct, I was mistaken. Autoregressive training does not require the split, as the output is calculated from the forward pass; hence we can give the full recording as the dataset.

For the article on RLAIF and your question on the reward model, here is a recap:


In both cases (Yam Peleg and Anthropic), they create either a reward function or, in the more complex case of Anthropic, a reward model based on "guiding principles". I suggest we keep this for the next topic to limit the scope. For now, let's just:

  1. Create a sample dataset from a repetitive recording.
  2. Evaluate the model's baseline scores using lm-evaluation-harness.
  3. Perform RL directly on the prompt dataset from (1), and measure performance gains using the metric from (2).
  4. Repeat (2) and (3) until improvement is observed (a rough loop for this is sketched below).
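
A rough driver for (2)-(4), with `evaluate` and `fine_tune` as placeholders for the lm-evaluation-harness and trlx calls respectively:

```python
# Rough driver for steps (2)-(4): evaluate a baseline, fine-tune on the prompt
# dataset, re-evaluate, and stop once the metric improves, with a cap on rounds
# in line with the time-boxing suggestion above. Both callables are placeholders.
from typing import Callable, Sequence


def improve_until_better(
    model: object,
    samples: Sequence[tuple[str, str]],
    evaluate: Callable[[object], float],
    fine_tune: Callable[[object, Sequence[tuple[str, str]]], object],
    max_rounds: int = 5,
) -> object:
    baseline = evaluate(model)                          # step (2)
    for round_num in range(1, max_rounds + 1):          # steps (3) and (4)
        model = fine_tune(model, samples)
        score = evaluate(model)
        print(f"round {round_num}: score={score:.4f} (baseline={baseline:.4f})")
        if score > baseline:
            break
    return model
```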

To keep it simple, let's start with a single model for the moment. I suggest trying https://github.com/OpenAdaptAI/OpenAdapt/issues/391 as the model, as it may pave the way for this ticket as well.

WDYT?

@abrichr

abrichr commented 1 year ago

@LaPetiteSouris sounds great to me! 🙏