Now you can simply import conversation from the script, and the whole conversation is saved (minus the system prompt), so we have everything we need to fine-tune. I made an examples directory so people can contribute their own examples. As we collect more, we could in principle cluster them and provide models fine-tuned for specific workflows, but for now a flat directory makes sense.
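For concreteness, here is a minimal sketch of what saving an example might look like. The module name "script", the save_example helper, and the file path are all hypothetical; the post only says that conversation is importable from the script:

```python
import json

# Hypothetical module name; the post only says `conversation` is
# importable from the script.
from script import conversation

def save_example(path: str) -> None:
    # The system prompt is already excluded from `conversation`,
    # so the dump can go straight into the examples directory.
    with open(path, "w") as f:
        json.dump(conversation, f, indent=2)

save_example("examples/my_workflow.json")
```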
As I was logging conversations I noticed a few quirks, which I changed:
The original user prompt was being added at both the beginning and the end of the _continue loop, on every iteration. Now it is added only once, and it is no longer annotated "PROMPT".
The actual output of the assistant (the script) was labeled "system". I changed it to "assistant". It is important for future fine-tuning that we label only LLM output as "assistant", and that we label ALL LLM output we want to fine-tune on as "assistant".
The output of the script was labeled "assistant"; it is now "user". I don't like any of the labels user/assistant/system, but I think "user" is the least bad: from the LLM's perspective it gave the user a script, so it makes sense that the user would run it and paste the output back. The message is titled "ERROR" or "LAST SCRIPT OUTPUT", so I think the LLM will understand. I prefer to avoid multiple system prompts because not every endpoint supports them.
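Put together, one logged exchange now has roughly this shape (the message contents here are illustrative, only the role labels and titles come from the changes above):

```python
# Illustrative shape of one logged exchange after the relabeling:
conversation = [
    # The original user prompt, added exactly once, with no "PROMPT" annotation.
    {"role": "user", "content": "Plot the data in results.csv"},
    # All LLM output (the script) is labeled "assistant" so it can be
    # fine-tuned on.
    {"role": "assistant", "content": "import pandas as pd\n..."},
    # The script's output goes back as "user": from the LLM's perspective,
    # the user ran the script and pasted the result back. The title tells
    # the LLM what it is looking at.
    {"role": "user", "content": "LAST SCRIPT OUTPUT:\n<matplotlib figure saved>"},
]
```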
You can see these changes reflected in the uploaded examples. I don't have a principled way to verify this won't affect performance, but the information is mostly unchanged, and the model was still able to complete at least those two examples and a few other tests.