This PR will add the Claude Evol-Instruct dataset as an instruction-following task, consisting of over 210,000 examples. It was relatively simple to process, aside from having to handle, basically on a per-example basis, the name "Claude" being inserted everywhere and some indirect refusals ("I don't actually have...").
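For reference, a rough sketch of the kind of per-example cleanup described above (the marker phrases and the `is_usable` helper are illustrative assumptions, not the exact filters used in this PR):

```python
import re

# Hypothetical sketch of the per-example cleanup: drop examples where the
# response is an indirect refusal or leaks the assistant's name.

# Phrases that signal an indirect refusal rather than a real answer.
REFUSAL_MARKERS = [
    "i don't actually have",
    "i do not actually have",
    "as an ai assistant",
]

# Matches self-references to the model's name.
CLAUDE_PATTERN = re.compile(r"\bClaude\b")


def is_usable(instruction: str, response: str) -> bool:
    """Return False for examples that refuse or mention the assistant's name."""
    lowered = response.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return False
    if CLAUDE_PATTERN.search(response):
        return False
    return True
```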
To be honest, after looking at this dataset, I'm not entirely sure it's top quality. Claude might just not be smart enough to answer the more complicated instructions, and the data itself is also somewhat messy. Regardless, I'll likely include this in V5, and if the resulting model is dumber than our current Pyg2, we can simply exclude this task from future versions of the dataset.
In other news, I missed a few errors in the system prompting, and it seems like certain prompts show up more prominently than others. I'll fix that in upcoming commits directly on the feat/experimental-data-format branch.