PygmalionAI / data-toolbox

Our data munging code.
GNU Affero General Public License v3.0

Investigate and possibly include some ParlAI data #5

Closed 0x000011b closed 1 year ago

0x000011b commented 1 year ago

Theoretically, all the data used to train Meta's BlenderBot 3 is available in the ParlAI library (repository). In practice, a significant number of "teachers" and "tasks" there are broken, so the configurations they've released which are supposed to replicate the BB3 training data don't actually work.

Still, a significant number of teachers can be made to work with small changes, and others work out of the box. We already have plenty of open-ended conversational data, so I'd like to see if we can find some good data for doing external knowledge grounding (so we can use it for long-term memories/internet search/world info/etc.).

silverriver commented 1 year ago

It seems that most abilities of a large pre-trained language model, such as dialogue generation or external knowledge grounding, come from the pre-training process itself.

That means a good pre-training model is all you need. Maybe you can consider building a decent pre-trained model first.

0x000011b commented 1 year ago

Maybe you can consider building a decent pre-trained model first.

Unfortunately that requires way more resources than I have access to, so for the foreseeable future my plan is to just fine-tune existing pretrained LMs. You're right about this though:

It seems that most abilities of a large pre-trained language model [...] come from the pre-training process itself.

The idea here is to get some data that will help make use of those abilities at inference time by injecting external knowledge between special tokens or something of the sort, so that, for example, features like World Info on Kobold can work more reliably.
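A minimal sketch of what that injection might look like when building training examples. The special token names (`<|knowledge|>`, `<|endofknowledge|>`) and the function are hypothetical illustrations, not actual data-toolbox conventions:

```python
def build_training_example(knowledge: str, history: list[str], response: str) -> str:
    """Flatten a turn into one training string, with retrieved external
    knowledge wrapped in (hypothetical) special tokens so the model can
    learn to condition on it."""
    parts = []
    if knowledge:
        # Delimit the injected knowledge so it is distinguishable from dialogue.
        parts.append(f"<|knowledge|>{knowledge}<|endofknowledge|>")
    parts.extend(history)
    parts.append(response)
    return "\n".join(parts)

example = build_training_example(
    knowledge="The Eiffel Tower is 330 metres tall.",
    history=["User: How tall is the Eiffel Tower?"],
    response="Bot: It's about 330 metres tall."
)
```

At inference time, the same delimiters would let a frontend paste in World Info entries or search results in a format the model has already seen during fine-tuning.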

0x000011b commented 1 year ago

Done, data from ParlAI's SearchDialogueGenerationTeacher can now be used to generate training examples.
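For context, ParlAI teachers emit turns as message dicts with `text` and `labels` fields. A hedged sketch of flattening one such message into a training example (any fields beyond `text`/`labels`, and the helper itself, are assumptions for illustration, not the actual toolbox code):

```python
def parlai_message_to_example(msg: dict) -> str:
    """Turn a ParlAI-style message dict into a flat training string.

    ParlAI teachers conventionally put the prompt in 'text' and the
    gold response(s) in 'labels'; we take the first label.
    """
    text = msg.get("text", "")
    label = (msg.get("labels") or [""])[0]
    return f"{text}\n{label}"

msg = {"text": "Who wrote Dune?", "labels": ["Frank Herbert wrote Dune."]}
print(parlai_message_to_example(msg))
```

A real converter would also carry over any retrieved-document fields the search teacher provides, so they can be wrapped in the knowledge delimiters described above.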