Improve training data

0x000011b commented 1 year ago

I haven't been able to make any significant improvements to the models by twiddling around with hyperparameters and training objectives since around experiment 2, so I'm going to shift my focus to improving the training data instead.

Some relevant points to consider:

[x] I'd like to improve handling of example dialogue.
- As of now this is being done in a really stupid way: during training the data is mostly discarded, and at inference time it's handled as regular chat history. Obviously not ideal.
[x] I'd like the model to stick closer to the example dialogues, even if the user responds in a completely different format.
- As of now, if you give a character a short greeting, it'll get stuck responding with short messages.
- If you add some example dialogue where the character is very descriptive and detailed, it'll stay that way for the first few messages and then degrade to short responses again as the example dialogue is pushed out of the chat history.
- Ideally, I'd like characters to follow the format in their example dialogue more closely even after a lot of conversation.
[x] It might be worth looking into new data, even if it's non-conversational.
- I'm thinking this might help the model generate more creative and interesting responses, given that dialogue datasets are usually boring as hell (hey. how have you been? good. thanks. ok nice talking to you)
[x] It would be nice to add some special tokens to be able to inject external knowledge that the model could use to ground its responses on.
- That way, Kobold users can make use of Author's Notes and World Info, and we could make use of internet search/long-term memory stores/whatever else on the official service.
- This might imply looking up new datasets to add (BlenderBot 3 paper might be useful here thanks to the retrieval and grounding modules + their relevant datasets) or generating synthetic data, so this is more of a long/medium-term goal instead.
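
To make the example-dialogue and external-knowledge points above concrete, here is a minimal sketch of a prompt builder that keeps example dialogue and injected knowledge pinned at the top of the context while trimming only the oldest chat turns. Everything in it is an assumption for illustration: the marker strings (`<EXAMPLE>`, `<KNOWLEDGE>`), the `build_prompt` name, and the character-based budget are hypothetical and are not the tokens or code used by the project.

```python
from typing import List

# Hypothetical marker strings; the thread does not name any actual special tokens.
EXAMPLE_OPEN, EXAMPLE_CLOSE = "<EXAMPLE>", "</EXAMPLE>"
KNOWLEDGE_OPEN, KNOWLEDGE_CLOSE = "<KNOWLEDGE>", "</KNOWLEDGE>"


def build_prompt(example_dialogue: List[str], knowledge: List[str],
                 history: List[str], max_chars: int) -> str:
    """Pin knowledge and example dialogue; drop the oldest history turns to fit."""
    pinned: List[str] = []
    if knowledge:
        pinned += [KNOWLEDGE_OPEN, *knowledge, KNOWLEDGE_CLOSE]
    if example_dialogue:
        pinned += [EXAMPLE_OPEN, *example_dialogue, EXAMPLE_CLOSE]

    # Budget left for chat history after the pinned blocks (+1 per newline).
    budget = max_chars - sum(len(line) + 1 for line in pinned)

    kept: List[str] = []
    # Walk history from newest to oldest, keeping turns while they still fit.
    for turn in reversed(history):
        cost = len(turn) + 1
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost

    return "\n".join(pinned + list(reversed(kept)))
```

With this layout the example dialogue is never pushed out by a long conversation; only regular history is. A real implementation would budget in tokens rather than characters, and the model would need training data that uses the same markers.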

Silver267 commented 1 year ago

I have a feeling that this step is going to be the key towards approaching CAI, especially the idea of injecting external knowledge...

0x000011b commented 1 year ago

Another report I got: the model calls the user random, unrelated names. It might be worth running some NLP toolkit over the data to see if there are any significant imbalances towards certain names, and then try to clean that up.
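
As a rough first pass at the name check, something like the sketch below could surface over-represented names. It swaps the "NLP toolkit" for a plain regex heuristic (capitalized words that are not sentence-initial), so it will also pick up places and other proper nouns; a proper named-entity recognizer would be more accurate. The function name and the sample messages are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# A capitalized word preceded by a lowercase letter, comma, or semicolon plus a
# space, i.e. one that is not at the start of a sentence. Heuristic only.
MID_SENTENCE_CAPITALIZED = re.compile(r"(?<=[a-z,;] )([A-Z][a-z]+)\b")


def name_candidates(messages: Iterable[str], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count likely names across messages and return the most frequent ones."""
    counts: Counter = Counter()
    for message in messages:
        counts.update(MID_SENTENCE_CAPITALIZED.findall(message))
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = [
        "Good morning, Alice. How are you?",
        "I told Alice about it.",
        "Well, Bob, that is new.",
    ]
    for name, count in name_candidates(sample):
        print(name, count)
```

If a handful of names dominate the counts, those would be candidates for replacing with a placeholder or rebalancing.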

0x000011b commented 1 year ago

As for non-conversational data: a recent paper by Google seems to indicate that starting from an instruction-tuned model, rather than a regular pre-trained LM, might actually improve downstream task performance after further fine-tuning. Twitter thread about this, relevant arXiv paper and code repository.

lloorree commented 1 year ago

With respect to more data sources, some thoughts:

- Non-conversational sources could be parsed into conversational ones. Having used Common Crawl before, I think it would be feasible to choose some set of forum/social media/et cetera sites and parse them so that, for example, the original post in the forum is the scenario and each user is a speaker. This would also give longer, more detailed text than a lot of casual conversational datasets.
- Television and movie scripts. There are a lot of publicly available ones, and for TV episodes the blurbs could be the scenario. With way more time commitment you could even start going video -> generated script -> input data. I've been playing with this using the full scripts of a show, and even having a full series in the character definition JSON helps a lot.

I have time to help out with either of these if it would be useful.
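
For the forum idea, the mapping I have in mind could look roughly like the sketch below: the first post becomes the scenario and every later post becomes a speaker turn. The `Post` type, the `Scenario:` prefix, and the `thread_to_example` name are placeholders I made up; they are not the data-toolbox output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    author: str  # forum username, used as the speaker name
    text: str    # raw post body


def thread_to_example(posts: List[Post]) -> str:
    """Turn one forum thread into a scenario line followed by speaker turns."""
    if not posts:
        raise ValueError("thread has no posts")
    scenario, replies = posts[0], posts[1:]
    # Hypothetical output layout: scenario first, then "author: text" per reply.
    lines = [f"Scenario: {scenario.text.strip()}"]
    lines += [f"{post.author}: {post.text.strip()}" for post in replies]
    return "\n".join(lines)


if __name__ == "__main__":
    thread = [
        Post("op", "A storm hits the port. "),
        Post("ann", "I run for cover."),
        Post("ben", "I stay."),
    ]
    print(thread_to_example(thread))
```

Real threads would also need quote stripping, signature removal, and filtering of off-topic replies, which this leaves out.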

0x000011b commented 1 year ago

Hey @lloorree! Indeed, forums seem like they might be a good source. The community has contributed around 350MB of forum posts that I'll attempt to write some parsing code for. It's all in SQLite databases, so the code will be somewhat similar to the Discord DHT parsing stuff I wrote. If you're interested in helping out with that, let me know and I can send you a sample.
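
Reading posts out of a SQLite database and grouping them by thread could be sketched as below. The schema is an assumption: the `posts` table and its `thread_id`, `post_id`, `author`, and `body` columns are hypothetical and will differ from the contributed dumps. `demo_connection` only builds a throwaway in-memory database so the example runs on its own.

```python
import sqlite3
from itertools import groupby
from typing import Dict, List, Tuple


def load_threads(conn: sqlite3.Connection) -> Dict[int, List[Tuple[str, str]]]:
    """Return {thread_id: [(author, body), ...]} with posts in posting order."""
    # Hypothetical schema: posts(thread_id, post_id, author, body).
    rows = conn.execute(
        "SELECT thread_id, author, body FROM posts ORDER BY thread_id, post_id"
    )
    threads: Dict[int, List[Tuple[str, str]]] = {}
    # groupby relies on the ORDER BY above to keep each thread's rows adjacent.
    for thread_id, group in groupby(rows, key=lambda row: row[0]):
        threads[thread_id] = [(author, body) for _, author, body in group]
    return threads


def demo_connection() -> sqlite3.Connection:
    """Build a small in-memory database using the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (thread_id INTEGER, post_id INTEGER, author TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?)",
        [
            (2, 1, "cat", "Second thread opener"),
            (1, 2, "ann", "First reply"),
            (1, 1, "op", "First thread opener"),
        ],
    )
    return conn


if __name__ == "__main__":
    for thread_id, posts in load_threads(demo_connection()).items():
        print(thread_id, posts)
```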

As for TV stuff, it didn't seem that great to me because a big portion of the context is the stuff that's happening on screen, so if you take just the text/dialogue, it's usually pretty bland and uninteresting.

lloorree commented 1 year ago

Hey! That would be perfect, thanks. The other main thing I'd need to get started is a few lines from one of the correctly parsed output files, so I can double-check that what I'm getting when running locally is right.

Fair enough about the dialogue.

0x000011b commented 1 year ago

@lloorree: I created an issue to track the implementation of the forum datasets (#4). Let me know if you're interested in giving it a shot; if you reach out to me on Matrix, I can help get you situated with the data-toolbox repo and share some example files.

TearGosling commented 1 year ago

Old discussion, very informative, but closing so that I can tidy up this repo.

PygmalionAI / data-toolbox

Improve training data #2