Closed: 0x000011b closed this issue 1 year ago
I have a feeling that this step is going to be the key to approaching CAI, especially the idea of injecting external knowledge...
Another report I got: the model calling the user random, unrelated names. Might be worth running an NLP toolkit over the data to see if there are any significant imbalances towards certain names, and trying to clean those up.
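A quick first pass could be done with the standard library before reaching for a full NLP toolkit. This is a minimal sketch: the `messages` list, the capitalized-token heuristic, and the skip-first-token rule are all assumptions for illustration; a proper NER pass (e.g. spaCy) would be more reliable on real data.

```python
from collections import Counter
import re

# Hypothetical sample of model replies; in practice this would be the
# parsed conversation logs from the dataset.
messages = [
    "Sure thing, Sarah! Here's the recipe.",
    "Of course, Sarah. Let me explain.",
    "No problem, Mike, happy to help.",
]

# Crude heuristic: treat capitalized tokens as candidate names, skipping
# the first capitalized token of each message as a rough proxy for
# sentence-initial words. This will still catch some non-names.
candidate_names = Counter()
for msg in messages:
    tokens = re.findall(r"\b[A-Z][a-z]+\b", msg)
    for tok in tokens[1:]:
        candidate_names[tok] += 1

# The most frequent candidates point at possible imbalances to clean up.
print(candidate_names.most_common(3))
```

Even this rough count would surface a name that dominates the data; anything flagged could then be down-sampled or anonymized.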
As for non-conversational data: a recent paper from Google suggests that starting from an instruction-tuned model rather than a plain pre-trained LM can actually improve downstream task performance when performing further fine-tuning. See the Twitter thread about this, plus the relevant arXiv paper and code repository.
With respect to more data sources, some thoughts:
I have time to help out with either of these if it would be useful.
Hey @lloorree! Indeed, forums seem like they might be a good source. The community has contributed around 350 MB of forum posts that I'll attempt to write some parsing code for. It's all in SQLite databases, so the code will be somewhat similar to the Discord DHT parsing stuff I wrote. If you're interested in helping out with that, let me know and I can send you a sample.
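For a rough idea of what that parsing code might look like: this is a minimal sketch using the stdlib `sqlite3` module, grouping posts into threads in posting order. The `posts` table schema (and the sample rows) are assumptions for illustration; the real contributed databases will have their own layout.

```python
import sqlite3

# Assumed schema for illustration; the actual forum dumps will differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (thread_id INTEGER, author TEXT, body TEXT, posted_at TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "Has anyone tried fine-tuning on forum data?", "2023-01-01"),
        (1, "bob", "Yes, it works reasonably well.", "2023-01-02"),
    ],
)

def iter_threads(conn):
    """Yield (thread_id, [(author, body), ...]) with posts in posting order."""
    cur = conn.execute(
        "SELECT thread_id, author, body FROM posts ORDER BY thread_id, posted_at"
    )
    current_id, turns = None, []
    for thread_id, author, body in cur:
        if thread_id != current_id and turns:
            yield current_id, turns
            turns = []
        current_id = thread_id
        turns.append((author, body))
    if turns:
        yield current_id, turns

# Each thread becomes one multi-turn conversation for the dataset.
for thread_id, turns in iter_threads(conn):
    print(thread_id, len(turns))
```

The thread-grouping step is the part that carries over regardless of schema: each forum thread maps naturally onto one multi-turn training conversation.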
As for TV stuff, it didn't seem that great to me because a big portion of the context is what's happening on screen, so if you take just the text/dialogue, it's usually pretty bland and uninteresting.
Hey! That would be perfect, thanks. The main other thing I'd need to get started would be a few lines of one of the correctly parsed output files to double-check that what I'm getting running locally is right.
Fair enough about the dialogue.
@lloorree: I created an issue to track the implementation of the forum datasets (#4). Let me know if you're interested in giving it a shot; if you reach out to me on Matrix, I can help get you situated with the data-toolbox repo and share some example files.
Old discussion, very informative, but closing it so that I can tidy up this repo.
I haven't been able to make any significant improvements on the models by twiddling with hyperparameters and training objectives since around experiment 2, so I'm going to shift focus to improving the training data instead.
Some relevant points to consider: