PygmalionAI / data-toolbox

Our data munging code.
GNU Affero General Public License v3.0

Improve training data #2

Closed 0x000011b closed 1 year ago

0x000011b commented 1 year ago

I haven't been able to make any significant improvements on the models by twiddling around with hyperparameters and training objectives ever since around experiment 2, so I'm going to shift into focusing on improving the training data instead.

Some relevant points to consider:

Silver267 commented 1 year ago

I have a feeling that this step is going to be the key towards approaching CAI, especially the idea of injecting external knowledge...

0x000011b commented 1 year ago

Another report I got: the model calling the user random, unrelated names. Might be worth running some NLP toolkit over the data to see if there are any significant imbalances towards certain names, and trying to clean that up.
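A rough sketch of what such an imbalance check could look like, using only the standard library. The name list, the function names and the 50% threshold are all assumptions made for illustration; a real pass would more likely use a named-entity recognizer from an NLP toolkit instead of a fixed list.

```python
import re
from collections import Counter

# Hypothetical mini-list of first names (an assumption for this sketch).
# A real run would use an NER model or a much larger name list.
KNOWN_NAMES = {"alice", "bob", "john", "sarah"}


def count_names(utterances, known_names=KNOWN_NAMES):
    """Count how often each known name appears across all utterances."""
    counts = Counter()
    for text in utterances:
        # Split into alphabetic tokens and match case-insensitively.
        for token in re.findall(r"[A-Za-z]+", text):
            lowered = token.lower()
            if lowered in known_names:
                counts[lowered] += 1
    return counts


def overrepresented(counts, threshold=0.5):
    """Return names whose share of all name mentions is at least `threshold`."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [name for name, n in counts.most_common() if n / total >= threshold]


sample = ["Hi John!", "John, how are you?", "Thanks, Alice.", "john said hello"]
counts = count_names(sample)
print(counts.most_common())
print(overrepresented(counts))
```

Names flagged this way could then be replaced with a placeholder or rebalanced before training, depending on what the data actually looks like.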

0x000011b commented 1 year ago

As for non-conversational data: a recent paper by Google seems to indicate that starting from an instruction-tuned model rather than a regular pre-trained LM might actually improve downstream task performance after further fine-tuning. Twitter thread about this, relevant arXiv paper and code repository.

lloorree commented 1 year ago

With respect to more data sources, some thoughts:

I have time to help out with either of these if it would be useful.

0x000011b commented 1 year ago

Hey @lloorree! Indeed, forums seem like they might be a good source. The community has contributed around 350MB of forum posts that I'll attempt to write some parsing code for. It's all in SQLite databases, so the code will be somewhat similar to the Discord DHT parsing stuff I wrote. If you're interested in helping out with that, let me know and I can send you a sample.
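For anyone picking this up, a minimal sketch of reading forum posts out of a SQLite database and grouping them into threads. The `posts` table and its `post_id`, `thread_id`, `author` and `body` columns are hypothetical, since the real schema isn't described here; the demo builds a small in-memory database so that it runs on its own.

```python
import sqlite3
from itertools import groupby


def iter_threads(conn):
    """Yield (thread_id, [(author, body), ...]) with posts in posting order.

    Assumes a hypothetical `posts` table; adjust the query to the real schema.
    """
    cursor = conn.execute(
        "SELECT thread_id, author, body FROM posts ORDER BY thread_id, post_id"
    )
    # Rows arrive sorted by thread_id, so groupby can split them into threads.
    for thread_id, rows in groupby(cursor, key=lambda row: row[0]):
        yield thread_id, [(author, body) for _, author, body in rows]


def build_demo_db():
    """Create a tiny in-memory database that follows the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "post_id INTEGER PRIMARY KEY, thread_id INTEGER, author TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (thread_id, author, body) VALUES (?, ?, ?)",
        [(1, "ann", "Hello"), (2, "cat", "New thread"), (1, "ben", "Hi")],
    )
    return conn


conn = build_demo_db()
for thread_id, posts in iter_threads(conn):
    print(thread_id, posts)
```

For the real dumps, `sqlite3.connect` would be pointed at the database file instead of `:memory:`.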

As for TV stuff, it didn't seem that great to me because a big portion of the context is the stuff that's happening on screen, so if you take just the text/dialogue, it's usually pretty bland and uninteresting.

lloorree commented 1 year ago

Hey! That would be perfect, thanks. The other main thing I'd need to get started would be a few lines from one of the correctly parsed output files, to double-check that what I get when running locally is right.

Fair enough about the dialogue.

0x000011b commented 1 year ago

@lloorree: I created an issue to track the implementation of the forum datasets (#4). Let me know if you're interested in giving it a shot. If you reach out to me on Matrix, I can help get you situated with the data-toolbox repo and share some example files.

TearGosling commented 1 year ago

Old discussion, very informative, but closing so that I can tidy up this repo.