The scope of this task is to implement support for Enjin forum dumps in the data-toolbox.
Source file formats
Source files are SQLite3 databases generated by the encuum utility. Please reach out to me in private (via email, Matrix, Discord, etc.) for example files if you're interested in tackling this implementation.
Implementation details
An `EnjinDataset` class should be implemented under `toolbox/datasets/enjim.py`, following the general format of the other datasets. Threads should map to `Episode`s, and posts within threads should map to `Turn`s within the `Episode`.
An `EnjinVDM` should then be implemented under `toolbox/modules/enjim_pdm.py`. Feel free to look at the `light_pdm.py` file in that same folder for an example.
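As a starting point, the dataset side could look roughly like the sketch below. The table and column names (`threads`, `posts`, `thread_id`, `post_time`, etc.) are assumptions, since I don't have the encuum schema in front of me, and the `Turn`/`Episode` dataclasses stand in for whatever the toolbox actually defines — check both against a real dump and the existing datasets before relying on this.

```python
import sqlite3
from dataclasses import dataclass

# NOTE: table and column names below are assumptions; verify them
# against an actual encuum-generated dump before using this.

@dataclass
class Turn:
    author: str
    text: str

@dataclass
class Episode:
    thread_title: str
    turns: list[Turn]

def load_episodes(db_path: str) -> list[Episode]:
    """Map each forum thread to an Episode and each post to a Turn."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    episodes = []
    for thread in conn.execute("SELECT id, title FROM threads"):
        turns = [
            Turn(author=row["author"], text=row["content"])
            for row in conn.execute(
                "SELECT author, content FROM posts "
                "WHERE thread_id = ? ORDER BY post_time",
                (thread["id"],),
            )
        ]
        episodes.append(Episode(thread_title=thread["title"], turns=turns))
    conn.close()
    return episodes
```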
A lot of data processing will then need to take place. Off the top of my head:
- BBCode will need to be converted to its nearest Markdown representation, or dropped entirely if it's too excessive (e.g. different font colors, images)
- Irrelevant threads and posts need to be pruned (e.g. non-roleplay threads, announcements, and so on)
- Overly short posts will need to be carefully pruned (these are usually OOC talk)
...and maybe more. This will probably be the trickiest part. Feel free to reach out, we can discuss these points here or in the Matrix.
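To make the first two cleaning steps concrete, here is a minimal sketch of a BBCode-to-Markdown pass plus a length-based pruning check. The tag list and the 40-character threshold are guesses for illustration, not a complete BBCode grammar or a tuned cutoff — real dumps will need more rules and tuning.

```python
import re

# Minimal BBCode-to-Markdown rules; coverage here is a starting point,
# not a complete grammar.
_BB_RULES = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.I | re.S), r"**\1**"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.I | re.S), r"*\1*"),
    (re.compile(r"\[url=(.*?)\](.*?)\[/url\]", re.I | re.S), r"[\2](\1)"),
    # Formatting with no Markdown equivalent gets dropped, keeping the text.
    (re.compile(r"\[color=.*?\](.*?)\[/color\]", re.I | re.S), r"\1"),
    (re.compile(r"\[size=.*?\](.*?)\[/size\]", re.I | re.S), r"\1"),
    # Images are dropped entirely in this pass.
    (re.compile(r"\[img\].*?\[/img\]", re.I | re.S), ""),
]

def bbcode_to_markdown(text: str) -> str:
    for pattern, repl in _BB_RULES:
        text = pattern.sub(repl, text)
    return text.strip()

MIN_POST_CHARS = 40  # threshold is a guess; tune against real data

def keep_post(text: str) -> bool:
    """Prune posts too short to plausibly be in-character roleplay."""
    return len(bbcode_to_markdown(text)) >= MIN_POST_CHARS
```

A dedicated BBCode parser would handle nesting better than regexes, but this is enough to see the shape of the problem.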
Each forum has separate subforums for characters and actual roleplays. The character posts will be cross-referenced to add personas to the prompts.
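The cross-referencing could be as simple as keying character sheets by author name, as in this sketch. Matching posts in the character subforum to roleplay participants by author name is an assumption about how these forums are organized; the "keep the longest sheet per author" rule is just a placeholder heuristic.

```python
def build_personas(character_posts):
    """Map each author to a character sheet from the character subforum.

    `character_posts` is an iterable of (author, sheet_text) pairs; if an
    author posted several sheets, keep the longest as a crude heuristic.
    """
    personas = {}
    for author, sheet in character_posts:
        if author not in personas or len(sheet) > len(personas[author]):
            personas[author] = sheet
    return personas

def attach_personas(turn_authors, personas):
    """Collect persona strings for the authors appearing in one thread."""
    return {a: personas[a] for a in turn_authors if a in personas}
```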
Posts that are too long will be split by a character limit and treated as the character speaking multiple times in a row.
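A word-boundary split along those lines might look like this (the 1024-character default is arbitrary; note a single word longer than the limit would still produce an oversized chunk):

```python
def split_post(text: str, limit: int = 1024) -> list[str]:
    """Split an over-long post into chunks of at most `limit` characters,
    breaking at word boundaries so no word is cut in half."""
    if len(text) <= limit:
        return [text]
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty.
        if current and length + len(word) + 1 > limit:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + (1 if len(current) > 1 else 0)
    if current:
        chunks.append(" ".join(current))
    return chunks
```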
The first post in the thread, plus a summary of the thread so far, forms the scenario.
Individual prompts will be made by a rolling window over the thread to keep them from being too large.
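Putting the last two points together, prompt construction might be sketched like this. The window size, the dict shape, and the `[Summary]` prefix are all placeholders for illustration; `summarize` stands in for whatever summarizer ends up being used.

```python
def build_prompts(turns, summarize, window=8):
    """Build one prompt per thread position using a rolling window.

    The scenario is the opening post plus a summary of the turns that
    have already scrolled out of the window; `summarize` is any
    callable str -> str.
    """
    opening = turns[0]
    prompts = []
    for end in range(1, len(turns)):
        start = max(1, end - window)
        dropped = turns[1:start]  # turns no longer inside the window
        scenario = opening
        if dropped:
            scenario += "\n[Summary] " + summarize(" ".join(dropped))
        prompts.append({"scenario": scenario, "turns": turns[start:end + 1]})
    return prompts
```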
Too-long character descriptions and scenarios will be run through the philschmid/bart-large-cnn-samsum summarizer. I found it after testing a handful of summarizers for generating summaries from TV scripts; this task is very similar, so it should work well here too.
At some point this should use vector databases and lookups instead of summarization, but I'm not familiar enough with them to do it that way yet, so TBD.
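The length gate around the summarizer could look like the sketch below. The 2000-character cap is a placeholder, and the commented-out Hugging Face wiring is left unexecuted here so the sketch doesn't require `transformers` to be installed.

```python
MAX_CHARS = 2000  # rough cap; tune to the actual prompt budget

def maybe_summarize(text: str, summarize, max_chars: int = MAX_CHARS) -> str:
    """Run over-long descriptions/scenarios through a summarizer,
    leaving anything under the cap untouched."""
    if len(text) <= max_chars:
        return text
    return summarize(text)

# With Hugging Face transformers installed, the intended model would be
# wired up roughly like this (not executed in this sketch):
#
#   from transformers import pipeline
#   summarizer = pipeline("summarization",
#                         model="philschmid/bart-large-cnn-samsum")
#   summarize = lambda t: summarizer(t, truncation=True)[0]["summary_text"]
```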
There don't seem to be images in the dataset, but if there turn out to be a lot of them somewhere they'll be replaced with a constant image tag and a generated description of them.
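If image replacement does turn out to be needed, a first pass might be the sketch below. The `<image>` tag is an arbitrary placeholder, and `describe` stands in for a hypothetical captioning step that doesn't exist yet.

```python
import re

IMAGE_TAG = "<image>"  # constant placeholder tag; format is arbitrary

_IMG_RE = re.compile(r"\[img\](.*?)\[/img\]", re.I | re.S)

def replace_images(text: str, describe=None) -> str:
    """Swap BBCode image embeds for a constant tag, optionally appending
    a description from `describe(url)` (captioning model TBD)."""
    def _sub(match):
        if describe is None:
            return IMAGE_TAG
        return f"{IMAGE_TAG} ({describe(match.group(1))})"
    return _IMG_RE.sub(_sub, text)
```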