The scope of this task is to implement support for Enjin forum dumps in the data-toolbox.
Source file formats
Source files are SQLite3 databases generated by the encuum utility. Please reach out to me in private (via email, Matrix, Discord, etc.) for example files if you're interested in tackling this implementation.
Implementation details
An `EnjinDataset` class should be implemented under `toolbox/datasets/enjim.py`, following the general format of the other datasets. Threads should map to `Episode`s, and posts within threads should map to `Turn`s within the `Episode`.
An `EnjinVDM` should then be implemented under `toolbox/modules/enjim_pdm.py`. Feel free to look at the `light_pdm.py` file in that same folder for an example.
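As a starting point, the dataset side could look roughly like the sketch below. The table and column names (`threads`, `posts`, `thread_id`, `post_time`, etc.) are assumptions, since I don't have the encuum schema in front of me, and the `Turn`/`Episode` dataclasses stand in for whatever the toolbox actually defines — check both against a real dump and the existing datasets before relying on this.

```python
import sqlite3
from dataclasses import dataclass

# NOTE: table and column names below are assumptions; verify them
# against an actual encuum-generated dump before using this.

@dataclass
class Turn:
    author: str
    text: str

@dataclass
class Episode:
    thread_title: str
    turns: list[Turn]

def load_episodes(db_path: str) -> list[Episode]:
    """Map each forum thread to an Episode and each post to a Turn."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    episodes = []
    for thread in conn.execute("SELECT id, title FROM threads"):
        turns = [
            Turn(author=row["author"], text=row["content"])
            for row in conn.execute(
                "SELECT author, content FROM posts "
                "WHERE thread_id = ? ORDER BY post_time",
                (thread["id"],),
            )
        ]
        episodes.append(Episode(thread_title=thread["title"], turns=turns))
    conn.close()
    return episodes
```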
A lot of data processing will then need to take place. Off the top of my head:
- BBCode will need to be converted to its nearest Markdown representation, or dropped entirely if it's too excessive (e.g. different font colors, images)
- Irrelevant threads and posts need to be pruned (e.g. non-roleplay threads, announcements, and so on)
- Overly short posts will need to be carefully pruned (these are usually OOC talk)
...and maybe more. This will probably be the trickiest part. Feel free to reach out, we can discuss these points here or in the Matrix.
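To make the first two cleaning steps concrete, here is a minimal sketch of a BBCode-to-Markdown pass plus a length-based pruning check. The tag list and the 40-character threshold are guesses for illustration, not a complete BBCode grammar or a tuned cutoff — real dumps will need more rules and tuning.

```python
import re

# Minimal BBCode-to-Markdown rules; coverage here is a starting point,
# not a complete grammar.
_BB_RULES = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.I | re.S), r"**\1**"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.I | re.S), r"*\1*"),
    (re.compile(r"\[url=(.*?)\](.*?)\[/url\]", re.I | re.S), r"[\2](\1)"),
    # Formatting with no Markdown equivalent gets dropped, keeping the text.
    (re.compile(r"\[color=.*?\](.*?)\[/color\]", re.I | re.S), r"\1"),
    (re.compile(r"\[size=.*?\](.*?)\[/size\]", re.I | re.S), r"\1"),
    # Images are dropped entirely in this pass.
    (re.compile(r"\[img\].*?\[/img\]", re.I | re.S), ""),
]

def bbcode_to_markdown(text: str) -> str:
    for pattern, repl in _BB_RULES:
        text = pattern.sub(repl, text)
    return text.strip()

MIN_POST_CHARS = 40  # threshold is a guess; tune against real data

def keep_post(text: str) -> bool:
    """Prune posts too short to plausibly be in-character roleplay."""
    return len(bbcode_to_markdown(text)) >= MIN_POST_CHARS
```

A dedicated BBCode parser would handle nesting better than regexes, but this is enough to see the shape of the problem.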
Each forum has separate subforums for characters and actual roleplays. The character posts will be cross-referenced to add personas to the prompts.
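The cross-referencing could be as simple as keying character sheets by author name, as in this sketch. Matching posts in the character subforum to roleplay participants by author name is an assumption about how these forums are organized; the "keep the longest sheet per author" rule is just a placeholder heuristic.

```python
def build_personas(character_posts):
    """Map each author to a character sheet from the character subforum.

    `character_posts` is an iterable of (author, sheet_text) pairs; if an
    author posted several sheets, keep the longest as a crude heuristic.
    """
    personas = {}
    for author, sheet in character_posts:
        if author not in personas or len(sheet) > len(personas[author]):
            personas[author] = sheet
    return personas

def attach_personas(turn_authors, personas):
    """Collect persona strings for the authors appearing in one thread."""
    return {a: personas[a] for a in turn_authors if a in personas}
```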
Posts that are too long will be split by a character limit and treated as the character speaking multiple times in a row.
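A word-boundary split along those lines might look like this (the 1024-character default is arbitrary; note a single word longer than the limit would still produce an oversized chunk):

```python
def split_post(text: str, limit: int = 1024) -> list[str]:
    """Split an over-long post into chunks of at most `limit` characters,
    breaking at word boundaries so no word is cut in half."""
    if len(text) <= limit:
        return [text]
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty.
        if current and length + len(word) + 1 > limit:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + (1 if len(current) > 1 else 0)
    if current:
        chunks.append(" ".join(current))
    return chunks
```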
The first post in the thread, plus a summary of the thread so far, forms the scenario.
Individual prompts will be made by a rolling window over the thread to keep them from being too large.
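Putting the last two points together, prompt construction might be sketched like this. The window size, the dict shape, and the `[Summary]` prefix are all placeholders for illustration; `summarize` stands in for whatever summarizer ends up being used.

```python
def build_prompts(turns, summarize, window=8):
    """Build one prompt per thread position using a rolling window.

    The scenario is the opening post plus a summary of the turns that
    have already scrolled out of the window; `summarize` is any
    callable str -> str.
    """
    opening = turns[0]
    prompts = []
    for end in range(1, len(turns)):
        start = max(1, end - window)
        dropped = turns[1:start]  # turns no longer inside the window
        scenario = opening
        if dropped:
            scenario += "\n[Summary] " + summarize(" ".join(dropped))
        prompts.append({"scenario": scenario, "turns": turns[start:end + 1]})
    return prompts
```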
Too-long character descriptions and scenarios will be run through the philschmid/bart-large-cnn-samsum summarizer. I found it after testing a handful of summarizers for generating summaries from TV scripts; this task is very similar, so it should work well here too.
At some point this should use vector databases and lookups instead of summarization, but I'm not familiar enough with them to do it that way yet, so TBD.
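The length gate around the summarizer could look like the sketch below. The 2000-character cap is a placeholder, and the commented-out Hugging Face wiring is left unexecuted here so the sketch doesn't require `transformers` to be installed.

```python
MAX_CHARS = 2000  # rough cap; tune to the actual prompt budget

def maybe_summarize(text: str, summarize, max_chars: int = MAX_CHARS) -> str:
    """Run over-long descriptions/scenarios through a summarizer,
    leaving anything under the cap untouched."""
    if len(text) <= max_chars:
        return text
    return summarize(text)

# With Hugging Face transformers installed, the intended model would be
# wired up roughly like this (not executed in this sketch):
#
#   from transformers import pipeline
#   summarizer = pipeline("summarization",
#                         model="philschmid/bart-large-cnn-samsum")
#   summarize = lambda t: summarizer(t, truncation=True)[0]["summary_text"]
```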
There don't seem to be images in the dataset, but if there turn out to be a lot of them somewhere they'll be replaced with a constant image tag and a generated description of them.
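If image replacement does turn out to be needed, a first pass might be the sketch below. The `<image>` tag is an arbitrary placeholder, and `describe` stands in for a hypothetical captioning step that doesn't exist yet.

```python
import re

IMAGE_TAG = "<image>"  # constant placeholder tag; format is arbitrary

_IMG_RE = re.compile(r"\[img\](.*?)\[/img\]", re.I | re.S)

def replace_images(text: str, describe=None) -> str:
    """Swap BBCode image embeds for a constant tag, optionally appending
    a description from `describe(url)` (captioning model TBD)."""
    def _sub(match):
        if describe is None:
            return IMAGE_TAG
        return f"{IMAGE_TAG} ({describe(match.group(1))})"
    return _IMG_RE.sub(_sub, text)
```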