PygmalionAI / data-toolbox

Our data munging code.
GNU Affero General Public License v3.0

Implement data handling for RP forum dumps #4

Closed 0x000011b closed 1 year ago

0x000011b commented 1 year ago

Summary

The scope of this task is to implement support for Enjin forum dumps in the data-toolbox.

Source file formats

Source files are SQLite3 databases generated by the encuum utility. Please reach out to me in private (via email, Matrix, Discord, etc.) for example files if you're interested in tackling this implementation.
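Since the dumps are plain SQLite3 files, a quick way to get oriented before writing any dataset code is to list the tables and columns. This is only a rough sketch using Python's standard `sqlite3` module; the function name `list_tables` is made up for illustration, and nothing here assumes a particular encuum schema.

```python
import sqlite3


def list_tables(db_path: str) -> dict[str, list[str]]:
    """Return a mapping of table name -> column names for a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master holds one row per schema object in the database.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Calling `list_tables("path/to/dump.db")` on an example file should show which tables hold threads and posts, which is what the dataset class needs to know.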

Implementation details

An EnjinDataset class should be implemented under toolbox/datasets/enjin.py, following the general format of the other datasets. Threads should map to Episodes, and posts within threads should map to Turns within the Episode.
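To make the thread-to-Episode and post-to-Turn mapping concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Turn` and `Episode` dataclasses are stand-ins rather than the toolbox's real types, the class name `EnjinDatasetSketch` is hypothetical, and the `posts(thread_id, post_id, author, content)` table is a guessed schema that will need to be adjusted to what encuum actually produces.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass(frozen=True)
class Turn:
    # Hypothetical stand-in for the toolbox's Turn type.
    speaker: str
    utterance: str


@dataclass
class Episode:
    # Hypothetical stand-in for the toolbox's Episode type.
    thread_id: str
    turns: List[Turn] = field(default_factory=list)


class EnjinDatasetSketch:
    """Yields one Episode per thread, with one Turn per post."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path

    def __iter__(self) -> Iterator[Episode]:
        conn = sqlite3.connect(self.db_path)
        try:
            # ASSUMED schema: posts(thread_id, post_id, author, content).
            rows = conn.execute(
                "SELECT thread_id, author, content FROM posts "
                "ORDER BY thread_id, post_id"
            )
            current = None
            for thread_id, author, content in rows:
                # Rows arrive grouped by thread, so a new id starts a new Episode.
                if current is None or current.thread_id != str(thread_id):
                    if current is not None:
                        yield current
                    current = Episode(thread_id=str(thread_id))
                current.turns.append(Turn(speaker=author, utterance=content))
            if current is not None:
                yield current
        finally:
            conn.close()
```

Ordering by thread and then post id lets the loop build each Episode in a single pass without holding the whole dump in memory. It does assume post ids increase in posting order, which may not hold for the real data.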

An EnjinPDM should then be implemented under toolbox/modules/enjin_pdm.py. Feel free to look at the light_pdm.py file in that same folder for an example.
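For a rough idea of what a module over those Episodes could look like, here is a small sketch. It does not follow the real interface in light_pdm.py: the class name `EnjinPDMSketch`, the `(speaker, utterance)` tuple representation, the `speaker: text` output format, and the rule of skipping threads with fewer than two non-empty posts are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a Turn: (speaker, utterance).
Turn = Tuple[str, str]


class EnjinPDMSketch:
    """Turns each episode into one newline-joined training string."""

    def __init__(self, episodes: Iterable[List[Turn]]) -> None:
        self.episodes = episodes

    def __iter__(self) -> Iterator[str]:
        for turns in self.episodes:
            # Drop posts that are empty after trimming whitespace.
            lines = [
                f"{speaker}: {utterance.strip()}"
                for speaker, utterance in turns
                if utterance.strip()
            ]
            # ASSUMPTION: a thread with a single post is not a dialogue, so skip it.
            if len(lines) >= 2:
                yield "\n".join(lines)
```

The actual module should mirror whatever base class and output format the existing modules use; this only shows the general shape of iterating episodes and emitting text.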

A lot of data processing will then need to take place. Off the top of my head:

...and maybe more. This will probably be the trickiest part. Feel free to reach out, we can discuss these points here or in the Matrix.

lloorree commented 1 year ago

You can assign this to me. I'll be working on it today. I'll follow up in Matrix and put a summary here when that's done.

lloorree commented 1 year ago

Summary of the discussion:

lloorree commented 1 year ago

PR #9 is for this.

TearGosling commented 1 year ago

Very old issue, closing for now.