PygmalionAI / data-toolbox

Our data munging code.
GNU Affero General Public License v3.0

Implement VN + VNDB data handling #3

Closed 0x000011b closed 1 year ago

0x000011b commented 1 year ago

Summary

The scope of this task is to implement support for Visual Novel (VN) data in the data-toolbox, augmenting it with external information sourced from VNDB.

Source file formats

Each VN will consist of one or two files. Assuming {title} is the VN's title, there will be a mandatory {title}.txt file which contains the actual script text. Here's a made-up example:

{name of character}: {text said by character}
{name of character}: {text said by character}
{name of character}: {text said by character}
some narration text
================================================================================
{name of character}: {text said by character}
{name of character}: {text said by character}
narration text
{name of character}: {text said by character}

A line consisting of = characters separates episodes from each other.
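To make the format concrete, here is a minimal parsing sketch in Python. The names ScriptLine, parse_line and split_episodes are made up for illustration and are not part of the toolbox. It assumes that a separator line contains nothing but at least three = characters, and that a spoken line uses ": " between the character name and the text.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a separator line holds only "=" characters (three or more).
SEPARATOR_RE = re.compile(r"^={3,}\s*$")


@dataclass(frozen=True)
class ScriptLine:
    speaker: Optional[str]  # None means the line is narration
    text: str


def parse_line(raw: str) -> ScriptLine:
    """Split "Name: text" into speaker and text; anything else is narration."""
    speaker, sep, text = raw.partition(": ")
    if sep and speaker.strip():
        return ScriptLine(speaker.strip(), text.strip())
    return ScriptLine(None, raw.strip())


def split_episodes(script: str) -> List[List[ScriptLine]]:
    """Group the lines of a script into episodes, dropping blank lines."""
    episodes: List[List[ScriptLine]] = []
    current: List[ScriptLine] = []
    for raw in script.splitlines():
        if SEPARATOR_RE.match(raw):
            # Close the running episode; ignore empty ones.
            if current:
                episodes.append(current)
            current = []
        elif raw.strip():
            current.append(parse_line(raw))
    if current:
        episodes.append(current)
    return episodes
```

One caveat with this approach: a narration line that happens to contain ": " would be misread as dialog. Checking the candidate speaker against the names in the .chars.json file (when one exists) is one possible way to reduce that.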

The VN might optionally also have a {title}.chars.json file, where each key is the name of a character seen in the .txt file and each value is that character's VNDB character ID. An example:

{
 "name of character": "c67681",
 "name of character": "c52103",
 "name of character": "c11620"
}
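A rough sketch of how such a file could be loaded and sanity-checked follows. The function names are hypothetical, and the ID pattern (a "c" followed by digits) is inferred only from the example values above, so it should be treated as an assumption rather than a VNDB guarantee.

```python
import json
import re
from pathlib import Path
from typing import Dict

# Assumption: character IDs look like "c" followed by digits, as in the example.
CHAR_ID_RE = re.compile(r"c\d+")


def parse_character_ids(json_text: str) -> Dict[str, str]:
    """Parse the contents of a .chars.json file into a name -> ID mapping."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping names to VNDB IDs")
    for name, char_id in data.items():
        if not isinstance(char_id, str) or not CHAR_ID_RE.fullmatch(char_id):
            raise ValueError(f"unexpected VNDB ID for {name!r}: {char_id!r}")
    return data


def load_character_ids(script_path: Path) -> Dict[str, str]:
    """Return the mapping for {title}.txt, or {} if no .chars.json exists."""
    # with_suffix swaps only the final ".txt", giving "{title}.chars.json".
    chars_path = script_path.with_suffix(".chars.json")
    if not chars_path.is_file():
        return {}
    return parse_character_ids(chars_path.read_text(encoding="utf-8"))
```

Note that the made-up example repeats the same placeholder key three times. In a real file the names need to be distinct, because Python's json module silently keeps only the last value for a repeated key.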

Implementation details

A VisualNovelDataset class should be implemented under toolbox/datasets/visual_novels.py, following the general format of the other datasets. It should yield individual episodes (i.e. the sequences of dialog separated by the === lines), accompanied by the relevant characters if a matching .chars.json is found. Feel free to structure this however you think is best, but I recommend basing the implementation on one of the other datasets in that folder.
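As a starting point, the dataset could look roughly like the standalone sketch below. It deliberately does not inherit from the toolbox's own base class, whose interface is not described here, and the VNEpisode fields and the root directory argument are assumptions made for illustration.

```python
import json
import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Iterator, List

# Assumption: a line holding only "=" characters (three or more) ends an episode.
EPISODE_SEPARATOR = re.compile(r"^={3,}[ \t]*$", re.MULTILINE)


@dataclass
class VNEpisode:
    title: str
    lines: List[str]
    # Character name -> VNDB character ID; empty when no .chars.json exists.
    characters: Dict[str, str] = field(default_factory=dict)


class VisualNovelDataset:
    """Standalone sketch; the real class would follow the other datasets."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def __iter__(self) -> Iterator[VNEpisode]:
        for script_path in sorted(self.root.glob("*.txt")):
            title = script_path.stem
            chars_path = script_path.with_name(f"{title}.chars.json")
            characters: Dict[str, str] = {}
            if chars_path.is_file():
                characters = json.loads(chars_path.read_text(encoding="utf-8"))
            text = script_path.read_text(encoding="utf-8")
            for chunk in EPISODE_SEPARATOR.split(text):
                lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
                if lines:  # skip empty chunks around separators
                    yield VNEpisode(title, lines, characters)
```

Iterating with `for episode in VisualNovelDataset(Path("some/dir")):` would then yield one VNEpisode per separated block, each carrying the full character mapping for its VN.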

A VisualNovelPDM should then be implemented under toolbox/modules/visual_novel_pdm.py. Again, basing off of an existing PDM is likely a good call - I'd suggest looking at LightPDM. The catch here is that the generated Episodes should contain persona data whenever possible. The way this should be done is by using the VNDB character IDs specified in the matching .chars.json file to look up character information in the VNDB databases. These are made available for download here.
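The lookup step might be sketched as follows. This is only an illustration under loud assumptions: it uses Python's built-in sqlite3 module as a stand-in for wherever the VNDB data ends up being loaded, and the table name chars with columns id and description is hypothetical and would need to be checked against the actual dump schema. The fetch_personas function is likewise a made-up name.

```python
import sqlite3
from typing import Dict, Optional


def fetch_personas(
    conn: sqlite3.Connection, character_ids: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Map each character name to a description, or None if the ID is unknown.

    Hypothetical schema: a table "chars" with columns "id" and "description".
    """
    personas: Dict[str, Optional[str]] = {}
    for name, vndb_id in character_ids.items():
        # Parameterized query: the ID is passed separately, never interpolated.
        row = conn.execute(
            "SELECT description FROM chars WHERE id = ?", (vndb_id,)
        ).fetchone()
        personas[name] = row[0] if row else None
    return personas


def make_placeholder_db() -> sqlite3.Connection:
    """Build an in-memory database with placeholder rows, for trying things out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE chars (id TEXT PRIMARY KEY, description TEXT)")
    conn.execute(
        "INSERT INTO chars VALUES (?, ?)", ("c1", "Placeholder description.")
    )
    return conn
```

Returning None for unknown IDs keeps the "whenever possible" behavior simple: the PDM can attach persona data when a description comes back and emit the episode without it otherwise.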

What specific character data to include is still undecided; we can discuss this here or in the Matrix.

TearGosling commented 1 year ago

Will try to get started on this today. I've never used SQL at all before though, meaning someone else may need to handle the searching and fetching.

TearGosling commented 1 year ago

I've opened PR #8 to handle this, pending review. Because the function to grab personas from the VNDB is not yet implemented, I'm keeping this task under "in progress" rather than moving it to "under review".

TearGosling commented 1 year ago

Very old issue, inactive - closing for now.