chonger opened 1 week ago
Following up on an offline conversation regarding the above:
If we are able to make our own AI formatter, we, and any other user, should be able to quickly and automatically convert a play with whatever formatting into a play formatted for cueballer. The user would need to know how to use AI and pay a small fee, but they would save the two to four hours it takes to format a full script for cueballer. Obvious benefit.
For our purposes, with Shakespeare's plays as the core, the benefit would be quickly automating the conversion from the format used by shakespeareswords.com into the cueballer format. If we imagine an end goal of creating downloadable cueballer versions of all 39 plays hosted by shakespeareswords.com, in both folio and modern text (to be hosted on their site?), that's 78 texts to format. I definitely welcome AI in that process, particularly since we'll still need to go through and assign shared lines and stage directions manually.
This represents our best current proposal for quickly combining shakespeareswords.com's asset of authoritative text with our asset of cueballer.
To start building a training set of scenes in varied formats, I've uploaded 11 public-domain scenes from different plays to a new branch, "random-scenes", with the goal of reaching 50.
First, am I doing this efficiently and in the correct location?
Second, should I continue to pull as randomly as I have from different websites with different formats, or focus on Project Gutenberg or shakespeareswords.com to train for those sources specifically? I know we want to check in with the Crystals before running any of their texts through AI.
Hey folks, here is my recommended game plan for our goal of taking arbitrarily formatted scripts and converting them all to a single format of our choosing. We will begin to integrate our solutions to the other issues here (stage directions, group lines, italics) in step two.
Step one - bring me some plays. Actually, scenes. I need them in plain-text format, for which the best definition is "does it look right if you open it up in Notepad?" The reason they need to be scenes is just so that they're short enough; I think we can always produce our data scene by scene. When you have the files, check them into this repo if you can (make a PR, or I can add them).
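If it helps, here's a minimal Python sketch (not tied to any existing tooling here) for sanity-checking that a scene file really is plain text, i.e. it decodes as UTF-8 and has no obvious HTML or RTF markup; the file names on the command line are whatever you've collected:

```python
import sys
from pathlib import Path

def looks_like_plain_text(path: Path) -> bool:
    """Rough check: decodes as UTF-8 and has no obvious HTML/word-processor markup."""
    try:
        text = path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return False
    # Obvious signs the file was saved as HTML or RTF rather than plain text.
    markers = ("<html", "<div", "<span", "{\\rtf")
    lowered = text.lower()
    return not any(m in lowered for m in markers)

if __name__ == "__main__":
    # Usage: python check_plain_text.py scenes/*.txt
    for name in sys.argv[1:]:
        p = Path(name)
        print(f"{p.name}: {'ok' if looks_like_plain_text(p) else 'needs cleanup'}")
```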
Step two - we write a clever prompt that tricks GPT-4 into converting these scenes into our format. This will cost something like $5, which I will cover.
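As a rough sketch of what that step-two call could look like, assuming the openai Python SDK and an OPENAI_API_KEY in the environment; the FORMAT_SPEC text, the prompt wording, and the random-scenes/ file layout are placeholders, not anything already agreed on:

```python
# Sketch of the step-two conversion call, assuming the openai Python SDK (v1).
# FORMAT_SPEC and the prompt wording are placeholders -- the real cueballer
# format rules would go there.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

FORMAT_SPEC = """\
(placeholder) Describe the cueballer format here: how character names,
line breaks, stage directions, and scene headings should look.
"""

def convert_scene(raw_scene: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You reformat play scripts. Output only the reformatted scene.\n" + FORMAT_SPEC},
            {"role": "user", "content": raw_scene},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical layout: raw scenes in random-scenes/, converted output alongside.
    for scene_path in Path("random-scenes").glob("*.txt"):
        converted = convert_scene(scene_path.read_text(encoding="utf-8"))
        scene_path.with_suffix(".cueballer.txt").write_text(converted, encoding="utf-8")
```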
Step three - we take that data and use it to do something called LoRA finetuning of Llama-3-8B, which I should be able to do on my home rig for free! Otherwise it might cost another few bucks, which I will cover.
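For reference, a LoRA finetune along these lines might look roughly like the following, assuming the Hugging Face stack (datasets, peft, trl) and a hypothetical pairs.jsonl of raw/formatted scene pairs produced in step two; these library APIs shift between releases, so treat it as a shape rather than a drop-in script:

```python
# Sketch of step three: LoRA finetuning of Llama-3-8B on the GPT-4 conversions.
# pairs.jsonl (one {"raw": ..., "formatted": ...} object per scene) and the
# prompt template below are hypothetical.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

MODEL = "meta-llama/Meta-Llama-3-8B"  # gated on the Hub; needs an access token

def to_text(example):
    # Fold each raw/formatted pair into a single training string.
    return {"text": f"### Raw scene:\n{example['raw']}\n\n### Cueballer format:\n{example['formatted']}"}

dataset = load_dataset("json", data_files="pairs.jsonl", split="train").map(to_text)

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=MODEL,                 # SFTTrainer loads the base model itself
    train_dataset=dataset,
    peft_config=peft_config,     # only the small LoRA adapters get trained
    args=SFTConfig(
        output_dir="llama3-cueballer-lora",
        dataset_text_field="text",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
    ),
)
trainer.train()
```

The reason this is plausible on a home rig is that LoRA freezes the base model weights and trains only small adapter matrices, so the memory and compute footprint is a fraction of a full finetune.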
Step four - we convert a bunch of other data and check how well we do. We will compare our finetuned Llama with GPT-4 on a few scenes, and if our method is just as good, we've built a very cool thing. If not, we will still have a prompt that can be used with GPT-4, although people will need to pay instead of it being free. I'm pretty confident this will work as a fallback, and at least we'll know whether it does by step two anyhow.
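One cheap way to do that comparison, sketched below with only the standard library: diff the finetuned-Llama output against the GPT-4 output for each scene and flag the ones that diverge most for manual review (the directory names are hypothetical):

```python
# Sketch of step four: a quick similarity check between the finetuned-Llama and
# GPT-4 conversions of the same scenes, to flag the ones worth eyeballing.
# Final judging would still be done by hand.
from difflib import SequenceMatcher
from pathlib import Path

GPT4_DIR = Path("converted-gpt4")
LLAMA_DIR = Path("converted-llama")

for gpt4_path in sorted(GPT4_DIR.glob("*.txt")):
    llama_path = LLAMA_DIR / gpt4_path.name
    if not llama_path.exists():
        print(f"{gpt4_path.name}: missing Llama output")
        continue
    ratio = SequenceMatcher(
        None,
        gpt4_path.read_text(encoding="utf-8"),
        llama_path.read_text(encoding="utf-8"),
    ).ratio()
    print(f"{gpt4_path.name}: {ratio:.2%} similar")
```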
Sound good? Send me scenes! Some Shakespeare scenes of course, but some more seldom-seen scenes would also be so super.