Open dubbl opened 2 years ago
After some troubles with getting pycorpora to work the current state is still pretty close to the first one. I have added some more structure to the output, made a Novel and Chapter class, and added the first pseudorandomly picked adjective to the title. To be able to have deterministic output for the same repository (and state) I use the latest commit hash to seed the PRNG and added an optional command line argument to override it.
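A minimal sketch of how such seeding could look, assuming the commit hash is already available as a string. The function names, the `--seed` flag and the adjective list are placeholders for illustration, not taken from the Novelopment code:

```python
import argparse
import random


def build_rng(commit_hash, seed_override=None):
    """Return a PRNG seeded by the override if given, else by the commit hash."""
    seed = seed_override if seed_override is not None else commit_hash
    # random.Random accepts a str seed and derives its state deterministically.
    return random.Random(seed)


def parse_args(argv):
    # Hypothetical flag name; the real command line argument may differ.
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", default=None, help="override the commit-hash seed")
    return parser.parse_args(argv)


# Stand-in word list; the post uses pycorpora for the real adjectives.
ADJECTIVES = ["skyrocketed", "remarkable", "excellent", "wondrous"]

args = parse_args([])  # no override given, so the commit hash is used
rng = build_rng("0123abc", args.seed)  # "0123abc" is a made-up hash
adjective = rng.choice(ADJECTIVES)
```

With the same hash (or the same override) the generator makes the same choices, so the same repository state gives the same novel.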
I also read the Wikipedia page on Natural Language Generation, being pointed there by Ehud Reiter's blogpost How do I Learn about NLG?. I would like to structure my code somewhat following the different stages listed in there.
While looking to debug my packaging issues, I also found that https://github.com/MichaelPaulukonis had the same idea already for the NaNoGenMo 2015! I didn't look too much into it now, to not be influenced, but I will definitely check it out at the end of the month :) At least the preview snippet looked remarkably like my current output, when applied to its own project. :D
The skyrocketed story of novelopment
By Novelopment 0.1
All commits
dubbl be like setup basic project structure
dubbl be like add main.py and parse first repository
dubbl be like add apache 2.0 license
dubbl be like add more structure to novel output
dubbl be like use pycorpora for random adjective in title
dubbl be like add README.md
dubbl be like add seeding of PRNG
64 words and counting!
It's been 14 days since my last update, because I didn't write a log last weekend - but that doesn't mean no code was written!
Last weekend I mainly worked on the "repository miner" and "content determiner". The repository miner basically goes through all commits, categorizes them (by size, whether it's a "fix" commit or not) and tracks who authored and committed them. This part is not really part of the Natural Language Generation phases, because they assume that you already have your data ready.
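To make the categorization concrete, here is a rough sketch under my own assumptions: the class name, the size thresholds and the keyword heuristic are all invented for illustration and are not the project's actual rules.

```python
from dataclasses import dataclass


@dataclass
class MinedCommit:
    message: str
    author: str
    committer: str
    lines_changed: int

    @property
    def size(self):
        # Thresholds are arbitrary placeholders, not the project's values.
        if self.lines_changed < 10:
            return "tiny"
        if self.lines_changed > 500:
            return "huge"
        return "normal"

    @property
    def is_fix(self):
        # Crude keyword heuristic on the words of the commit message.
        words = self.message.lower().split()
        return any(word.startswith(("fix", "bug")) for word in words)


# Line counts here are made up for the example.
commits = [
    MinedCommit("setup basic project structure", "dubbl", "dubbl", 120),
    MinedCommit("fix linting issues", "dubbl", "dubbl", 4),
]
summary = [(commit.size, commit.is_fix) for commit in commits]
```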
The next phase, "Content determination", uses the mined data to identify key events (commits), important dates and the protagonists of the story. In my case those are the dates of the first commits, the first time that somebody else contributed to the project, the first person to commit something, etc. Right now the output is just a simple dictionary - I might come up with something better later, but it works for now.
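A small sketch of what such a dictionary could contain, assuming the commits arrive as `(date, author)` tuples sorted oldest first. The key names and the second author `guest` are hypothetical:

```python
from datetime import date


def determine_content(commits):
    """commits: list of (date, author) tuples, oldest first."""
    first_date, first_author = commits[0]
    content = {
        "first_commit_date": first_date,
        "first_committer": first_author,
        "second_contributor": None,
        "second_contributor_date": None,
    }
    # The first commit by anybody other than the first author marks
    # the moment somebody else joined the project.
    for day, author in commits:
        if author != first_author:
            content["second_contributor"] = author
            content["second_contributor_date"] = day
            break
    return content


history = [
    (date(2022, 11, 3), "dubbl"),
    (date(2022, 11, 6), "dubbl"),
    (date(2022, 11, 20), "guest"),  # invented contributor for the example
]
content = determine_content(history)
```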
Today I worked on 4 stages further down the pipeline: Document planning, aggregation, lexical choice, and realization (mostly the first and the last one though).
Document planning means to transform the determined content into a story that makes sense. My document planning is pretty basic right now: I let the novel start with a chapter on the beginnings of the repository - looking at the important event of the first commit and the first committer. It is here where I first introduce the concept of a Sentence - even though these Sentences don't have to end up as actual sentences in the novel, because of...
Aggregation describes the merging of output that is similarly structured, to make reading it easier and avoid repetitions. For example:
dubbl be like setup basic project structure
dubbl be like add main.py and parse first repository
could become
dubbl be like setup basic project structure and add main.py and parse first repository
Currently my aggregation is in my realizer (probably need to move it), and just checks if the next sentence happens on the same day - in that case it merges the two sentences and removes the (now repeated) time definition from the second sentence.
Lexical choice is also basic, but existing: I created a dictionary with synonyms for some of the words I use. By defining the synonyms manually I can be reasonably sure that the end-result still makes sense.
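As an illustration, a hand-written synonym dictionary combined with a seeded PRNG could look roughly like this. The entries and the `get_word` signature are assumptions for the sketch, not the actual lexicon:

```python
import random

# Manually curated, so every option is known to fit the sentence.
SYNONYMS = {
    "author": ["author", "create", "craft", "compose", "write", "add"],
    "start": ["start", "begin"],
}


def get_word(word, rng):
    """Pick a synonym for `word`, or return it unchanged if none are listed."""
    return rng.choice(SYNONYMS.get(word, [word]))


rng = random.Random("0123abc")  # made-up seed, e.g. a commit hash
verb = get_word("author", rng)
```

Words without an entry pass through untouched, so the dictionary can grow one word at a time.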
Finally the realizer: Here I use https://github.com/bjascob/pySimpleNLG to convert my self-made sentence structure into actual human readable sentences. This is probably the most hard-core NLG stuff for me that I struggle with a bit, but I'm also (re-)learning some grammatical concepts that I didn't have to think about since primary school. Also, often when I misuse a feature, funny output is generated - like this Gollum-esque piece of art: "Novel starts. Dubbl authorses the first commit."
The dimly remarkable story of novelopment
By Novelopment 0.1
Introduction
While you may have been enticed to grab this book because of its title "The dimly remarkable story of novelopment", this is actually the story of 1 developer who came together to build novelopment.
Humble beginnings
The saga started when dubbl authored the first commit in 2022-11-03.
IT IS DONE!
With the last commit at Wed Nov 30 23:58:02 2022 +0100 my NaNoGenMo 2022 is over.
It might be best to tell my story through the Novelopment novel generator - it was made for this after all:
The bitterly excellent story of novelopment
Novelopment 1.0
Introduction
While you may have been enticed to grab this book because of its title "The bitterly excellent story of novelopment", this is actually the story of 1 human building novelopment in 35 commits.
Humble beginnings
The saga started whilst first time contributor dubbl authored a commit with the message "setup basic project structure", a commit with the message "add main.py and parse first repository" and a commit claiming to "add apache 2.0 license" on Thursday, November 3rd 2022. Around 3 days down the road on November 6th 2022 dubbl authored a commit described as "add more structure to novel output", a tiny commit claiming to "use pycorpora for random adjective in title", a tiny commit with the message "add README.md" and a tiny commit claiming to "add seeding of PRNG".
Working on it
On Sunday, November 13th 2022 aforementioned dubbl created a tiny commit called "add handling of title for local repositories", a commit claiming to "add and use black code formatter", a commit claiming to "data mine the repository for events (commits) and their actors", a tiny commit described as "introduce simplenlg to pluralize actor_word", a tiny bug fixing commit with the message "fix linting issues", a commit called "add initial content determiner", a tiny commit called "update readme with new parameters" and a tiny defect fixing commit called "rename src to novelopment, fix logging". About 1 week later on November the 20th 2022 they crafted a commit with the message "add basic document planner and realizer" and a commit claiming to "add very basic sentence aggregation in realizer". More than 1 week later on November 28th 2022 dubbl composed a commit called "start work on aggregator", a commit called "add complementizer handling to realizer" and a commit claiming to "handle multiple complements/objects". The very same one wrote a commit claiming to "add aggregation on time and cue phrase support", a tiny commit claiming to "handle multi-line commit messages" and a tiny commit claiming to "exclude merge commits" on Tuesday, November 29th 2022. Exactly 1 day down the road dubbl adds a commit with the message "add document planning for second committer and the end", a commit described as "start referring expressions generator", a commit called "move get_word to lexicon", a commit called "add entity description generator", a commit with the message "add time expressions", a tiny commit called "conclude the sagas final sentence", a tiny commit described as "detect and describe reverting commits", a commit with the message "updates deps, add jinja2", a commit called "add html rendering", a commit claiming to "add ebooklib dependency" and a commit with the message "add epub export option" on Wednesday, November 30th 2022.
The end (for now)
On November 30th 2022 the previously mentioned dubbl composes a tiny commit with the message "set version to 1.0" and for now the coverage concludes.
As you can see - a lot of activity in the last 3 days! But not quite enough to reach the 50k words threshold... but fear not!
My main repository for testing (when I needed to test on a larger repository than Novelopment itself) was the self-hosted selfoss RSS-Reader - but that story also reaches just ~36k words currently.
For a 50k-word novel, I chose the Python web framework pallets/flask. 4000 commits are enough to tell a 57,890-word novel with 359,519 characters on 139 pages. We made it!
Exported as an epub from Novelopment 1.0, and available for download as a PDF, here is "The brightly wondrous story of flask".
📎 The brightly wondrous story of flask.pdf 📜
(Github doesn't allow .epub attachments)
Stardate: End of NaNoGenMo 2022
Since my last dev log I moved the aggregation of similar sentences into its own module. I am aggregating sentences with the same subject, predicate and time, but different "object", into one. So "X did Y at T. X did Z at T." becomes "X did Y and Z at T". In addition to that it merges sentences that have a close time relation into one sentence and connects both with "when" or "while". The output is not perfect, but it strings actions together.
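The "same subject, predicate and time" rule can be sketched as follows. This is a simplified illustration that works on plain tuples and builds the string directly, whereas the real pipeline hands structured sentences to a realizer; all names are made up:

```python
def join_objects(objects):
    """Join objects as 'a', 'a and b' or 'a, b and c'."""
    if len(objects) == 1:
        return objects[0]
    return ", ".join(objects[:-1]) + " and " + objects[-1]


def aggregate(sentences):
    """sentences: list of (subject, predicate, obj, time) tuples in story order."""
    grouped = {}
    for subject, predicate, obj, time in sentences:
        # Sentences that only differ in their object share one key.
        grouped.setdefault((subject, predicate, time), []).append(obj)
    # dicts keep insertion order, so the story order is preserved.
    return [
        f"{subject} {predicate} {join_objects(objs)} at {time}."
        for (subject, predicate, time), objs in grouped.items()
    ]


merged = aggregate([("X", "did", "Y", "T"), ("X", "did", "Z", "T")])
```

Note that this sketch also merges matching sentences that are not adjacent, which may or may not be desirable for a story.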
I was struggling a bit with the document planner: It lays out the structure of the content that we want to convey, based on the content the content determiner deems relevant. Every story has either 4 or 5 chapters, depending on whether it's just one developer or multiple. After the "Introduction" it always starts with "Humble beginnings" for the first commits, followed by "Working on it" and finally "The end (for now)". But if at some point another contributor shows up, a "Two is a crowd" chapter gets added. I think that is actually one of the most important and beautiful events in the story of a software project 🥲 This event can happen at any point during the novel of course, so handling that was a bit tricky.
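A deliberately simplified sketch of the chapter decision. It only decides which chapters exist and puts "Two is a crowd" at a fixed position, which is an assumption of the sketch; placing it wherever the second contributor actually appears is the tricky part and is not shown here:

```python
def plan_chapters(authors):
    """authors: commit authors in chronological order."""
    chapters = ["Introduction", "Humble beginnings", "Working on it"]
    # More than one distinct author means somebody else showed up.
    if len(set(authors)) > 1:
        chapters.append("Two is a crowd")
    chapters.append("The end (for now)")
    return chapters


solo_plan = plan_chapters(["dubbl", "dubbl"])
team_plan = plan_chapters(["dubbl", "guest", "dubbl"])  # "guest" is invented
```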
Finally I added the expression generator. It contains 3 steps:
Finally I added the epub exporting option. ebooklib makes it really easy to create epub files from Python.
That's it! Thanks for this year, it was loads of fun. Tons of ideas that I couldn't implement because of the time constraints, but the constraint is also what makes this so fun of course. See you next year hopefully! sleeeeeeeeeeeeeep 😪
Congratulations!
And oh, I'm in this story!
On October 25th 2017 first time contributor hugovk added a tiny commit with the message "Remove IRC notifications".
I've generated stories for https://github.com/python-pillow/Pillow/: pillow.txt (149k words) and https://github.com/python/cpython: cpython.txt (1.7m words).
The latter begins:
The patiently electrifying story of cpython
By Novelopment 1.0
Introduction
While you may have been enticed to grab this book because of its title "The patiently electrifying story of cpython", this is actually the story of 2313 contributors coming together to build cpython in 103021 commits.
Humble beginnings
The book started whilst first time contributor Guido van Rossum added a commit with the message "Initial revision" on August the 9th 1990. More than 4 weeks down the road on September 10th 1990 Guido van Rossum added a tiny commit called "Warning about incompleteness.". More than 1 week later on September the 18th 1990 they themselves authored a tiny commit called "Renamed intro and modules to tut and mod; added tbl to pipeline.".
Everyone knows that developing software using version control systems is one of the most breathtakingly exciting activities humans ever came up with. 🤩
So why not use git repositories as the base story line for novels, some more epic, some less? Every commit tells a story - stories of features and failures, of linting and loathing, resets and reverts. And where there is absolutely no story in sight the "author" might embellish a couple things here or there...
This is what I will attempt in https://github.com/dubbl/novelopment 😬
The current state after my first NaNoGenMo development evening, when applied to itself:
What a read of 23 words! Just 49977 more to go!