NaNoGenMo / 2016

National Novel Generation Month, 2016 edition.
https://nanogenmo.github.io

The Business of .com Domain Names #127

Open SwartzCr opened 7 years ago

SwartzCr commented 7 years ago

Whew - I definitely made ...something...

Repository: https://github.com/SwartzCr/nanogenmo
Generated Novel: https://github.com/SwartzCr/nanogenmo/blob/master/generated_novel.md

The novel takes its structure and sentences from The Business of Domain Names and fills them in using a Markov chain. The corpus for the Markov chain is the list of all .com domain names as presented in Daniel Temkin's Internet Directory. The resulting sentences should look similar to randomly generated domain names.
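As a rough illustration of the idea above (not the code from the repository), here is a minimal sketch of filling a template sentence with Markov-generated words. It assumes that "filling in" means replacing each source sentence with a generated one of the same word count; that interpretation, the function names `build_chain`, `generate`, and `fill_template`, and the tiny corpus in the usage note are all hypothetical.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each pair of consecutive words to the list of words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            chain[(a, b)].append(c)
    return chain


def generate(chain, length, rng):
    """Walk the chain until `length` words exist, restarting at dead ends."""
    keys = sorted(chain)
    out = list(rng.choice(keys))  # seed with a random two-word prefix
    while len(out) < length:
        followers = chain.get((out[-2], out[-1]))
        if followers:
            out.append(rng.choice(followers))
        else:
            # No known continuation: jump to a fresh random prefix.
            out.extend(rng.choice(keys))
    return out[:length]


def fill_template(template_sentence, chain, rng):
    """Assumed behaviour: keep the template's word count, swap in generated words."""
    n = len(template_sentence.split())
    return " ".join(generate(chain, n, rng)).capitalize() + "."
```

For example, with a chain built from a few already-split domain names such as "cheap pizza online now", calling `fill_template("The domain name business is old.", chain, random.Random(0))` yields a six-word sentence drawn entirely from the domain vocabulary.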

The code for splitting domain names into a form suitable for building a Markov chain is not included. I used Peter Norvig's n-gram code to split domain names into sentences, split the sentences into trigrams, sorted them with the Unix sort command, and counted their frequencies with the Unix uniq command. From there I squashed them into a pair array (as suggested by Darius Bacon) rather than the usual nested dictionary, so that the chain would fit in memory. The input file of trigrams was around 1.6GB of text and the pair array was similarly sized, but the projected size of the nested dictionaries was over 8GB.
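The pipeline described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the post does not show the pair array's layout, so the version here guesses at a sorted sequence of two-word prefixes with a parallel sequence of (next word, count) entries, searched by bisection. The function names are hypothetical, and plain Python lists are used for readability; a version meant to hold gigabytes of trigrams would need a more compact representation.

```python
import bisect
import random
from itertools import groupby


def trigrams(sentence):
    """Split a sentence into overlapping three-word tuples."""
    words = sentence.split()
    return list(zip(words, words[1:], words[2:]))


def count_sorted(items):
    """Sort, then count runs of equal items (similar in spirit to `sort | uniq -c`)."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(sorted(items))]


def build_pair_array(counted):
    """Assumed layout: parallel sorted lists of prefixes and (word, count) payloads."""
    prefixes = [(w1, w2) for (w1, w2, _w3), _n in counted]
    payloads = [(w3, n) for (_w1, _w2, w3), n in counted]
    return prefixes, payloads


def followers(prefixes, payloads, prefix):
    """Binary-search the sorted prefixes for every entry matching `prefix`."""
    lo = bisect.bisect_left(prefixes, prefix)
    hi = bisect.bisect_right(prefixes, prefix)
    return payloads[lo:hi]


def next_word(prefixes, payloads, prefix, rng):
    """Pick a follower weighted by its trigram count, or None if the prefix is unknown."""
    options = followers(prefixes, payloads, prefix)
    if not options:
        return None
    words, weights = zip(*options)
    return rng.choices(words, weights=weights, k=1)[0]
```

Compared with a nested dictionary, this layout stores each trigram once in flat sequences and pays for lookups with a binary search instead of a hash, which is one plausible reason such a structure would stay close to the size of the input.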

The final novel has some weird words in it. This is an artifact of a few things. First, the n-gram splitting code I used is probabilistic and doesn't guarantee correct sentence splitting. Second, the source domain names aren't guaranteed to contain actual words at all, as opposed to being just random strings. Third, the corpus of words used to generate the probabilities for splitting these domain names also comes from the internet, so there's no guarantee that those are words either. That said, if a word looks weird, feel free to look it up in the original corpus, and you may be surprised to find that it is actually a real domain: http://internetdirectory.info/
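To make the first two points concrete, here is a toy sketch of probabilistic word splitting in the general style of unigram segmentation. It is not Peter Norvig's code: the word counts in `COUNTS` are invented, the unknown-word penalty is an assumption, and the names `word_logprob` and `segment` are hypothetical.

```python
import math
from functools import lru_cache

# Hypothetical toy word counts; a real splitter would use a large web-derived list.
COUNTS = {"cheap": 50, "pizza": 30, "online": 40, "on": 200, "line": 60, "che": 1, "ap": 1}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log-probability of a word; unknown strings get a length-based penalty (assumed)."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the highest-scoring split of `text`."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((word_logprob(first) + rest_score, (first,) + rest_words))
    return max(candidates)
```

With these toy counts, `segment("cheappizza")` recovers "cheap" and "pizza", while a name like "xqzonline" keeps "xqz" as a token because no split of it scores better. Non-words of that kind are the sort that can end up in the chain's corpus.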