I string words together from the titles of scientific papers using Markov chains. Each word is sampled based on the probability that it follows the preceding word (i.e. I am a bigram model).
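The sampling step described above can be sketched roughly as follows. This is a minimal illustration, not this repository's actual `load_bigram()` or `generate_title()`: the helper names `build_bigram_sketch` and `generate_title_sketch`, the `<START>`/`<END>` markers, and the two toy titles are all assumptions made up for the example.

```r
# Hypothetical helpers for illustration only; not the repository's implementation.

# Build a table of (previous word, next word) pairs from a vector of titles.
build_bigram_sketch = function(titles) {
  pairs = lapply(strsplit(tolower(titles), " +"), function(words) {
    # Pad each title with markers so the chain knows where to begin and stop
    padded = c("<START>", words, "<END>")
    data.frame(
      prev = head(padded, -1),
      nxt = tail(padded, -1),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pairs)
}

# Walk the chain: repeatedly sample a word that followed the current word.
generate_title_sketch = function(bigram, max_words = 30) {
  word = "<START>"
  out = character(0)
  for (i in seq_len(max_words)) {
    followers = bigram$nxt[bigram$prev == word]
    # Repeated pairs appear repeatedly in `followers`, so sampling uniformly
    # from it is sampling in proportion to how often each pair was observed
    word = sample(followers, 1)
    if (word == "<END>") break
    out = c(out, word)
  }
  paste(out, collapse = " ")
}

# Toy input (made up for this sketch)
titles = c("learning with missing data", "learning with sparse gradients")
bigram = build_bigram_sketch(titles)
generate_title_sketch(bigram)
```

Storing one row per observed pair keeps the sketch short; a real corpus would more likely store counts or probabilities per pair, which gives the same sampling distribution with less memory.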
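So far, I tweet about three kinds of titles: machine learning titles from arXiv, ecology titles from PLOS journals, and "creation science" titles from Answers Research Journal.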
Additionally, @noamross thought it would be funny to create @HarrisBot, which tweets about whatever @davidjayharris tweets about. This repository contains a model based on @kara_woo's tweets as well.
In general, the machine learning titles are harder to distinguish from real titles, but the ecology titles can be much funnier (see below). Real "creation science" is, of course, indistinguishable from Markov chain output.
"an excellent example for why "no one has ever studied X" is not a sufficient reason to do a study!"
"I actually would like to read this one. Would you mind writing it?"
ML_bigram = load_bigram("data/StatMLTitles")
replicate(5, generate_title(bigram = ML_bigram))
## [1] "structured signal processing with missing data"
## [2] "determining full conditional sparse gradients with distributional estimates"
## [3] "a new york workshop on grouse and their contextual bandits"
## [4] "randomized kaczmarz algorithm and response data"
## [5] "learning with applications to colombian conflict analysis"
ecology_bigram = load_bigram("data/plos_ecology")
replicate(5, generate_title(bigram = ecology_bigram))
## [1] "nowhere to predict the composition on population persistence of smart urban environment"
## [2] "two constraints are not infection"
## [3] "randomization modeling of the short-lived annual forb dominated forests"
## [4] "climate change in the himalaya: water and indigenous burning or increase with the high-throughput sequencing"
## [5] "radiographs reveal unexpected fine-scale analysis of biodiversity"
answers_bigram = load_bigram("data/Answers_Research_Journal")
replicate(5, generate_title(bigram = answers_bigram))
## [1] "numerical simulation of peer review of any kind exist before the dodwell hypothesis"
## [2] "adam, free choice, and unification theory for studies"
## [3] "numerical simulations of retroviruses"
## [4] "numerical simulation of precipitation in yellowstone national park with a warm ocean"
## [5] "more abundant than stars"
harris_bigram = load_bigram("data/davidjayharris")
replicate(5, generate_title(bigram = harris_bigram))
## [1] "@srsupp you meant to anything today:"
## [2] "@johnmyleswhite does #rstats will take."
## [3] "@algaebarnacle @rstudioapp is there a typical to the word in #rstats' matrix multiplies"
## [4] "apple could still valuable. we live in daphnia magna. delightful work, documentation, popularity...)."
## [5] "@kara_woo @algaebarnacle"
woo_bigram = load_bigram("data/kara_woo")
replicate(5, generate_title(bigram = woo_bigram))
## [1] "@alexhanna less of trying to get any reason i get a long, multi-state road trip to was going up on an unrelated note, i'm going to shame."
## [2] "@polesasunder talk on reaching quadruple-digit tweets."
## [3] "@queerscientist oh but no sticker to recruit me a lovelier day of negging *shudder*"
## [4] "@ansonmackay definitely should!"
## [5] "@bashir9ist @markcc @rachelapaul @dr24hours @mbeisen not for an account in ca."
The code is available under The Artistic License 2.0 (see LICENSE).
The machine learning titles in the "data" folder were scraped by Philippe (@PhDP) from arXiv and are available under a Creative Commons Share Alike license (some of them are CC-BY).
The ecology titles were scraped from PLOS journals using rplos. These titles are all CC-BY.
The Answers titles are copyrighted by Answers In Genesis. Their inclusion and transformation are not an infringement of copyright in the United States, however, as they are covered by the fair use doctrine.
The HarrisBot data are @davidjayharris's tweets, minus retweets. These are hereby released as CC-BY.
Kara Woo's tweets are used with her permission.