Use a script to collect titles from specified subreddits and use this collection to learn and predict the likelihood of a given title.
This is used to predict which subreddit a chosen title should belong to. It calculates the log likelihood through unigram and bigram distributions.
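A minimal sketch of the scoring idea, assuming the unigram and bigram counts are held in plain dictionaries and using simple relative-frequency estimates with a crude unigram fallback (the actual smoothing in TitleModeling may differ):

```python
import math

def score_title(title, unigram_counts, bigram_counts):
    """Log likelihood of a title under a bigram model, falling back to
    a smoothed unigram estimate for unseen word pairs (assumed details)."""
    words = title.lower().split()
    total = sum(unigram_counts.values())
    vocab = len(unigram_counts)
    log_likelihood = 0.0
    for prev, word in zip(words, words[1:]):
        if (prev, word) in bigram_counts:
            # P(word | prev) = count(prev, word) / count(prev)
            log_likelihood += math.log(bigram_counts[(prev, word)] / unigram_counts[prev])
        else:
            # Add-one smoothed unigram probability for unseen pairs
            log_likelihood += math.log((unigram_counts.get(word, 0) + 1) / (total + vocab))
    return log_likelihood
```

A higher (less negative) score against a subreddit's distributions means the title is a better fit for that subreddit.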
python Collect_Titles.py [subreddit1] [subreddit2] [...]
python Generate_Distributions.py [subreddit1] [subreddit2] [...]
python Find_Best_Fit.py [subreddit1] [subreddit2] [...]
java RefitDictionary [total file] [output file]
java CreateDistributions [training file]
java TitleModeling [subreddit] ["phrase to check likelihood for"]
java SimWrapper [subreddit] ["phrase to check likelihood for"]
SimWrapper (Java)
Collect_Titles (Python)
Generate_Distributions (Python)
Find_Best_Fit (Python)
CreateDistributions (Java)
RefitDictionary (Java)
TitleModeling (Java)
Additional Files
The script accesses Reddit through PRAW. Reddit requires bots to make no more than 30 requests per minute, so the script is built to make a request for data every 2 seconds. Each request can fetch 100 posts. Reddit also only caches the top 1,000 posts per listing (new, hot, top).
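A minimal sketch of the collection loop, assuming PRAW credentials are already set up; the batch size and 2-second pause mirror the limits described above, though the actual Collect_Titles.py may differ:

```python
import time
import praw

# Hypothetical credentials; fill in your own Reddit app values.
reddit = praw.Reddit(client_id="YOUR_ID",
                     client_secret="YOUR_SECRET",
                     user_agent="title-collector by u/YOUR_NAME")

def collect_titles(subreddit_name, limit=1000):
    """Fetch up to `limit` titles, pausing 2 seconds per 100-post batch
    to stay under Reddit's 30-requests-per-minute cap."""
    titles = []
    for i, submission in enumerate(reddit.subreddit(subreddit_name).new(limit=limit), 1):
        titles.append(submission.title)
        if i % 100 == 0:  # PRAW pages through posts 100 at a time
            time.sleep(2)
    return titles
```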
This set of programs generates the data used for prediction into text files; if run to its full extent it can generate data for a list of N subreddits. It currently generates 3 files per subreddit: a Dictionary/Training File, a Unigram Distribution, and a Bigram Distribution. To avoid clogging up the project folder, Collect_Titles (the first script to run) also attempts to create the folder Training_Files to store all of these.
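A minimal sketch of the distribution step, assuming the training file holds one title per line; the filenames and "token count" line format here are assumptions, not the exact output of Generate_Distributions.py or CreateDistributions:

```python
import os
from collections import Counter

def generate_distributions(subreddit, folder="Training_Files"):
    """Count unigrams and bigrams from a subreddit's training file and
    write one 'token count' line per entry (formats are assumed)."""
    unigrams, bigrams = Counter(), Counter()
    with open(os.path.join(folder, f"{subreddit}_training.txt")) as f:
        for line in f:
            words = line.lower().split()
            unigrams.update(words)
            bigrams.update(zip(words, words[1:]))
    with open(os.path.join(folder, f"{subreddit}_unigrams.txt"), "w") as out:
        for word, count in unigrams.most_common():
            out.write(f"{word} {count}\n")
    with open(os.path.join(folder, f"{subreddit}_bigrams.txt"), "w") as out:
        for (w1, w2), count in bigrams.most_common():
            out.write(f"{w1} {w2} {count}\n")
```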
When running the Python scripts, if you pass in "top" as the only parameter they will scroll through the list of the top subreddits. This list was gathered from http://redditlist.com/ and copied and pasted into a file called "TopSubreddits.txt" within the Training_Files folder. A script in Converter converts it from its nasty formatting into a plain list for the Python scripts and Java apps to use (see the sketch below).
This "top" parameter will allow you to collect and learn from as much data as you want.
The program makes efficient use of hash tables to model and learn from the distributions and to perform the prediction. Its time cost has been drastically cut down since the prototype. However, the remaining stall is largely due to: 1) accessing Reddit itself for the data, as there are caps on requests per minute, and 2) creating the distributions from the initial data.
Moving from Scanner to BufferedReader, along with a wrapper for creating the distributions, has massively increased the speed of the program. As of now, once the distributions are created, the simulation can run through the top 250 subreddits and rank them by likelihood in under 1.5 seconds.
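A minimal sketch of that ranking step, assuming per-subreddit distributions are already loaded into dictionaries and reusing a score_title function like the one sketched earlier; the names here are assumptions, not the actual SimWrapper API:

```python
def rank_subreddits(title, models):
    """Score a title against each subreddit's (unigram, bigram) model and
    return subreddit names ordered from most to least likely."""
    scores = {name: score_title(title, uni, bi)
              for name, (uni, bi) in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Example: ranked = rank_subreddits("ask me anything", models)[:10]
```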
This was inspired by a desire to practice Python, a personal desire to see if it is possible to model and predict titles on Reddit, and the chance to combine this with an algorithm learned through my studies at UCSD. The algorithm was developed for an Artificial Intelligence class via probabilistic reasoning.