Identifying Interesting Documents for Reddit using Recommender Techniques
A Reddit agent that automatically searches the web for interesting text content to share with a subreddit it has been trained on. The agent uses a number of machine learning and natural language processing techniques to achieve this goal.
The source code is organised into folders as follows:
The code is written for the standard CPython 2.7 interpreter; this is due to the requirements of certain libraries listed below.
The following libraries are required (in pip format):
nltk==2.0.4
matplotlib==1.3.1
numpy==1.8.1
nose==1.3.3
MySQL-python==1.2.5
goose-extractor==1.0.20
beautifulsoup4==4.3.2
lxml==3.3.5
argparse==1.2.1
scikit-learn==0.14.1
scipy==0.14.0
requests==2.3.0
These should ideally be installed in a separate virtualenv to prevent any previously installed libraries from influencing the runtime behavior.
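For example, assuming the virtualenv tool itself is already installed, an environment can be created and activated as follows (the directory name venv is arbitrary):
virtualenv venv
source venv/bin/activate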
Once a virtualenv has been set up and activated, you can install the above libraries using the following command:
pip install -r requirements.txt
Note that the open source tool WikiExtractor.py is also required but is already included amongst the source code for convenience.
The following files are executable Python scripts. In most cases, executing a script without any input parameters will display its expected usage as a help message.
In order to run any code related to Bag of Concepts, you first need to index a Wikipedia article corpus. This can be downloaded from Wikipedia in bz2 format here. You should download the pages-articles.xml.bz2 file (approx. 11GB as of August 2014).
Once downloaded, you need to set up a MySQL database. Assuming you create a database named "WikiIndex", create a file in the src directory named db.json with the following contents:
{
    "host": "localhost",
    "user": "root",
    "passwd": "passwordhere",
    "db": "WikiIndex"
}
Change the above parameters according to your database setup.
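For reference, below is a minimal sketch (not part of the project's source) of how such a configuration file could be loaded and used to open a connection with the MySQLdb driver provided by the MySQL-python package listed above. The file path and the version query are illustrative only:

import json

import MySQLdb

# Load the connection settings from db.json (assumed to be in the current directory).
with open('db.json') as config_file:
    db_config = json.load(config_file)

# The keys in db.json match the keyword arguments expected by MySQLdb.connect.
connection = MySQLdb.connect(
    host=db_config['host'],
    user=db_config['user'],
    passwd=db_config['passwd'],
    db=db_config['db'],
)

# Simple connectivity check.
cursor = connection.cursor()
cursor.execute('SELECT VERSION()')
print cursor.fetchone()[0]
connection.close()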
Once the Wikipedia dump has been downloaded and the database set up, you need to run the wiki setup script to create the database index. Typically this would be done as follows:
python wiki_setup.py pages-articles.xml.bz2 --engine MYISAM
This process takes a very long time (up to 28 hours on my SSD), as it parses roughly 16 million articles of text and saves them to disk as a database.
Once the process has completed, all Bag of Concepts methods should work as expected.
Numerous tests are included amongst the source code to ensure that functionality is running as expected after setup. These tests can be run using nose from the root directory:
nosetests
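For more detailed, per-test output, nose's verbose flag can be passed:
nosetests -v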