Sport News Retrieval

CZ4034 Information Retrieval Assignment


First, we install all requirements for the crawler using the following command:

$ pip install -r crawler/requirements.txt

The crawler will crawl the Twitter timelines of ESPN, TheNBACentral, SimpleNBAScore, ESPNNBA, and NBATV. To use this crawler, we first need to obtain an API key from the Twitter Application Management website. Next, create a file in the crawler folder. This file will not be checked in to Git.

# Replace the values with your API key values
consumer_key = 'Your key here'
consumer_secret = 'Your key here'
access_token = 'Your key here'
access_token_secret = 'Your key here'

We run the crawler by using the following command:

$ python crawler/

Five JSON files (espn_data.json, TheNBACentral_data.json, and one each for SimpleNBAScores, ESPNNBA, and NBATV) will be created in the data directory of this project.

To count the number of words in the crawled data:

$ python crawler/
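
As a rough illustration of what such a word count involves, the sketch below tallies whitespace-separated words across crawled records. It assumes each JSON file holds a list of objects with a `text` field; the field name and the `count_words` helpers are hypothetical and may not match the actual script.

```python
import json
from collections import Counter


def count_words(records, field="text"):
    """Return (total_words, per_word_counts) for a list of tweet-like dicts."""
    counts = Counter()
    for record in records:
        # Records without the field (or with an empty value) contribute no words.
        counts.update((record.get(field) or "").lower().split())
    return sum(counts.values()), counts


def count_words_in_file(path, field="text"):
    """Load one crawled JSON file (assumed to be a list of objects) and count its words."""
    with open(path, encoding="utf-8") as handle:
        return count_words(json.load(handle), field)
```

Splitting on whitespace is the simplest possible tokenisation; the real script may tokenise differently.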


First, we install all requirements for the recrawler using the following command:

$ pip install -r recrawler/requirements.txt

The recrawler is a Django server that relies on the crawler to do the recrawling for the incremental index. When a user submits a request to the server, it performs an asynchronous task to crawl the tweets and sends an update request to the Solr server. To run the recrawler server, we first need to set up the crawler, as described in the previous section. Then, we run the following commands:

$ cd recrawler
$ python migrate
$ python runserver

To test the recrawler, we just need to submit a GET or POST request to http://localhost:8000/recrawler-service/recrawl. The recrawler will return a 200 HTTP response immediately and crawl the first 200 tweets from the selected accounts asynchronously.
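
The request can be sent from any HTTP client. The sketch below uses only the Python standard library; the helper names `build_recrawl_request` and `trigger_recrawl` are made up for illustration, and the URL is the one given above.

```python
import urllib.request

RECRAWL_URL = "http://localhost:8000/recrawler-service/recrawl"


def build_recrawl_request(method="GET"):
    """Build (but do not send) a request to the recrawl endpoint."""
    return urllib.request.Request(RECRAWL_URL, method=method)


def trigger_recrawl(method="GET", timeout=10):
    """Send the request and return the HTTP status code of the response."""
    with urllib.request.urlopen(build_recrawl_request(method), timeout=timeout) as resp:
        return resp.status
```

With the recrawler server running, `trigger_recrawl()` should return as soon as the server acknowledges the request, while the crawl itself continues in the background.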


We start Solr 5.0 server and index our data by using the following commands:

$ solr start -s root_of_project/index/solr
$ post -c sport espn_data.json
$ post -c sport TheNBACentral_data.json


First, we install all requirements for the classifier using the following command:

$ pip install -r classifier/requirements.txt

You need to run the crawler at least once and make sure that the espn_data.json file is available in the data folder. The pipeline of our classifier is shown in the figure below:

espn_data.json --> --> --> some classifier -->

We first call the API to label our raw data. The following script calls the API and outputs a JSON file, espn_data_result.json, in the data folder. espn_data_result.json contains probabilities and labels for the data. The API only allows 1 request per second, so you might want to grab a coffee while waiting.

$ python classifier/
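
To illustrate the one-request-per-second limit, here is a minimal throttling sketch. It is not the project's script: `label_all` and its `label_fn` argument (whatever function performs one API call) are hypothetical names, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time


def label_all(texts, label_fn, min_interval=1.0,
              clock=time.monotonic, sleep=time.sleep):
    """Call label_fn on each text, leaving at least min_interval seconds
    between the start of consecutive calls."""
    results = []
    last_call = None
    for text in texts:
        if last_call is not None:
            # Wait out whatever remains of the interval since the previous call.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(label_fn(text))
    return results
```

At one request per second, labelling N tweets takes at least N - 1 seconds, which is why the step is slow on a large crawl.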

Example content of espn_data_result.json:

{
  "probability": {
    "neg": 0.4768910438883407,
    "neutral": 0.8121072206192833,
    "pos": 0.5231089561116593
  },
  "label": "neutral"
}

Next, run the following script. It preprocesses the crawled data and then runs the three classifiers.

$ python classifier/

The figure folder contains the graphs of the precision recall curve. The metric_result folder contains the evaluation metrics of the classifier and the timing to run the classifier. The model folder contains the trained classifier.
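
For readers unfamiliar with the metrics, the sketch below shows how precision and recall for a single label can be computed from true and predicted labels. It is a plain-Python illustration, not the project's evaluation code, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Report 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A precision-recall curve is obtained by recomputing these two values while sweeping the decision threshold of the classifier.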

Alternatively, you may run the scripts individually, as shown in the following sections.


The preprocessing step will do the following in sequence:

  1. lower case
  2. remove html
  3. remove links
  4. remove mention
  5. remove hashtag
  6. lemmatization and remove stopwords
  7. remove punctuation

Then, it will output the preprocessed data to labelled_tweets.csv and label_api.csv. We can run the preprocessing step by using the following script:

$ python classifier/
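
The sequence of preprocessing steps can be sketched with the standard library alone. This is an approximation under stated assumptions, not the project's script: lemmatization is omitted because it needs an external library, the stopword set is a tiny illustrative one, and "remove hashtag" is assumed to mean dropping the whole `#tag` token.

```python
import html
import re
import string

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "at"}


def preprocess(text):
    text = text.lower()                                   # 1. lower case
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))   # 2. remove html
    text = re.sub(r"https?://\S+", " ", text)             # 3. remove links
    text = re.sub(r"@\w+", " ", text)                     # 4. remove mentions
    text = re.sub(r"#\w+", " ", text)                     # 5. remove hashtags
    # 6. remove stopwords (lemmatization is skipped in this sketch)
    text = " ".join(t for t in text.split() if t not in STOPWORDS)
    # 7. remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())
```

The order matters: links, mentions, and hashtags are removed before punctuation so that their markers (`://`, `@`, `#`) are still present when the patterns run.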

Example content of labelled_tweets.csv:

"buzzer-beating 3 win crucial bubble game make gary payton happy? #pac12afterdark never disappoints.","diaz: you're steroids mcgregor: sure am. i'm animal. icymi: #ufc196 presser went expected."


After the preprocessing step, we train several classifiers and evaluate them. Currently, we have three classifiers: linear support vector classification, a gensim classifier, and an ensemble classifier. By default, the following scripts generate evaluation metrics.

Linear support vector classification


$ python classifier/

Gensim classifier


$ python classifier/

Ensemble classifier


$ python classifier/ 
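
This README does not say how the ensemble combines its base classifiers. One common approach is majority voting, sketched below purely as an illustration; `majority_vote` and `ensemble_predict` are hypothetical names, and the project's ensemble may work differently.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label; ties go to the earliest vote."""
    counts = Counter(votes)
    best = max(counts.values())
    for label in votes:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_labels):
    """per_classifier_labels: one list of predicted labels per base classifier,
    all of the same length. Returns one combined label per sample."""
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_labels)]
```

Breaking ties in favour of the earliest vote means the order in which the base classifiers are listed acts as a priority order.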

Inter annotator agreement

The class calls the nltk annotation task.
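
To show what an agreement score measures, the sketch below computes Cohen's kappa for two annotators in plain Python. It is an illustration of the statistic, not the NLTK implementation and not necessarily the measure this project reports; `cohen_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, and 0 means agreement no better than chance.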

Classify all crawled data

$ python classifier/

## UI Client

We have a simple user interface that uses the Solr server to retrieve sport news. The current UI version has two functions:

- A button to trigger crawling in the backend
- A text area for keywords. Clicking the search button sends a query to the backend Solr server, and the retrieved records are then displayed on the page.
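
A keyword search against Solr amounts to a request to the core's `select` handler. The sketch below builds such a URL for the `sport` core used in the indexing commands; the `build_solr_query_url` helper, the default host, and the `rows` value are assumptions for illustration, and the actual UI may pass different parameters.

```python
from urllib.parse import urlencode


def build_solr_query_url(keywords, host="localhost", port=8983,
                         core="sport", rows=10):
    """Return the URL of a Solr select query asking for a JSON response."""
    params = urlencode({"q": keywords, "wt": "json", "rows": rows})
    return f"http://{host}:{port}/solr/{core}/select?{params}"
```

For example, `build_solr_query_url("lebron james")` yields a URL that can be opened in a browser to inspect the raw JSON the UI would render.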

To install all components for this website, we first need to install bower using the node package manager (npm):

$ npm install -g bower

Then we install all dependencies using bower:

$ cd UI
$ bower install

You can open the UI/index.html file to view the simple website. Please note that some functionality may not work unless you run the Solr server and serve the static content from the same domain.

But wait, there are so many things I need to do to run the simple website!

Yeah, you are right. You need to install all Python requirements, install all bower components, crawl data, classify data, host static files using some kind of server, run the Solr server, and index data using Solr. Tedious, right? Let the magic happen!


First, we need to make sure that your Docker client is connected to your Docker daemon.

# Only Mac OSX can run the following command.
$ docker-machine start default
$ docker-machine env default
$ eval $(docker-machine env default)

# For Windows users, please copy paste the output of previous command to your
# command line. 
$ docker-machine start default
$ docker-machine env default

# For Linux
# Make sure you can run docker without sudo by creating a docker group
$ sudo service docker start

Creating a docker group in Linux

Alternatively, Mac OSX and Windows users can connect to Docker daemon using Docker Quickstart Terminal program.

Next, we run all these tedious steps using the following commands:

$ source

# If you want to crawl, classify, start solr server and website
$ sportd start -cc

# If you do not want to classify, and only want to crawl, start solr server and website
$ sportd start -c

# If you do not want to crawl and classify, and only want to start solr server and website
$ sportd start

# You should only use sportd command in the root directory of this project

After running the commands, the application will be deployed to a virtual machine. We can find the IP address of our virtual machine using the following commands:

# For Mac OSX and Windows
$ docker-machine ip default

# For Linux
$ ifconfig docker0 | grep 'inet addr:' | cut -d: -f2 |  cut -d ' ' -f 1

Now, you can visit the website at http://theipaddress and the Solr Admin Panel at http://theipaddress:8983.

To stop the magic from happening:

$ sportd stop

Deploy to Google Container Engine

Build docker images:

$ export PROJECT_ID=sport-news-retrieval
$ export VERSION=v1.1-rc8
$ docker build -t${PROJECT_ID}/proxy:${VERSION} --file proxy/Dockerfile .
$ docker build -t${PROJECT_ID}/solr:${VERSION} --file index/Dockerfile .
$ docker build -t${PROJECT_ID}/recrawler:${VERSION} --file recrawler/Dockerfile .

$ gcloud docker push${PROJECT_ID}/proxy:${VERSION}
$ gcloud docker push${PROJECT_ID}/solr:${VERSION}
$ gcloud docker push${PROJECT_ID}/recrawler:${VERSION}

Create cluster:

# Only do the following once
$ gcloud container clusters create sport-news-retrieval \
    --num-nodes 1 \
    --machine-type g1-small

# Check the newly created instances
$ gcloud compute instances list

Create your pods:

# Configure gcloud and kubectl; only do the following once
$ gcloud config set project sport-news-retrieval
$ gcloud config set compute/zone asia-east1-a
$ gcloud config set container/cluster sport-news-retrieval
$ gcloud container clusters get-credentials sport-news-retrieval

# If you want to create new replication controllers (typically for first-time users)
$ kubectl create -f kubernete/proxy-rc.yml
$ kubectl create -f kubernete/solr-rc.yml
$ kubectl create -f kubernete/recrawler-rc.yml

# If you just want to update images
$ kubectl rolling-update proxy-node${PROJECT_ID}/proxy:${VERSION}
$ kubectl rolling-update solr-node${PROJECT_ID}/solr:${VERSION}
$ kubectl rolling-update recrawler-node${PROJECT_ID}/recrawler:${VERSION}

Allow external traffic

$ kubectl create -f kubernete/proxy-service.yml
$ kubectl create -f kubernete/solr-service.yml
$ kubectl create -f kubernete/recrawler-service.yml

View Status:

$ kubectl get pods
$ kubectl get services

Indexing to solr:

$ bin/post -c sport -host <external ip of solr-node> data/all_data.json

Stop all pods and services:

$ kubectl delete services solr-node
$ kubectl delete services proxy-node
$ kubectl delete services recrawler-node

$ kubectl delete rc proxy-node
$ kubectl delete rc solr-node
$ kubectl delete rc recrawler-node