ItsLastDay / StackOverflow_Map

A project for creating 2D visualizations of StackOverflow tags.
GNU General Public License v3.0

StackOverflow Map

This is a project for creating a 2D visualization of StackOverflow tags using the t-SNE algorithm. See the more thorough problem description.

Everything is written in Python 3.5 and C++.

Live demonstration: https://tag-map.github.io/

Project structure

Our repository follows the Cookiecutter Data Science template.

Prerequisites

We use Python 3 as the main tool, so you need a Python interpreter (e.g. CPython). Make sure you install every Python package listed in requirements.txt, e.g. via
pip3 install -r requirements.txt
Aside from Python, C++11 is used in time-critical places. Make sure you have a suitable compiler (e.g. gcc).
The analysis is run via a Makefile, so you need to have make installed. Our Makefile was tested on Ubuntu 16.04.

How to use

Running the example

After installing all prerequisites, you may want to run our example dataset (consisting of 376 tags) to be sure everything is all right. First, clone the repository with the command
git clone https://github.com/ItsLastDay/StackOverflow_Map.git

Then type (from the root of the repository)
make visualize_example
It should complete in a matter of minutes. Then go to the src/visualization folder and start a server:

cd ./src/visualization
python3 run_server.py

As a final step, open http://localhost:8000/ in your favourite web browser. You should see something like this:
[screenshot: example_8dec]

You can navigate the map using the mouse buttons and zoom with the scroll wheel.
[screenshot: example_8dec_text]

Running a full visualization

Now you are ready for the main part: running a visualization on the whole set of 50k+ tags. Our script lets you specify a date, POST_DATE. All posts earlier than this date are filtered out before the visualization is built. This affects how the similarity between two tags is measured: since we count the number of questions that have both tags, filtering out old questions makes the similarity reflect more recent usage.
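
The similarity described above can be sketched in Python as follows. This is only an illustration, not the project's actual code: the `(creation_date, tags)` input shape and the function name `tag_cooccurrence` are assumptions made for the example.

```python
from collections import Counter
from datetime import date
from itertools import combinations


def tag_cooccurrence(questions, post_date):
    """Count, for every pair of tags, the questions that carry both.

    `questions` is an iterable of (creation_date, tags) pairs; this input
    shape is an assumption for the sketch, not the project's real format.
    Questions created before `post_date` are skipped.
    """
    counts = Counter()
    for created, tags in questions:
        if created < post_date:
            continue  # older than POST_DATE: filtered out
        # Sort so that each unordered pair maps to a single key.
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts


questions = [
    (date(2011, 3, 1), ["python", "numpy"]),  # too old, ignored
    (date(2013, 5, 9), ["python", "numpy"]),
    (date(2014, 1, 2), ["python", "numpy", "pandas"]),
    (date(2015, 7, 4), ["c++", "templates"]),
]
similarity = tag_cooccurrence(questions, date(2012, 8, 25))
print(similarity[("numpy", "python")])  # 2
```

With an earlier POST_DATE the 2011 question would be counted too, raising the count for this pair to 3.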

From the root of the repository, type
make visualize POST_DATE=2012-08-25
(of course, you can specify any other date, but it must follow YYYY-MM-DD format)
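
If you drive the Makefile from your own scripts, it can help to check the date format before starting a run that takes hours. Below is a small sketch of such a check; the helper `parse_post_date` is hypothetical and is not part of the repository.

```python
from datetime import date, datetime


def parse_post_date(value):
    """Parse a YYYY-MM-DD string, raising ValueError on anything else."""
    parsed = datetime.strptime(value, "%Y-%m-%d").date()
    # strptime alone would also accept "2012-8-25", so compare the
    # round-tripped ISO form with the input to enforce zero padding.
    if parsed.isoformat() != value:
        raise ValueError("POST_DATE must look like YYYY-MM-DD: %r" % value)
    return parsed


print(parse_post_date("2012-08-25"))  # 2012-08-25
```

Inputs such as `25-08-2012`, `2012-8-25` or `2012-02-30` raise `ValueError`.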

This command requires several hours to complete. It writes its output to a separate folder whose name contains the POST_DATE value, e.g. tiles_2012-08-25. Don't hesitate to try different POST_DATE values: they do not overwrite each other! Then perform the steps described above:

cd ./src/visualization
python3 run_server.py

Open http://localhost:8000/ in your web browser. You will see a drop-down list on the left, where you can choose which visualization to show. Choose the one that matches the POST_DATE you specified. Hooray, you now see the full set of tags!
[screenshot: full_dec8]
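
To check which visualizations have already been generated without opening the browser, you can list the output folders yourself. The sketch below assumes only the tiles_<POST_DATE> naming shown above; the function `available_post_dates` is hypothetical, and the directory that holds the folders is something you need to supply, since this README does not state it.

```python
import os
import re

# Matches folder names such as tiles_2012-08-25 and captures the date part.
TILES_DIR_RE = re.compile(r"^tiles_(\d{4}-\d{2}-\d{2})$")


def available_post_dates(root):
    """Return the POST_DATE values that have a tiles_<date> folder in root."""
    dates = []
    for name in os.listdir(root):
        match = TILES_DIR_RE.match(name)
        if match and os.path.isdir(os.path.join(root, name)):
            dates.append(match.group(1))
    return sorted(dates)
```

Calling `available_post_dates(".")` from the directory that contains the tiles folders returns the dates in ascending order.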

Check out a video demonstration on YouTube.

Authors

Mikhail Koltsov (ItsLastDay)
Arkady Kalakutsky (testlnord)