Salty-Hackers / data-engineering

MIT License

Project

Identify and rank negative commenters on Hacker News using sentiment analysis, and make their comments and rankings accessible. Then deploy an API for the machine learning model and data visualizations.

App here: https://salted-hacker-news.herokuapp.com/

Getting started

Clone the repo

git clone https://github.com/Salty-Hackers/data-engineering.git
cd data-engineering

Install dependencies

pipenv install --dev

Activate the virtual environment

pipenv shell

Launch the app

uvicorn app.main:app --reload

Then open http://localhost:8000 in your browser.

File structure

.
├── data
├── app
│   ├── __init__.py
│   ├── main.py
│   ├── api
│   │   ├── __init__.py
│   │   ├── estimate.py
│   │   ├── hn_scraper.py
│   │   └── preprocessing_and_sentiment.py
│   └── tests
│       ├── __init__.py
│       ├── test_main.py
│       └── test_estimate.py
├── Data
│   ├── hn_0.csv
│   ├── hn_1.csv
│   ├── hn_2.csv
│   ├── hn_3.csv
│   ├── hn_4.csv
│   ├── hn_5.csv
│   ├── hn_6.csv
│   ├── hn_7.csv
│   ├── hn_8.csv
│   ├── hn_9.csv
│   ├── hn_10.csv
│   └── hn_11.csv
└── notebooks
    └── hn_preprocessing_and_sentiment_analysis.ipynb

Deploying to Heroku

Prepare Heroku

heroku login

heroku create YOUR-APP-NAME-GOES-HERE

heroku git:remote -a YOUR-APP-NAME-GOES-HERE

Deploy to Heroku

git add --all

git add --force Pipfile.lock

git commit -m "Deploy to Heroku"

git push heroku main:master

heroku open

(If you get a "Locking failed!" error when deploying to Heroku or when running pipenv install, delete Pipfile.lock and try again, this time skipping the git add --force Pipfile.lock step.)

Deactivate the virtual environment

exit

Data Dictionaries

Raw dataset

This dataset has 1.6 million observations of Hacker News comment data from 2014 to 2015. It is a subset of a Google BigQuery dataset that contains all stories and comments from Hacker News from its launch in 2006 to the present. Source.

It is split by date into 12 files so that each one stays under GitHub's 100 MB file size limit.
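
The split files can be read back as a single stream of rows. The following is a minimal sketch, assuming the files are named hn_0.csv through hn_11.csv in a Data directory (as in the file structure above) and that each has a header row. The helper names raw_csv_paths and iter_raw_rows are hypothetical and not part of the repo.

```python
import csv
from pathlib import Path
from typing import Dict, Iterator, List


def raw_csv_paths(data_dir: str = "Data", n_files: int = 12) -> List[Path]:
    """Return the expected paths hn_0.csv ... hn_11.csv (assumed layout)."""
    return [Path(data_dir) / f"hn_{i}.csv" for i in range(n_files)]


def iter_raw_rows(data_dir: str = "Data") -> Iterator[Dict[str, str]]:
    """Yield every row of every split file as a dict, skipping missing files."""
    for path in raw_csv_paths(data_dir):
        if not path.exists():
            continue  # tolerate a partial checkout
        with path.open(newline="", encoding="utf-8") as fh:
            # DictReader uses the header row for the keys (assumed present).
            yield from csv.DictReader(fh)
```

Yielding rows lazily avoids holding all 1.6 million observations in memory at once.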

Field     Type      Description
id        Integer   The item's unique id.
by        String    The username of the item's author.
author    String    The username of the item's author.
time      Integer   Creation date of the item, in Unix Time.
time_ts   Datetime  The full date and time of the comment, including microseconds.
text      String    The comment, story or poll text. HTML.
parent    Integer   The comment's parent's id: either another comment or the relevant story.
deleted   Boolean   true if the item is deleted.
dead      Boolean   true if the item is dead.
ranking   Integer   The story's score, or the votes for a pollopt.
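
Two of these fields usually need converting before analysis: time is in Unix seconds and text is HTML. The sketch below uses only the standard library. The regex-based tag stripping is a deliberate simplification, and the names to_datetime and html_to_text are hypothetical rather than taken from the repo's preprocessing code.

```python
import html
import re
from datetime import datetime, timezone

# Matches anything that looks like an HTML tag, e.g. <p> or <a href="...">.
_TAG_RE = re.compile(r"<[^>]+>")


def to_datetime(unix_seconds: int) -> datetime:
    """Convert the `time` field (Unix seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


def html_to_text(raw: str) -> str:
    """Replace tags with spaces, decode entities, and collapse whitespace."""
    no_tags = _TAG_RE.sub(" ", raw)
    return " ".join(html.unescape(no_tags).split())
```

Tags are stripped before entities are decoded so that an escaped sequence such as &lt;b&gt; survives as literal text instead of being removed as a tag.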

Processed datasets

These are sqlite3 databases that the app queries.

hn_comments

Field            Type      Description
comment          String    The comment text.
user             String    The username of the comment's author.
date_time        Datetime  The full date and time of the comment, excluding microseconds.
sentiment_score  Float     VADER's normalized composite (compound) sentiment score. -1 is most extreme negative, +1 is most extreme positive.
sentiment        String    The overall sentiment of the comment based on the sentiment_score.
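
The mapping from sentiment_score to sentiment can be sketched as a simple threshold rule. This README does not state the cutoffs or label names the project uses, so the ±0.05 threshold (a convention commonly used with VADER's compound score) and the labels "positive", "neutral", and "negative" are assumptions, as is the function name label_sentiment.

```python
def label_sentiment(score: float, threshold: float = 0.05) -> str:
    """Map a score in [-1, 1] to a coarse label.

    The threshold and the label names are assumptions, not the repo's values.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must be in [-1, 1], got {score}")
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```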

hn_users

Field                Type     Description
user                 String   The username of the Hacker News commenter.
avg_sentiment_score  Float    The arithmetic mean of the sentiment scores of the user's comments, from -1 to 1.
num_comments         Integer  The total number of comments by a user.
sentiment_ranking    Integer  Users ordered by their avg_sentiment_score, from saltiest to sweetest.
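
The hn_users fields can be derived from hn_comments with one aggregate query. The sketch below runs against an in-memory sqlite3 database with made-up rows. It assumes that rank 1 is the saltiest user (lowest average score) and breaks ties by username; the README specifies neither, and build_hn_users and demo are hypothetical names.

```python
import sqlite3
from typing import List, Tuple


def build_hn_users(conn: sqlite3.Connection) -> List[Tuple[str, float, int, int]]:
    """Return (user, avg_sentiment_score, num_comments, sentiment_ranking) rows."""
    rows = conn.execute(
        """
        SELECT "user", AVG(sentiment_score) AS avg_score, COUNT(*) AS n
        FROM hn_comments
        GROUP BY "user"
        ORDER BY avg_score ASC, "user" ASC
        """
    ).fetchall()
    # Assumption: rank 1 is the lowest average score (the "saltiest" user).
    return [(user, avg, n, rank) for rank, (user, avg, n) in enumerate(rows, start=1)]


def demo() -> List[Tuple[str, float, int, int]]:
    """Build a tiny hn_comments table with illustrative data and rank its users."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE hn_comments (comment TEXT, "user" TEXT, '
        "date_time TEXT, sentiment_score REAL, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO hn_comments VALUES (?, ?, ?, ?, ?)",
        [
            ("This is awful.", "alice", "2015-01-01 00:00:00", -0.5, "negative"),
            ("Still bad.", "alice", "2015-01-02 00:00:00", -0.25, "negative"),
            ("Nice work!", "bob", "2015-01-03 00:00:00", 0.5, "positive"),
        ],
    )
    result = build_hn_users(conn)
    conn.close()
    return result
```

Ranking in Python with enumerate keeps the query free of window functions, which older SQLite builds do not support.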