Identify and rank negative commenters on Hacker News using sentiment analysis, and make their comments and rankings accessible. Then deploy an API for the machine learning model and data visualizations.
App here: https://salted-hacker-news.herokuapp.com/
Clone the repo
git clone https://github.com/Salty-Hackers/data-engineering.git
cd data-engineering
Install dependencies
pipenv install --dev
Activate the virtual environment
pipenv shell
Launch the app
uvicorn app.main:app --reload
Go to localhost:8000 in your browser.
.
├── data
├── app
│   ├── __init__.py
│   ├── main.py
│   ├── api
│   │   ├── __init__.py
│   │   ├── estimate.py
│   │   ├── hn_scraper.py
│   │   └── preprocessing_and_sentiment.py
│   └── tests
│       ├── __init__.py
│       ├── test_main.py
│       └── test_estimate.py
├── Data
│   ├── hn_0.csv
│   ├── hn_1.csv
│   ├── hn_2.csv
│   ├── hn_3.csv
│   ├── hn_4.csv
│   ├── hn_5.csv
│   ├── hn_6.csv
│   ├── hn_7.csv
│   ├── hn_8.csv
│   ├── hn_9.csv
│   ├── hn_10.csv
│   └── hn_11.csv
└── notebooks
    └── hn_preprocessing_and_sentiment_analysis.ipynb
Prepare Heroku
heroku login
heroku create YOUR-APP-NAME-GOES-HERE
heroku git:remote -a YOUR-APP-NAME-GOES-HERE
Deploy to Heroku
git add --all
git add --force Pipfile.lock
git commit -m "Deploy to Heroku"
git push heroku main:master
heroku open
(If you get a "Locking failed!" error when deploying to Heroku or running pipenv install, delete Pipfile.lock and try again, this time without git add --force Pipfile.lock.)
Deactivate the virtual environment
exit
This dataset has 1.6 million observations of Hacker News comment data from 2014 to 2015. It is a subset of a Google BigQuery dataset that contains all stories and comments from Hacker News from its launch in 2006 to the present. Source.
It is split into 12 files by date so that each stays under GitHub's 100 MB file size limit.
Field | Type | Description |
---|---|---|
id | Integer | The item's unique id. |
by | String | The username of the item's author. |
author | String | The username of the item's author. |
time | Integer | Creation date of the item, in Unix Time. |
time_ts | Datetime | The full date and time of the comment, including microseconds. |
text | String | The comment, story or poll text. HTML. |
parent | Integer | The comment's parent's id: either another comment or the relevant story. |
deleted | Boolean | true if the item is deleted. |
dead | Boolean | true if the item is dead. |
ranking | Integer | The story's score, or the votes for a pollopt. |
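
As a rough sketch of how the raw files might be read, the snippet below loads the CSV files with Python's standard library, skips deleted and dead items, and converts time from Unix time to a UTC datetime. The Data/hn_*.csv pattern, the helper names, and the assumption that booleans are stored as the strings true/false (or left empty) are illustrative guesses, not taken from the project code.

```python
import csv
import glob
from datetime import datetime, timezone


def is_true(value):
    """Treat 'true' (any case) as True; empty or anything else as False."""
    return (value or "").strip().lower() == "true"


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into typed values."""
    return {
        "id": int(row["id"]),
        "author": row["author"],
        # `time` is seconds since the Unix epoch.
        "time": datetime.fromtimestamp(int(row["time"]), tz=timezone.utc),
        "text": row["text"],
    }


def load_comments(pattern="Data/hn_*.csv"):
    """Yield typed rows from every matching file, skipping deleted/dead items."""
    # sorted() orders paths lexicographically (hn_10 before hn_2),
    # which is fine when file order does not matter.
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if is_true(row.get("deleted")) or is_true(row.get("dead")):
                    continue
                yield parse_row(row)
```

Because load_comments is a generator, rows are read one at a time, which keeps memory use low for a dataset of this size.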
The two tables below are stored in sqlite3 databases, which are queried to serve the comments and rankings.
Field | Type | Description |
---|---|---|
comment | String | The comment text. |
user | String | The username of the comment's author. |
date_time | Datetime | The full date and time of the comment, excluding microseconds. |
sentiment_score | Float | VADER's normalized composite sentiment score. -1 is most extreme negative, +1 is most extreme positive. |
sentiment | String | The overall sentiment of the comment based on the sentiment_score. |
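
The table does not say how sentiment is derived from sentiment_score. One common convention, suggested in VADER's own documentation, treats compound scores of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. The sketch below assumes those cutoffs and label strings; the project may well use different ones.

```python
def label_sentiment(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label.

    The 0.05 threshold and the label names are assumptions based on the
    convention described in VADER's documentation, not on this project's code.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must be between -1 and 1, got {score}")
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```

For example, label_sentiment(-0.7) returns "negative" and label_sentiment(0.01) returns "neutral".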
Field | Type | Description |
---|---|---|
user | String | The username of the Hacker News commenter. |
avg_sentiment_score | Float | The arithmetic mean of the sentiment of user's comments, from -1 to 1. |
num_comments | Integer | The total number of comments by a user. |
sentiment_ranking | Integer | Users ordered by their avg_sentiment_score, from saltiest to sweetest. |
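
To illustrate how the per-user fields relate to the per-comment table, here is a minimal sketch that derives them with sqlite3 from an in-memory database. The table name comments, the sample rows, the choice of 1 as the saltiest rank, and the tie-break on username are all assumptions made for the example; the actual databases may be built differently.

```python
import sqlite3


def rank_users(conn):
    """Return (user, avg_sentiment_score, num_comments, sentiment_ranking) tuples.

    Users are ordered from lowest (saltiest) to highest (sweetest) average
    score; rank 1 is the saltiest. Ties are broken by username.
    """
    rows = conn.execute(
        """
        SELECT user,
               AVG(sentiment_score) AS avg_sentiment_score,
               COUNT(*) AS num_comments
        FROM comments
        GROUP BY user
        ORDER BY avg_sentiment_score ASC, user ASC
        """
    ).fetchall()
    return [(user, avg, n, rank) for rank, (user, avg, n) in enumerate(rows, start=1)]


# Hypothetical sample data using the per-comment columns described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "comment TEXT, user TEXT, date_time TEXT, sentiment_score REAL, sentiment TEXT)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?, ?)",
    [
        ("This is awful.", "alice", "2015-01-01 00:00:00", -0.5, "negative"),
        ("Still bad.", "alice", "2015-01-02 00:00:00", -0.25, "negative"),
        ("Great work!", "bob", "2015-01-01 12:00:00", 0.5, "positive"),
        ("It exists.", "carol", "2015-01-03 08:00:00", 0.0, "neutral"),
        ("Pretty nice.", "carol", "2015-01-04 08:00:00", 0.25, "positive"),
    ],
)
ranking = rank_users(conn)
```

With this sample data, alice has the lowest average score and is ranked first, followed by carol and then bob.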