dominikmn / one-million-posts

Assisting newspaper moderators with machine learning.
MIT License

One Million Posts

Natural language processing project based on the one-million-posts dataset.

More than 3,000 user comments are written each day on www.derstandard.at. Moderators need to review these comments for several aspects, such as inappropriate language, discriminatory content, off-topic comments, questions that need to be answered, and more.

We provide machine learning models that detect potentially problematic comments to ease the moderators' daily work.

Setup

  1. Install pyenv.
  2. Install Python 3.8.5 via `pyenv install 3.8.5`.
  3. Run `make setup`.

For further instructions on how to run our code, see SETUP.md.

Presentations

The presentations are found in ./presentations/.

| Presentation file | Description |
| --- | --- |
| OneMillionPosts-GraduationEvent.pdf | Presentation of the graduation event from April 28, 2021 |
| OneMillionPosts-Midterm.pdf | Midterm presentation of the project from April 12, 2021 |
| OneMillionPosts-AnnotationComposition.pdf | EDA concerning tickets #24 and #25 |

Modeling

The models' code is found in ./modeling/ in this repo. The models are pushed as .py files; see SETUP.md.

| Model | Description |
| --- | --- |
| gbert Classifier | German BERT base |
| Zero Shot Classifier | xlm-roberta-large-xnli |
| XGBoost | XGBoost |
| Logistic Regression | Logistic Regression |
| Support Vector Classifier | Support Vector Classifier |
| Random Forest Classifier | Random Forest Classifier |
| Naive Bayes Classifier | Naive Bayes Classifier |
| LightGBM | LightGBM algorithm; not considered for further modeling |
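
To illustrate the zero-shot approach, the sketch below classifies a single German comment with xlm-roberta-large-xnli via the Hugging Face transformers pipeline. This is a minimal sketch, not the code in ./modeling/: the hub id joeddav/xlm-roberta-large-xnli and the candidate labels are assumptions loosely based on the moderation aspects listed above.

```python
from transformers import pipeline

# Minimal zero-shot sketch; the hub id and the candidate labels are
# illustrative assumptions, not the configuration used in ./modeling/.
classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
)

comment = "Dieser Kommentar ist beleidigend und hat nichts mit dem Artikel zu tun."
labels = [
    "inappropriate language",
    "discriminating content",
    "off-topic",
    "question that needs an answer",
]

# multi_label=True scores each label independently, since a comment can be
# problematic in more than one way.
result = classifier(comment, candidate_labels=labels, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```

Because the zero-shot classifier needs no task-specific training data, a sketch like this can serve as a quick baseline before turning to the fine-tuned or classical models listed above.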

Data analysis

The notebooks are found in ./notebooks/. They are pushed as .py files. See SETUP.md.