dominikmn / one-million-posts

Assisting newspaper moderators with machine learning.
MIT License

One Million Posts

Natural language processing project based on the one-million-posts dataset.

More than 3,000 user comments are written each day on www.derstandard.at. Moderators need to review these comments for several aspects, such as inappropriate language, discriminatory content, off-topic comments, questions that need to be answered, and more.

We provide machine learning models that detect potentially problematic comments to ease the moderators' daily work.

Setup

  1. Install pyenv.
  2. Install Python 3.8.5 via `pyenv install 3.8.5`.
  3. Run `make setup`.

For further instructions on how to run our code, see SETUP.md.

Presentations

The presentations are found in ./presentations/.

| Presentation file | Description |
| --- | --- |
| OneMillionPosts-GraduationEvent.pdf | Presentation of the graduation event from April 28, 2021 |
| OneMillionPosts-Midterm.pdf | Midterm presentation of the project from April 12, 2021 |
| OneMillionPosts-AnnotationComposition.pdf | EDA concerning tickets #24 and #25 |

Modeling

The models' code is found in ./modeling/ in this repo. The models are pushed as .py files; see SETUP.md.

| Model | Description |
| --- | --- |
| gbert Classifier | German BERT base |
| Zero Shot Classifier | xlm-roberta-large-xnli |
| XGBoost | XGBoost |
| Logistic Regression | Logistic Regression |
| Support Vector Classifier | Support Vector Classifier |
| Random Forest Classifier | Random Forest Classifier |
| Naive Bayes Classifier | Naive Bayes Classifier |
| LightGBM | LightGBM algorithm; not considered for further modeling |
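
To illustrate the zero-shot approach, the sketch below classifies a single German comment with xlm-roberta-large-xnli via the Hugging Face transformers pipeline. This is a minimal sketch, not the code in ./modeling/: the hub id joeddav/xlm-roberta-large-xnli and the candidate labels are assumptions loosely based on the moderation aspects listed above.

```python
from transformers import pipeline

# Minimal zero-shot sketch; the hub id and the candidate labels are
# illustrative assumptions, not the configuration used in ./modeling/.
classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
)

comment = "Dieser Kommentar ist beleidigend und hat nichts mit dem Artikel zu tun."
labels = [
    "inappropriate language",
    "discriminating content",
    "off-topic",
    "question that needs an answer",
]

# multi_label=True scores each label independently, since a comment can be
# problematic in more than one way.
result = classifier(comment, candidate_labels=labels, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```

Because the zero-shot classifier needs no task-specific training data, a sketch like this can serve as a quick baseline before turning to the fine-tuned or classical models listed above.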

Data analysis

The notebooks are found in ./notebooks/. They are pushed as .py files. See SETUP.md.