PlaceboAffect

This repository contains the PlaceboAffect team's project for the LING 573 course at UW Seattle.

Project Overview
Folder Structure
Setup
Components
Scripts
Configs
Outputs
License
Authors

Project Overview

We have developed an affect recognition system for SemEval-2019 Task 5, focusing on identifying hate speech in both English and Spanish tweets targeting immigrants and women. It employs a binary classification approach using a Word2Vec model for word embeddings and a Support Vector Machine (SVM) algorithm. To enhance the system's performance, we have incorporated additional lexical features such as n-grams and sentiment scores. For the Spanish tweets, we have employed a translation-based approach and integrated it into our existing pipeline.
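As a rough illustration of the n-gram lexical features mentioned above, the sketch below counts word n-grams in a tweet. It is not taken from the project source: the function names, the whitespace tokenization, and the choice of word-level (rather than character-level) n-grams are all assumptions made for the example.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the contiguous word n-grams (as tuples) found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=2):
    """Count word n-grams for n = 1..max_n in a lowercased tweet.

    Assumption: simple whitespace tokenization; the real preprocessing
    in src/features may tokenize differently.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(tokens, n))
    return counts
```

Counts like these could be appended to an embedding-based feature vector before it is passed to the classifier.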

Folder Structure

The project follows a structured folder organization to store data, models, outputs, results, scripts, and source code. Here is an overview of the folder structure:

├── data
│   ├── dev
│   │   ├── en
│   │   ├── es
│   │   └── es2en
│   ├── test
│   │   ├── en
│   │   ├── es
│   │   └── es2en
│   └── train
│       ├── en
│       ├── es
│       └── es2en
├── doc
├── models
│   ├── D2
│   ├── D3
│   └── D4
│       ├── adaptation
│       └── primary
├── outputs
│   ├── D2
│   ├── D3
│   └── D4
│       ├── adaptation
│       │   ├── devtest
│       │   └── evaltest
│       └── primary
│           ├── devtest
│           └── evaltest
├── results
│   ├── D2
│   ├── D3
│   └── D4
│       ├── adaptation
│       │   ├── devtest
│       │   └── evaltest
│       └── primary
│           ├── devtest
│           └── evaltest
├── scripts
├── setup
└── src
    ├── configs
    ├── features
    └── modeling

data: This folder contains the training, development, and test datasets for both English and Spanish for this task.
doc: This folder contains various reports, PowerPoint presentations, and additional documentation related to the project.
models: This folder contains the saved pickle files of each trained model.
outputs: This folder contains the prediction files generated by each model.
results: This folder contains the evaluation scores for each model.
scripts: This folder contains a collection of scripts that can be utilized to run different models with different configurations.
setup: This folder contains the setup script required for the conda environment.
src: This folder contains the source code for the system.

Setup

To set up the project environment, follow the steps below:

Navigate to the "setup" folder using the command line:
```
cd setup
```
Change the permission of the create_env.sh script to make it executable:
```
chmod +x create_env.sh
```
Run the create_env.sh script to create the conda environment:
```
./create_env.sh
```
Activate the newly created environment:
```
conda activate PlaceboAffect
```

Components

The project consists of the following components:

Translation
- File: scripts/translate.py
- This file contains functions and code for translating Spanish tweets into English. The translated files are stored in the designated folder for further processing within the system. Note that this script is not run as part of the system itself.
Preprocessing
- File: src/features/preprocess.py
- This file contains functions and code for preprocessing raw data, including cleaning, formatting, and transforming the data.
Feature Extraction
- File: src/features/extract_features.py
- This file includes functions and code for extracting relevant features from the preprocessed data, such as implementing bag-of-words, n-grams, or word embedding techniques.
Training
- File: src/modeling/classifier.py
- This file contains functions and code for training classification models using the preprocessed data and extracted features. It also includes hyperparameter fine-tuning to optimize the model performance.
Inference
- File: src/modeling/classifier.py
- This file contains functions and code for making predictions using the trained classification model. It utilizes the trained model to predict the corresponding label of a given tweet.
Evaluation
- File: src/main.py
- This file contains functions and code for evaluating the performance and effectiveness of the trained model. It includes calculation of metrics such as accuracy, precision, recall, and F1-score.
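The metrics named in the Evaluation component can be sketched in a few lines for the binary case. This is only an illustration and not the code in src/main.py: the function name `binary_scores` and the convention that label 1 is the positive (hate speech) class are assumptions, and the official task scorer may average scores differently.

```python
def binary_scores(gold, pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for the positive class.

    `gold` and `pred` are equal-length sequences of labels.
    Assumption: `positive` marks the hate-speech class.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    # Guard every division so empty or degenerate inputs yield 0.0.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, `binary_scores([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])` counts two true positives, one false positive, and one false negative.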

Scripts

scripts/model_runner.sh: This script is responsible for running the model with specified parameters. It can be executed from any directory and takes the following parameters:
- --mode or -m: Specifies the mode of operation, which can be either train or test. This parameter determines whether the system should perform training or testing.
- --task or -t: Specifies the task type, which can be either primary or adaptation. This parameter defines the type of task to be performed by the system.
- --model or -s: Specifies the name of the model. This parameter allows you to choose a specific model for the task. Available options are:
  - baseline: BOW (Bag of Words)
  - alpha: Embedding
  - beta: Embedding with Empath
  - gamma: Embedding with Empath & N-Grams
  - delta: Embedding with Empath, N-Grams & Sentiment
These three parameters are required for the system to function properly. Make sure to provide them when executing the script. Example usage: ./scripts/model_runner.sh -m train -t primary -s baseline. Please adjust the parameters as needed for your specific use case.
scripts/train.sh: This script is used to train the specified model for the primary task. It only takes the model name as input. Example usage:
```
./scripts/train.sh baseline
```
scripts/train_adapt.sh: This script is used to train the specified model for the adaptation task. It only takes the model name as input. Example usage:
```
./scripts/train_adapt.sh baseline
```
scripts/test.sh: This script is used to test the specified model for the primary task. It only takes the model name as input. Example usage:
```
./scripts/test.sh baseline
```
scripts/test_adapt.sh: This script is used to test the specified model for the adaptation task. It only takes the model name as input. Example usage:
```
./scripts/test_adapt.sh baseline
```
scripts/run_all.py: This script automatically executes the training and testing process for both the primary and adaptation tasks across all five models. It allows for a streamlined and efficient workflow by eliminating the need to manually execute the model_runner.sh script multiple times. To run the run_all.py script, use the following command:
```
python ./scripts/run_all.py
```
D4.cmd: This script is specifically designed to perform inference using the best model for the adaptation task. Before running the D4.cmd script, ensure that the best model for the adaptation task is available and properly trained. Please note that the D4.cmd script assumes that all the required configurations and dependencies are in place for the successful execution of the inference process. To run the D4.cmd script, use the following command on patas:
```
condor_submit D4.cmd
```
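To make the relationship between model_runner.sh and a run-everything workflow concrete, here is a minimal sketch of how the documented flags could be assembled and enumerated from Python. It is not the contents of scripts/run_all.py: the helper names `build_command` and `all_commands`, and the train-before-test ordering, are assumptions for illustration. Only the flags (-m, -t, -s) and their allowed values come from the description above.

```python
from itertools import product

MODES = ("train", "test")
TASKS = ("primary", "adaptation")
MODELS = ("baseline", "alpha", "beta", "gamma", "delta")


def build_command(mode, task, model, script="./scripts/model_runner.sh"):
    """Return the argument list for one model_runner.sh invocation."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return [script, "-m", mode, "-t", task, "-s", model]


def all_commands():
    """List every task/model combination, training before testing (assumed order)."""
    return [build_command(mode, task, model)
            for task, model in product(TASKS, MODELS)
            for mode in MODES]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` to execute the corresponding run.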

Configs

The project includes an individual config file for each model, which lets you enable or disable the specific features that are relevant to that model. These config files can be found in the src/configs directory. By modifying the config file for a particular model, you can control the features and settings used during training and testing.
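The sketch below illustrates the idea of per-model feature toggles using the five models listed in the Scripts section. The actual config file format and key names are not described here, so the dictionary layout, the key names (`bow`, `embeddings`, `empath`, `ngrams`, `sentiment`), and the helper `enabled_features` are all hypothetical.

```python
# Hypothetical feature switches; the real files in src/configs may
# use a different format and different key names.
MODEL_FEATURES = {
    "baseline": {"bow": True, "embeddings": False, "empath": False,
                 "ngrams": False, "sentiment": False},
    "alpha": {"bow": False, "embeddings": True, "empath": False,
              "ngrams": False, "sentiment": False},
    "beta": {"bow": False, "embeddings": True, "empath": True,
             "ngrams": False, "sentiment": False},
    "gamma": {"bow": False, "embeddings": True, "empath": True,
              "ngrams": True, "sentiment": False},
    "delta": {"bow": False, "embeddings": True, "empath": True,
              "ngrams": True, "sentiment": True},
}


def enabled_features(model):
    """Return the sorted names of the features switched on for `model`."""
    return sorted(name for name, on in MODEL_FEATURES[model].items() if on)
```

Under this layout, disabling a feature for one model amounts to flipping a single flag in that model's entry.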

Outputs

For each model, the system generates separate outputs for the primary task and the adaptation task, depending on the task specified. Both the devtest and evaltest outputs are generated automatically for each task. For example, for the BOW Model (baseline), the specific files generated include:

Primary Task
- models/D4/primary/baseline.pkl
- outputs/D4/primary/devtest/pred_baseline.txt
- outputs/D4/primary/evaltest/pred_baseline.txt
- results/D4/primary/devtest/D4_scores.out
- results/D4/primary/evaltest/D4_scores.out
Adaptation Task
- models/D4/adaptation/baseline.pkl
- outputs/D4/adaptation/devtest/pred_baseline.txt
- outputs/D4/adaptation/evaltest/pred_baseline.txt
- results/D4/adaptation/devtest/D4_scores.out
- results/D4/adaptation/evaltest/D4_scores.out

For other models, please refer to the following directories for more information:

Models Directory: models/D4
Outputs Directory: outputs/D4
Results Directory: results/D4

These directories contain the model files, output files, and result files corresponding to each model.
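The naming pattern in the baseline example can be generalized to the other models. The following sketch derives the expected file paths for a given model and task; the helper name `expected_files` is hypothetical, and it assumes the other models follow exactly the same pattern as the baseline files listed above.

```python
DELIVERABLE = "D4"
SPLITS = ("devtest", "evaltest")


def expected_files(model, task):
    """Return the paths expected for `model` on `task`.

    Assumption: every model follows the baseline naming pattern.
    """
    files = [f"models/{DELIVERABLE}/{task}/{model}.pkl"]
    # One prediction file per evaluation split.
    for split in SPLITS:
        files.append(f"outputs/{DELIVERABLE}/{task}/{split}/pred_{model}.txt")
    # One scores file per evaluation split.
    for split in SPLITS:
        files.append(f"results/{DELIVERABLE}/{task}/{split}/{DELIVERABLE}_scores.out")
    return files
```

Calling `expected_files("baseline", "primary")` reproduces the five Primary Task paths listed above.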

License

Distributed under the Apache License 2.0. See LICENSE for more information.

Authors

Mohamed Elkamhawy
- Email: mohame@uw.edu
Karl Haraldsson
- Email: kharalds@uw.edu
Alex Maris
- Email: alexmar@uw.edu
Nora Miao
- Email: norah98@uw.edu
