This repository contains the work of team PlaceboAffect for the LING 573 course at UW Seattle.
We have developed an affect recognition system for SemEval-2019 Task 5, which focuses on identifying hate speech targeting immigrants and women in English and Spanish tweets. The system uses a binary classification approach built on Word2Vec word embeddings and a Support Vector Machine (SVM) classifier. To improve performance, we added lexical features such as n-grams and sentiment scores. For the Spanish tweets, we use a translation-based approach that is integrated into the existing pipeline.
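As a rough illustration of how embedding features and lexical features can be combined into one input vector for a classifier, here is a minimal sketch. It is not the project's code: the toy embedding table stands in for a trained Word2Vec model, averaging token vectors is an assumed pooling strategy, and the names `TOY_EMBEDDINGS`, `tweet_vector`, and `build_features` are hypothetical.

```python
from typing import Dict, List, Sequence

# Toy 3-dimensional embeddings standing in for a trained Word2Vec model
# (hypothetical values, for illustration only).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "go": [1.0, 0.0, 0.0],
    "home": [0.0, 1.0, 0.0],
}


def tweet_vector(tokens: Sequence[str],
                 embeddings: Dict[str, List[float]],
                 dim: int) -> List[float]:
    """Average the embeddings of known tokens (assumed pooling strategy).

    Returns a zero vector when no token is in the vocabulary.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def build_features(tokens: Sequence[str],
                   embeddings: Dict[str, List[float]],
                   dim: int,
                   lexical: Sequence[float]) -> List[float]:
    """Concatenate the pooled embedding with extra lexical feature values."""
    return tweet_vector(tokens, embeddings, dim) + list(lexical)


if __name__ == "__main__":
    # "now" is out of vocabulary and is skipped; 0.25 is a made-up sentiment score.
    print(build_features(["go", "home", "now"], TOY_EMBEDDINGS, 3, [0.25]))
```

The resulting vector would then be passed to the SVM; the actual feature extraction lives in src/features/extract_features.py and may differ from this sketch.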
The project follows a structured folder organization to store data, models, outputs, results, scripts, and source code. Here is an overview of the folder structure:
├── data
│   ├── dev
│   │   ├── en
│   │   ├── es
│   │   └── es2en
│   ├── test
│   │   ├── en
│   │   ├── es
│   │   └── es2en
│   └── train
│       ├── en
│       ├── es
│       └── es2en
├── doc
├── models
│   ├── D2
│   ├── D3
│   └── D4
│       ├── adaptation
│       └── primary
├── outputs
│   ├── D2
│   ├── D3
│   └── D4
│       ├── adaptation
│       │   ├── devtest
│       │   └── evaltest
│       └── primary
│           ├── devtest
│           └── evaltest
├── results
│   ├── D2
│   ├── D3
│   └── D4
│       ├── adaptation
│       │   ├── devtest
│       │   └── evaltest
│       └── primary
│           ├── devtest
│           └── evaltest
├── scripts
├── setup
└── src
    ├── configs
    ├── features
    └── modeling
data: This folder contains the training, development, and test datasets for both English and Spanish for this task.
doc: This folder contains various reports, PowerPoint presentations, and additional documentation related to the project.
models: This folder contains the saved pickle files of each trained model.
outputs: This folder contains the prediction files generated by each model.
results: This folder contains the evaluation scores for each model.
scripts: This folder contains a collection of scripts that can be utilized to run different models with different configurations.
setup: This folder contains the setup script required for the conda environment.
src: This folder contains the source code for the system.
To set up the project environment, follow the steps below:
Navigate to the "setup" folder using the command line:
cd setup
Change the permission of the create_env.sh script to make it executable:
chmod +x create_env.sh
Run the create_env.sh script to create the conda environment:
./create_env.sh
Activate the newly created environment:
conda activate PlaceboAffect
The project consists of the following components:
Translation
scripts/translate.py
Preprocessing
src/features/preprocess.py
Feature Extraction
src/features/extract_features.py
Training
src/modeling/classifier.py
Inference
src/modeling/classifier.py
Evaluation
src/main.py
scripts/model_runner.sh: This script is responsible for running the model with specified parameters. It can be executed from any directory and takes the following parameters:
--mode or -m: Specifies the mode of operation, which can be either train or test. This parameter determines whether the system should perform training or testing.
--task or -t: Specifies the task type, which can be either primary or adaptation. This parameter defines the type of task to be performed by the system.
--model or -s: Specifies the name of the model. This parameter allows you to choose a specific model for the task. Available options are:
baseline: BOW (Bag of Words)
alpha: Embedding
beta: Embedding with Empath
gamma: Embedding with Empath and N-Grams
delta: Embedding with Empath, N-Grams, and Sentiment
These three parameters are required for the system to function properly. Make sure to provide them when executing the script. Example usage:
./scripts/model_runner.sh -m train -t primary -s baseline
Please adjust the parameters as needed for your specific use case.
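The parameter rules described above can be expressed as a small Python argument parser. This is only a sketch of the documented interface, not the contents of model_runner.sh (which is a shell script); the function name `build_parser` is hypothetical, while the flags and allowed values are taken from the list above.

```python
import argparse

# Model names as documented for the --model / -s parameter.
MODELS = ["baseline", "alpha", "beta", "gamma", "delta"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the three required runner parameters."""
    parser = argparse.ArgumentParser(
        description="Sketch of the model_runner.sh command-line interface."
    )
    parser.add_argument("-m", "--mode", required=True,
                        choices=["train", "test"],
                        help="whether to train or test the model")
    parser.add_argument("-t", "--task", required=True,
                        choices=["primary", "adaptation"],
                        help="which task to run")
    parser.add_argument("-s", "--model", required=True,
                        choices=MODELS,
                        help="name of the model to run")
    return parser


if __name__ == "__main__":
    # Same arguments as the documented example invocation.
    args = build_parser().parse_args(
        ["-m", "train", "-t", "primary", "-s", "baseline"]
    )
    print(args.mode, args.task, args.model)
```

With this setup, omitting any of the three flags or passing an unknown value makes argparse exit with a usage error, which matches the requirement that all three parameters be provided.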
scripts/train.sh: This script is used to train the specified model for the primary task. It only takes the model name as input. Example usage:
./scripts/train.sh baseline
scripts/train_adapt.sh: This script is used to train the specified model for the adaptation task. It only takes the model name as input. Example usage:
./scripts/train_adapt.sh baseline
scripts/test.sh: This script is used to test the specified model for the primary task. It only takes the model name as input. Example usage:
./scripts/test.sh baseline
scripts/test_adapt.sh: This script is used to test the specified model for the adaptation task. It only takes the model name as input. Example usage:
./scripts/test_adapt.sh baseline
scripts/run_all.py: This script automatically executes the training and testing process for both the primary and adaptation tasks across all five models. It allows for a streamlined and efficient workflow by eliminating the need to manually execute the model_runner.sh script multiple times. To run the run_all.py script, use the following command:
python ./scripts/run_all.py
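To make the scope of such a full run concrete, the sketch below enumerates one model_runner.sh invocation per mode, task, and model. It is an illustration under assumptions, not the actual contents of run_all.py: the helper names `build_commands` and `run_all`, the iteration order, and the `dry_run` switch are all hypothetical.

```python
import itertools
import subprocess
from typing import List

TASKS = ["primary", "adaptation"]
MODELS = ["baseline", "alpha", "beta", "gamma", "delta"]
MODES = ["train", "test"]  # train first so a saved model exists before testing


def build_commands(runner: str = "./scripts/model_runner.sh") -> List[List[str]]:
    """Return one runner command per (task, model, mode) combination."""
    commands = []
    for task, model in itertools.product(TASKS, MODELS):
        for mode in MODES:
            commands.append([runner, "-m", mode, "-t", task, "-s", model])
    return commands


def run_all(dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for command in build_commands():
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop at the first failing run.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Two tasks, five models, and two modes give 20 runner invocations in total, which is the manual work the script saves.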
D4.cmd: This script is specifically designed to perform inference using the best model for the adaptation task. Before running the D4.cmd script, ensure that the best model for the adaptation task is available and properly trained. Please note that the D4.cmd script assumes that all the required configurations and dependencies are in place for the successful execution of the inference process. To run the D4.cmd script, use the following command on patas:
condor_submit D4.cmd
The project includes an individual config file for each model, providing the flexibility to enable or disable the specific features that are relevant to that model. These config files can be found in the src/configs directory. By modifying the config file corresponding to a particular model, you can control the features and settings used during training and testing.
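The idea of per-model feature toggles can be pictured with a plain Python mapping. The real config files may use a different format and different keys; the key names below (`bow`, `embedding`, `empath`, `ngrams`, `sentiment`) and the helper `enabled_features` are hypothetical, and the on/off values are inferred from the model descriptions in this README (it is assumed here that the baseline uses only bag-of-words features).

```python
from typing import Dict, List

# Hypothetical feature switches per model, inferred from the model list:
# baseline = BOW, alpha = Embedding, beta adds Empath, gamma adds N-Grams,
# delta adds Sentiment.
MODEL_FEATURES: Dict[str, Dict[str, bool]] = {
    "baseline": {"bow": True, "embedding": False, "empath": False,
                 "ngrams": False, "sentiment": False},
    "alpha": {"bow": False, "embedding": True, "empath": False,
              "ngrams": False, "sentiment": False},
    "beta": {"bow": False, "embedding": True, "empath": True,
             "ngrams": False, "sentiment": False},
    "gamma": {"bow": False, "embedding": True, "empath": True,
              "ngrams": True, "sentiment": False},
    "delta": {"bow": False, "embedding": True, "empath": True,
              "ngrams": True, "sentiment": True},
}


def enabled_features(model: str) -> List[str]:
    """Return the alphabetically sorted names of the features switched on."""
    return sorted(name for name, on in MODEL_FEATURES[model].items() if on)


if __name__ == "__main__":
    for model in MODEL_FEATURES:
        print(model, enabled_features(model))
```

Flipping a single value in such a mapping is the kind of change the config files are meant to allow without touching the source code.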
For each model, the system generates separate outputs for the primary task and the adaptation task, depending on the task specified. Both the devtest and evaltest outputs are generated automatically for each task. For example, for the BOW model (baseline), the generated files are:
Primary Task
models/D4/primary/baseline.pkl
outputs/D4/primary/devtest/pred_baseline.txt
outputs/D4/primary/evaltest/pred_baseline.txt
results/D4/primary/devtest/D4_scores.out
results/D4/primary/evaltest/D4_scores.out
Adaptation Task
models/D4/adaptation/baseline.pkl
outputs/D4/adaptation/devtest/pred_baseline.txt
outputs/D4/adaptation/evaltest/pred_baseline.txt
results/D4/adaptation/devtest/D4_scores.out
results/D4/adaptation/evaltest/D4_scores.out
For other models, please refer to the following directories for more information:
Models Directory: models/D4
Outputs Directory: outputs/D4
Results Directory: results/D4
These directories contain the model files, output files, and result files corresponding to each model.
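The file layout shown for the baseline model can be captured in a small helper that lists the expected paths for a given model and task. This is a sketch that assumes the other models follow the same naming pattern as the baseline example; the function name `expected_files` and its `deliverable` parameter are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List

SPLITS = ("devtest", "evaltest")


def expected_files(model: str, task: str, deliverable: str = "D4") -> List[str]:
    """List the model, prediction, and score files expected for one run.

    Assumes every model uses the naming shown for the baseline example.
    """
    files = [PurePosixPath("models", deliverable, task, f"{model}.pkl")]
    for split in SPLITS:
        files.append(
            PurePosixPath("outputs", deliverable, task, split, f"pred_{model}.txt")
        )
    for split in SPLITS:
        files.append(
            PurePosixPath("results", deliverable, task, split,
                          f"{deliverable}_scores.out")
        )
    return [str(path) for path in files]


if __name__ == "__main__":
    for path in expected_files("baseline", "primary"):
        print(path)
```

Called with "baseline" and "primary", the helper reproduces the five primary-task paths listed above, which makes it easy to check that a run produced everything it should.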
Distributed under the Apache License 2.0. See LICENSE for more information.