Interpretable and Cautious Text Classification
Description
This is a collection of source code necessary for reproducibility.
Dataset
There are two main data folders: dataset and data.
- The dataset folder contains the datasets used in our experiments: 1) IMDB, 2) ArXiV, 3) AGnews.
We're providing all of the datasets except for IMDB, which is publicly available, in our submission.
- The data folder contains our generated lists of keywords for each dataset.
Source Code
The root directory consists of two main file types: .py and .sh
To reproduce the results, please run the command line as stated in the .sh files.
shell script
- run_baseline.sh: to reproduce the results from Logistic Regression
- run_hierarchical_attention.sh: to reproduce the results from HN and HAN (you need Glove Embedding)
- run_cautious.sh: to reproduce the results from our model
Python Code
- train.py: Main .py script of our model
- train_baseline.py: .py script for Logistic Regression
- train_hierarchical.py: .py script for HN and HAN
Parameter
Note that we're trying to uniform our model's setup linearly with Logistic Regression, thus the total number of parameter in our model is linear to the input document in initial assessment f_D + final classification f_C.
***Please follow the shell script