This repository contains the source code and data used for the paper:
Automatic Generation of Topic Labels (2020). Areej Alokaili, Nikolaos Aletras and Mark Stevenson. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), July 25–30, 2020, Virtual Event, China. https://doi.org/10.1145/3397271.3401185 Pre-print
Python 3.6.9 is used.
The libraries below are needed for evaluation only. You can skip this step if you plan to use an evaluation metric other than BERTScore.
To install all required libraries, run:
pip install -r requirements.txt
To train the model (the data are already processed and ready, so only training is needed):
Navigate to topic_labelling/
python train_tf.py -m 'bigru_bahdanau_attention' -d 'wiki_tfidf'
python train_tf.py -m 'bigru_bahdanau_attention' -d 'wiki_sent'
Training stops when no further improvement is recorded, and all checkpoints are saved in training_checkpoint/data_name/ .
python train_tf.py -h
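The stopping behaviour described above can be illustrated with a minimal sketch. It is not the logic of train_tf.py, which is not shown here. It assumes that "improvement" means a lower validation loss and that training tolerates a fixed number of non-improving epochs; the function name and the `patience` parameter are hypothetical.

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return (epochs_run, best_epoch, best_loss) for a sequence of validation losses.

    `val_losses` stands in for the per-epoch validation loss of a real model.
    `patience` is a hypothetical setting, not a documented option of train_tf.py.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    epochs_run = 0

    for epoch, loss in enumerate(val_losses, start=1):
        epochs_run = epoch
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` consecutive epochs: stop.
                break

    return epochs_run, best_epoch, best_loss


if __name__ == "__main__":
    # Illustrative loss values only, not results from the paper.
    losses = [2.90, 2.41, 2.19, 2.25, 2.30, 2.10]
    print(train_with_early_stopping(losses, patience=2))
```

With the illustrative losses above, the loop stops after the fifth epoch and never sees the sixth value, which is the usual trade-off of a small patience.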
Generate TITLES for a subset of Wikipedia articles (1000 articles)
python test_tf.py -m 'bigru_bahdanau_attention' -s 1000 -d 'wiki_tfidf' --load 'NAME_OF_CHECKPOINT'
*Replace NAME_OF_CHECKPOINT with the name of your checkpoint. For example: python test_tf.py -d 'wiki_tfidf' -m 'bigru_bahdanau_attention' --load bigru_bahdanau_attention_e_1_valloss2.19-2
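When several checkpoints exist, a small helper can pick the one with the lowest validation loss to pass to --load. The sketch below assumes that every checkpoint name follows the pattern of the single example above (model name, `_e_` plus epoch, `_valloss` plus loss, and an optional trailing `-N`). That pattern is inferred from one name, not documented, and the helper functions are hypothetical.

```python
import re

# Pattern inferred from the example name
# "bigru_bahdanau_attention_e_1_valloss2.19-2"; treat it as an assumption.
CHECKPOINT_PATTERN = re.compile(
    r"^(?P<model>.+)_e_(?P<epoch>\d+)"
    r"_valloss(?P<val_loss>\d+(?:\.\d+)?)"
    r"(?:-(?P<suffix>\d+))?$"
)


def parse_checkpoint_name(name):
    """Split a checkpoint name into model, epoch, validation loss and suffix."""
    match = CHECKPOINT_PATTERN.match(name)
    if match is None:
        raise ValueError("unrecognised checkpoint name: %s" % name)
    return {
        "model": match.group("model"),
        "epoch": int(match.group("epoch")),
        "val_loss": float(match.group("val_loss")),
        # The meaning of the trailing number is not stated, so keep it as text.
        "suffix": match.group("suffix"),
    }


def best_checkpoint(names):
    """Return the name with the lowest validation loss encoded in it."""
    return min(names, key=lambda n: parse_checkpoint_name(n)["val_loss"])


if __name__ == "__main__":
    names = [
        "bigru_bahdanau_attention_e_1_valloss2.19-2",
        # Hypothetical second checkpoint, for illustration only.
        "bigru_bahdanau_attention_e_2_valloss2.05-3",
    ]
    print(best_checkpoint(names))
```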
Generate LABELS for bhatia_topics
python test_tf.py -m 'bigru_bahdanau_attention' -s 1000 -d 'wiki_tfidf' --load 'NAME_OF_CHECKPOINT' -te 'bhatia_topics'
Generate LABELS for bhatia_topics_tfidf
python test_tf.py -m 'bigru_bahdanau_attention' -s 1000 -d 'wiki_tfidf' --load 'NAME_OF_CHECKPOINT' -te 'bhatia_topics_tfidf'
Predictions, golds, and topics will be stored at results/data_name/ as
To measure the similarity between predicted and gold labels, run:
python compute_bertscore.py -g results/path_to_gold_file.out -p results/path_to_predict_file.out
Output includes precision (P), recall (R) and f-score (F).
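To make the three reported numbers concrete, here is a minimal sketch of the greedy matching behind BERTScore. It is not what compute_bertscore.py runs: real BERTScore uses contextual token embeddings from a pretrained model (optionally with idf weighting and baseline rescaling), whereas this sketch uses hand-made 2-dimensional toy vectors, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_prf(pred_vecs, gold_vecs):
    """Precision, recall and F-score from greedy token matching.

    Precision: each predicted token is matched to its most similar gold token.
    Recall: each gold token is matched to its most similar predicted token.
    F: harmonic mean of precision and recall.
    """
    precision = sum(
        max(cosine(p, g) for g in gold_vecs) for p in pred_vecs
    ) / len(pred_vecs)
    recall = sum(
        max(cosine(g, p) for p in pred_vecs) for g in gold_vecs
    ) / len(gold_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Toy vectors standing in for token embeddings (illustrative only):
    # the prediction has two tokens, the gold label has one.
    pred = [[1.0, 0.0], [0.0, 1.0]]
    gold = [[1.0, 0.0]]
    print(greedy_match_prf(pred, gold))
```

In this toy case the extra predicted token has no counterpart in the gold label, so precision drops to 0.5 while recall stays at 1.0.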
train_tf.py: code to train the labelling network.
test_tf.py: code to generate new titles/labels.
model_archi_tf.py: the neural network structure is defined here.
support_methods.py: contains helper methods used throughout the system.
extract_additional_terms_for_topics.ipynb: notebook showing the steps taken to filter topic/label pairs based on the overall human rating and to match them to similar documents in order to extract additional terms for bhatia_topics_tfidf.
compute_bertscore.py: script to compute pairwise BERTScore between predicted titles/labels and gold titles/labels.
data
results: this is where the model's outputs are saved as text files.
training_checkpoints: model checkpoints are saved here.