
X-Transformer: Taming Pretrained Transformers for eXtreme Multi-label Text Classification
BSD 3-Clause "New" or "Revised" License

Taming Pretrained Transformers for XMC problems

This is the README for the experimental code of the following paper:

Taming Pretrained Transformers for eXtreme Multi-label Text Classification

Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, Inderjit Dhillon

KDD 2020

Updates (2021-04-27)

The latest implementation of X-Transformer (faster training with stronger performance) is available in PECOS; feel free to try it out!

Installation

Dependencies via Conda Environment

> conda env create -f environment.yml
> source activate pt1.2_xmlc_transformer
> (pt1.2_xmlc_transformer) pip install -e .
> (pt1.2_xmlc_transformer) python setup.py install --force

**Notice:** the following examples are executed under the (pt1.2_xmlc_transformer) conda virtual environment.
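
As a quick sanity check of the installation (xbert is the module name invoked throughout the commands below):

> (pt1.2_xmlc_transformer) python -c "import xbert"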

Reproduce Evaluation Results in the Paper

We demonstrate how to reproduce the evaluation results in our paper by downloading the raw datasets and pretrained models.

Download Datasets (Eurlex-4K, Wiki10-31K, AmazonCat-13K, Wiki-500K)

Change directory into the ./datasets folder, then download and unzip each dataset:

cd ./datasets
bash download-data.sh Eurlex-4K
bash download-data.sh Wiki10-31K
bash download-data.sh AmazonCat-13K
bash download-data.sh Wiki-500K
cd ../

Each dataset folder contains, among other files, the instance feature matrices (X.trn.npz, X.tst.npz) and label matrices (Y.trn.npz, Y.tst.npz) referenced by the commands below.

Download Pretrained Models (processed data, indexing codes, fine-tuned Transformer models)

Change directory into the ./pretrained_models folder, then download and unzip the models for each dataset:

cd ./pretrained_models
bash download-models.sh Eurlex-4K
bash download-models.sh Wiki10-31K
bash download-models.sh AmazonCat-13K
bash download-models.sh Wiki-500K
cd ../

Each downloaded folder has the following structure:

Evaluate Linear Models

Given the provided indexing codes (label-to-cluster assignments), train/predict linear models, and evaluate with Precision/Recall@k:

bash eval_linear.sh ${DATASET} ${VERSION}

The evaluation results should be located at ./results_linear/${DATASET}.${VERSION}.txt.
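
For example, for Eurlex-4K (the version tag v0 is illustrative; use the version tag that ships with the downloaded models):

bash eval_linear.sh Eurlex-4K v0
cat ./results_linear/Eurlex-4K.v0.txt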

Evaluate Fine-tuned X-Transformer Models

Given the provided indexing codes (label-to-cluster assignments) and the fine-tuned Transformer models, train/predict the ranker of the X-Transformer framework and evaluate with Precision/Recall@k:

bash eval_transformer.sh ${DATASET}

The evaluation results should be located at ./results_transformer/${DATASET}.final.txt.
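
For example, for Eurlex-4K:

bash eval_transformer.sh Eurlex-4K
cat ./results_transformer/Eurlex-4K.final.txt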

Running X-Transformer on custom datasets

The X-Transformer framework consists of 9 configurations (3 label embeddings × 3 model types). For simplicity, we show 1 of the 9 here, using LABEL_EMB=pifa-tfidf and MODEL_TYPE=bert.
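
For reference, the full grid can be enumerated as follows; the label-embedding and model-type names below are the ones used in the paper (BERT, RoBERTa, XLNet), but adjust if your checkout uses different identifiers:

# enumerate all 9 (label embedding, model type) configurations
for LABEL_EMB in pifa-tfidf pifa-neural text-emb; do
    for MODEL_TYPE in bert roberta xlnet; do
        echo "configuration: LABEL_EMB=${LABEL_EMB}, MODEL_TYPE=${MODEL_TYPE}"
    done
done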

We will use Eurlex-4K as an example. In the ./datasets/Eurlex-4K folder, we assume the feature matrices (X.trn.npz, X.tst.npz), the label matrices (Y.trn.npz, Y.tst.npz), and the raw text files are provided.

Given those input files, the pipeline can be divided into three stages: Indexer, Matcher, and Ranker.
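
For concreteness, the walkthrough below assumes the following shell variables. DATASET, LABEL_EMB, and MODEL_TYPE follow the choices above, and MODEL_NAME matches the Ranker stage below; the remaining values are illustrative assumptions, flagged in the comments:

DATASET=Eurlex-4K
DATA_DIR=./datasets/${DATASET}
LABEL_EMB=pifa-tfidf
MODEL_TYPE=bert
MODEL_NAME=bert-large-cased-whole-word-masking
MAX_XSEQ_LEN=128                            # assumed maximum input sequence length
LABEL_EMB_INST_PATH=${DATA_DIR}/X.trn.npz   # assumed instance features for building the pifa label embedding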

Indexer

In stage 1, we will (1) construct label embeddings, (2) perform hierarchical 2-means clustering of the labels, and (3) preprocess the inputs and outputs for the Transformer models.

TLDR: we combine and summarize (1), (2), (3) into two scripts: run_preprocess_label.sh and run_preprocess_feat.sh. See the more detailed explanation in the following.

(1) To construct label embedding,

OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
mkdir -p ${PROC_DATA_DIR}
python -m xbert.preprocess \
    --do_label_embedding \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -l ${LABEL_EMB} \
    -x ${LABEL_EMB_INST_PATH}

This should yield L.${LABEL_EMB}.npz in the PROC_DATA_DIR.

(2) To perform hierarchical 2-means,

SEED_LIST=( 0 1 2 )
for SEED in "${SEED_LIST[@]}"; do
    LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
    INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
    python -u -m xbert.indexer \
        -i ${PROC_DATA_DIR}/L.${LABEL_EMB}.npz \
        -o ${INDEXER_DIR} --seed ${SEED}
done

This should yield code.npz in each INDEXER_DIR.
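
The three seeds produce three different label clusterings; the rest of this walkthrough proceeds with seed 0. To confirm all three were written:

for SEED in 0 1 2; do
    ls -lh ${OUTPUT_DIR}/${LABEL_EMB}-s${SEED}/indexer/code.npz
done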

(3) To preprocess the inputs and outputs for the Transformer models,

SEED=0
LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
python -u -m xbert.preprocess \
    --do_proc_label \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -l ${LABEL_EMB_NAME} \
    -c ${INDEXER_DIR}/code.npz

This should yield the instance-to-cluster matrices C.trn.${LABEL_EMB_NAME}.npz and C.tst.${LABEL_EMB_NAME}.npz in the PROC_DATA_DIR.

To preprocess the input text features for the Transformer models,
OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
python -u -m xbert.preprocess \
    --do_proc_feat \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -m ${MODEL_TYPE} \
    -n ${MODEL_NAME} \
    --max_xseq_len ${MAX_XSEQ_LEN} \
    |& tee ${PROC_DATA_DIR}/log.${MODEL_TYPE}.${MAX_XSEQ_LEN}.txt

This should yield X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl and X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl in the PROC_DATA_DIR, which the matcher commands below consume.
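
At this point, under the assumed variable values, PROC_DATA_DIR should contain roughly the following (exact names depend on your settings):

ls ${PROC_DATA_DIR}
# L.pifa-tfidf.npz
# C.trn.pifa-tfidf-s0.npz  C.tst.pifa-tfidf-s0.npz
# X.trn.bert.128.pkl       X.tst.bert.128.pkl
# log.bert.128.txt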

Matcher

In stage 2, we will (1) fine-tune the Transformer matcher and (2) generate cluster predictions and instance embeddings.

TLDR: we combine and summarize (1), (2) into the script run_transformer_train.sh. See the more detailed explanation in the following.

(1) Assume we have 8 Nvidia V100 GPUs. To train the models,

# example hyperparameter values (illustrative, not tuned; adjust per dataset)
PER_DEVICE_TRN_BSZ=8
GRAD_ACCU_STEPS=4
MAX_STEPS=1000
WARMUP_STEPS=100
LEARNING_RATE=5e-5
LOGGING_STEPS=100
INDEXER_NAME=${LABEL_EMB_NAME}   # e.g., pifa-tfidf-s0, from the Indexer stage
MODEL_DIR=${OUTPUT_DIR}/${INDEXER_NAME}/matcher/${MODEL_NAME}
mkdir -p ${MODEL_DIR}
python -m torch.distributed.launch \
    --nproc_per_node 8 xbert/transformer.py \
    -m ${MODEL_TYPE} -n ${MODEL_NAME} --do_train \
    -x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
    -o ${MODEL_DIR} --overwrite_output_dir \
    --per_device_train_batch_size ${PER_DEVICE_TRN_BSZ} \
    --gradient_accumulation_steps ${GRAD_ACCU_STEPS} \
    --max_steps ${MAX_STEPS} \
    --warmup_steps ${WARMUP_STEPS} \
    --learning_rate ${LEARNING_RATE} \
    --logging_steps ${LOGGING_STEPS} \
    |& tee ${MODEL_DIR}/log.txt

(2) To generate predictions and instance embedding,

GPID=0,1,2,3,4,5,6,7
PER_DEVICE_VAL_BSZ=32
CUDA_VISIBLE_DEVICES=${GPID} python -u xbert/transformer.py \
    -m ${MODEL_TYPE} -n ${MODEL_NAME} \
    --do_eval -o ${MODEL_DIR} \
    -x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
    -x_tst ${PROC_DATA_DIR}/X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_tst ${PROC_DATA_DIR}/C.tst.${INDEXER_NAME}.npz \
    --per_device_eval_batch_size ${PER_DEVICE_VAL_BSZ}

This should yield the following outputs in the MODEL_DIR: the predicted cluster matrices C_trn_pred.npz and C_tst_pred.npz, and the instance embeddings trn_embeddings.npy and tst_embeddings.npy, which the ranker below consumes.

Ranker

In stage 3, we will (1) train the linear rankers and (2) predict the final top-k labels.

TLDR: we combine and summarize (1), (2) into the script run_transformer_predict.sh. See the more detailed explanation in the following.

(1) To train linear rankers,

LABEL_NAME=pifa-tfidf-s0
MODEL_NAME=bert-large-cased-whole-word-masking
OUTPUT_DIR=save_models/${DATASET}/${LABEL_NAME}
INDEXER_DIR=${OUTPUT_DIR}/indexer
MATCHER_DIR=${OUTPUT_DIR}/matcher/${MODEL_NAME}
RANKER_DIR=${OUTPUT_DIR}/ranker/${MODEL_NAME}
mkdir -p ${RANKER_DIR}
python -m xbert.ranker train \
    -x1 ${DATA_DIR}/X.trn.npz \
    -x2 ${MATCHER_DIR}/trn_embeddings.npy \
    -y ${DATA_DIR}/Y.trn.npz \
    -z ${MATCHER_DIR}/C_trn_pred.npz \
    -c ${INDEXER_DIR}/code.npz \
    -o ${RANKER_DIR} -t 0.01 \
    -f 0 --mode ranker

(2) To predict the final top-k labels,

PRED_NPZ_PATH=${RANKER_DIR}/tst.pred.npz
python -m xbert.ranker predict \
    -m ${RANKER_DIR} -o ${PRED_NPZ_PATH} \
    -x1 ${DATA_DIR}/X.tst.npz \
    -x2 ${MATCHER_DIR}/tst_embeddings.npy \
    -y ${DATA_DIR}/Y.tst.npz \
    -z ${MATCHER_DIR}/C_tst_pred.npz \
    -f 0 -t noop

This should yield the predicted top-k labels in tst.pred.npz at the path specified by PRED_NPZ_PATH.
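
To sanity-check the output (assuming, as with the other .npz files in this pipeline, that tst.pred.npz stores a SciPy sparse score matrix of shape num_test_instances × num_labels):

python -c "
import numpy as np, scipy.sparse as smat
pred = smat.load_npz('${PRED_NPZ_PATH}')  # path set above
print('shape:', pred.shape)               # (num_test_instances, num_labels)
row0 = pred[0].toarray().ravel()          # scores for the first test instance
print('top-5 labels:', np.argsort(-row0)[:5])
"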

Acknowledgements

Some portions of this repo are borrowed from the following repos: