Cross-lingual Phrase Retriever

This repository contains the code and pre-trained models for our paper XPR: Cross-lingual Phrase Retriever.

**** Updates ****

5/10 We released our model checkpoint, evaluation code and dataset.
4/19 We released our paper.
2/26 Our paper has been accepted to ACL 2022.

Overview

We propose XPR, a cross-lingual phrase retriever that extracts phrase representations from unlabeled example sentences.

Dataset

We also create a large-scale cross-lingual phrase retrieval dataset, which contains 65K bilingual phrase pairs and 4.2M example sentences in 8 English-centric language pairs.

Getting Started

The following sections describe how to use XPR.

Requirements

First, install PyTorch by following the instructions on the official website. To faithfully reproduce our results, please use torch==1.8.1+cu111, or the 1.8.1 build that matches your platform and CUDA version. PyTorch versions higher than 1.8.1 should also work.
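The version requirement above can be checked before running anything else. The following is a minimal sketch, not part of the repository: the helper names `version_tuple` and `is_supported` are hypothetical, and the check only compares the numeric part of the version string, ignoring the CUDA build tag.

```python
import re

# Minimum version named in the requirements above.
MINIMUM = (1, 8, 1)


def version_tuple(version: str) -> tuple:
    """Turn a string such as '1.8.1+cu111' into (1, 8, 1).

    The local build tag after '+' (for example 'cu111') is ignored.
    """
    public = version.split("+", 1)[0]
    parts = re.findall(r"\d+", public)[:3]
    return tuple(int(p) for p in parts)


def is_supported(version: str, minimum: tuple = MINIMUM) -> bool:
    """Return True if the version is at least the minimum."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "OK" if is_supported(torch.__version__) else "older than 1.8.1"
        print(f"torch {torch.__version__}: {status}")
```

This check does not verify that the installed build matches your CUDA version; it only confirms that the version number is 1.8.1 or higher.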

Then, run the following script to fetch the repo and install the remaining dependencies.

git clone git@github.com:cwszz/XPR.git
cd XPR
pip install -r requirements.txt
mkdir data
mkdir model
mkdir result

Dataset

Before using XPR, please process the dataset by following the steps below.

Download our dataset here: link
Unzip the dataset and move it into the data folder. (Make sure the dataset path in the bash scripts points to this location.)

Checkpoint

Before using XPR, please prepare the checkpoint by following the steps below.

Download our checkpoint here: link
Move the checkpoint files from the downloaded repo into the model folder.

Train XPR

bash train.sh

Evaluation

Test our method:

Download the XPR checkpoint from Hugging Face: [link]
Make sure the model path and dataset path in test.sh are correct.
The output log can be found in the log folder.
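The path check in the steps above can be sketched as a small pre-flight script. This is an illustration under assumptions, not code from the repository: the helper `missing_paths` is hypothetical, and the paths passed to it below should be replaced with whatever your test.sh actually uses.

```python
from pathlib import Path


def missing_paths(dataset_dir: str, checkpoint: str) -> list:
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    data = Path(dataset_dir)
    if not data.is_dir():
        problems.append(f"dataset folder not found: {data}")
    elif not any(data.iterdir()):
        problems.append(f"dataset folder is empty: {data}")
    if not Path(checkpoint).is_file():
        problems.append(f"checkpoint file not found: {checkpoint}")
    return problems


if __name__ == "__main__":
    # Assumed locations; adjust them to match the paths set in test.sh.
    for problem in missing_paths("./data", "./model/pytorch_model.bin"):
        print(problem)
```

Running this before the evaluation script surfaces a wrong path immediately, rather than after the model has started loading.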

Here is an example of evaluating XPR:

bash test.sh

export CUDA_VISIBLE_DEVICES='0'
python3 predict.py \
--lg $lg \
--test_lg $test_lg \
--dataset_path ./dataset/ \
--load_model_path ./model/pytorch_model.bin \
--queue_length 0 \
--unsupervised 0 \
--wo_projection 0 \
--layer_id 12 \
> log/test-${lg}-${test_lg}-32.log 2>&1

$lg: The language on which the model was trained
$test_lg: The language on which the model will be tested
--dataset_path: The path to the dataset folder
--load_model_path: The path to the model checkpoint
--queue_length: The length of the memory queue
--unsupervised: Unsupervised mode
--wo_projection: Without the SimCLR projection head
--layer_id: The layer used to represent phrases
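To make the flag list above concrete, here is a hypothetical argparse sketch. The flag names and the log file pattern are taken from the command above; the types, defaults, and the helper names `build_parser` and `log_path` are assumptions and may differ from the actual predict.py. The language code "fr" and the path "./data/" are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names follow the command above, but the
    # types and defaults are assumptions, not read from predict.py.
    parser = argparse.ArgumentParser(description="Sketch of the evaluation flags")
    parser.add_argument("--lg", type=str, required=True,
                        help="language on which the model was trained")
    parser.add_argument("--test_lg", type=str, required=True,
                        help="language on which the model will be tested")
    parser.add_argument("--dataset_path", type=str,
                        help="path to the dataset folder")
    parser.add_argument("--load_model_path", type=str,
                        help="path to the model checkpoint")
    parser.add_argument("--queue_length", type=int, default=0,
                        help="length of the memory queue")
    parser.add_argument("--unsupervised", type=int, default=0,
                        help="unsupervised mode")
    parser.add_argument("--wo_projection", type=int, default=0,
                        help="without the SimCLR projection head")
    parser.add_argument("--layer_id", type=int, default=12,
                        help="layer used to represent phrases")
    return parser


def log_path(lg: str, test_lg: str) -> str:
    """Build the log file name using the pattern from the command above."""
    return f"log/test-{lg}-{test_lg}-32.log"


# "fr" and "./data/" are placeholders, not values confirmed by the repository.
args = build_parser().parse_args(
    ["--lg", "fr", "--test_lg", "fr", "--dataset_path", "./data/", "--layer_id", "12"]
)
print(args.layer_id, log_path(args.lg, args.test_lg))
```

With a parser like this, a flag and its value are written either as two tokens (`--layer_id 12`) or joined (`--layer_id=12`). Putting spaces around the `=` makes argparse read the `=` itself as the value, which an integer-typed flag rejects.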

References

Please cite our paper if you find the resources in this repository useful.
