YJiangcm / PromCSE

Code for "Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning (EMNLP 2022)"
https://arxiv.org/abs/2203.06875v2

PromCSE: Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning

License: MIT

Our code is adapted from SimCSE and P-tuning v2. We sincerely thank the authors of both projects for their excellent work.

Overview

Model List

We have released our supervised and unsupervised models on Hugging Face. They achieve top-1 results on one domain-shifted STS task and four standard STS tasks:

| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
|---|---|---|---|---|---|---|---|---|
| YuxinJiang/unsup-promcse-bert-base-uncased | 73.03 | 85.18 | 76.70 | 84.19 | 79.69 | 80.62 | 70.00 | 78.49 |
| YuxinJiang/sup-promcse-roberta-base | 76.75 | 85.86 | 80.98 | 86.51 | 83.51 | 86.58 | 80.41 | 82.94 |
| YuxinJiang/sup-promcse-roberta-large | 79.14 | 88.64 | 83.73 | 87.33 | 84.57 | 87.84 | 82.07 | 84.76 |

Naming rules: unsup and sup represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.

Usage

We provide an easy-to-use Python package, promcse, which offers the following functions:

(1) encode sentences into embedding vectors;
(2) compute cosine similarities between sentences;
(3) given queries, retrieve the top-k semantically similar sentences for each query.

To use the tool, first install the promcse package from PyPI:

pip install promcse

After installing the package, you can load our model with two lines of code:

from promcse import PromCSE
model = PromCSE("YuxinJiang/unsup-promcse-bert-base-uncased", "cls_before_pooler", 16)
# model = PromCSE("YuxinJiang/sup-promcse-roberta-base")
# model = PromCSE("YuxinJiang/sup-promcse-roberta-large")

You can then use the model to encode sentences into embeddings:

embeddings = model.encode("A woman is reading.")

Compute the cosine similarities between two groups of sentences:

sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
similarities = model.similarity(sentences_a, sentences_b)
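To make the similarity computation concrete, here is a minimal pure-Python sketch of pairwise cosine similarity between two groups of vectors. It is not the promcse implementation: the helper names `cosine` and `similarity_matrix` are hypothetical, the toy 3-dimensional vectors stand in for real embeddings, and the row-per-sentence_a, column-per-sentence_b layout is an assumption about what `model.similarity` returns.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(group_a, group_b):
    """Return a len(group_a) x len(group_b) matrix of pairwise cosines."""
    return [[cosine(u, v) for v in group_b] for u in group_a]

# Toy vectors standing in for sentence embeddings (hypothetical values).
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
emb_b = [[0.0, 2.0, 2.0], [1.0, 1.0, 0.0]]

sims = similarity_matrix(emb_a, emb_b)
for row in sims:
    print([round(s, 4) for s in row])
```

Cosine similarity ignores vector length, so `[0, 1, 1]` and `[0, 2, 2]` score as identical in direction.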

You can also build an index for a group of sentences and search among them:

sentences = ['A woman is reading.', 'A man is playing a guitar.']
model.build_index(sentences)
results = model.search("He plays guitar.")
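Conceptually, searching an index amounts to ranking the indexed sentences by their similarity to the query. The sketch below shows a brute-force version of that idea in pure Python. The function names, the `top_k` parameter, and the toy vectors are all hypothetical; the real package may use a different backend, defaults, and return format.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def brute_force_search(query_vec, index, top_k=2):
    """Return the top_k (sentence, score) pairs, best match first."""
    scored = [(sentence, cosine(query_vec, vec)) for sentence, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy index: each sentence is paired with a made-up embedding.
index = [
    ("A woman is reading.", [0.9, 0.1, 0.0]),
    ("A man is playing a guitar.", [0.1, 0.9, 0.2]),
]
query_vec = [0.0, 1.0, 0.1]  # stand-in embedding for "He plays guitar."

hits = brute_force_search(query_vec, index, top_k=1)
print(hits[0][0])
```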

Train PromCSE

In the following section, we describe how to train a PromCSE model by using our code.

Setups

First install a version of PyTorch that supports your CUDA setup. Then run the following command to install the remaining dependencies:

pip install -r requirements.txt

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation uses the "all" setting and reports Spearman's correlation. The STS tasks include seven standard STS tasks (STS12-16, STS-B, SICK-R) and one domain-shifted STS task (CxC).
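For readers unfamiliar with the metric, Spearman's correlation is the Pearson correlation computed on ranks rather than raw values, so it rewards a model for ordering sentence pairs the same way the gold scores do. The following is a minimal pure-Python sketch with average ranks for ties. It is not the SentEval code, and the gold and predicted scores are invented for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

gold = [0.0, 1.5, 3.0, 4.2, 5.0]       # hypothetical human similarity scores
pred = [0.10, 0.35, 0.30, 0.80, 0.95]  # hypothetical cosine similarities
rho = spearman(gold, pred)
print(round(rho, 4))
```

Only the ordering matters here: the predictions swap the second and third pairs relative to the gold ranking, which is what pulls the correlation below 1.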

Before evaluation, please download the evaluation datasets by running

cd SentEval/data/downstream/
bash download_dataset.sh

To evaluate the domain-shift robustness of the sentence embeddings, download CxC and put the data into SentEval/data/downstream/CocoCXC.

Then return to the root directory, where you can evaluate trained models using our evaluation code. For example,

python evaluation.py \
    --model_name_or_path YuxinJiang/sup-promcse-roberta-large \
    --pooler_type cls \
    --task_set sts \
    --mode test \
    --pre_seq_len 10

which is expected to output the results in a tabular format:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 79.14 | 88.64 | 83.73 | 87.33 | 84.57 |    87.84     |      82.07      | 84.76 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
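The Avg. column in this output is consistent with an unweighted mean of the seven task scores. The short check below recomputes it from the numbers printed above; the dictionary is just a convenient container and is not part of the evaluation script.

```python
# Scores copied from the example output above.
scores = {
    "STS12": 79.14,
    "STS13": 88.64,
    "STS14": 83.73,
    "STS15": 87.33,
    "STS16": 84.57,
    "STSBenchmark": 87.84,
    "SICKRelatedness": 82.07,
}

# Unweighted mean over the seven tasks, rounded to two decimals.
avg = round(sum(scores.values()) / len(scores), 2)
print(avg)
```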

The example above passes the following arguments to the evaluation script: --model_name_or_path, --pooler_type, --task_set, --mode, and --pre_seq_len.

Training

Data

Following SimCSE, we use the same datasets to train our unsupervised models and supervised models. You can run data/download_wiki.sh and data/download_nli.sh to download the two datasets.

Training scripts
(The same as run_unsup_example.sh)

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-promcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 256 \
    --learning_rate 3e-2 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --pre_seq_len 16 \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16

We provide example training scripts for both unsupervised and supervised PromCSE. run_unsup_example.sh is a single-GPU (or CPU) example for the unsupervised version, and run_sup_example.sh is a multi-GPU example for the supervised version. Both scripts call train.py for training. We explain the arguments as follows:

All the other arguments are standard Hugging Face transformers training arguments. Some of the most commonly used are --output_dir, --learning_rate, and --per_device_train_batch_size. Our example scripts also evaluate the model on the STS-B development set (download the dataset first, as described in the Evaluation section) and save the best checkpoint.
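As background for the --temp argument, here is a minimal pure-Python sketch of a temperature-scaled in-batch contrastive loss of the kind used by SimCSE, on which this code base builds. It is an illustration under stated assumptions, not the PromCSE training code: the function names are hypothetical, the 2-dimensional vectors are toy values, and the sketch does not attempt to reproduce the prompt parameters or the energy-based objective from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(anchors, positives, temp=0.05):
    """Mean cross-entropy where positives[i] matches anchors[i] and the
    other positives in the batch serve as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temp for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

A small temperature such as 0.05 sharpens the softmax, so a mismatched pair is penalized heavily while a well-aligned pair contributes almost no loss.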

All our experiments are conducted on Nvidia 3090 GPUs.

Hyperparameters

| Unsupervised | BERT-base | BERT-large | RoBERTa-base | RoBERTa-large |
|---|---|---|---|---|
| Batch size | 256 | 256 | 64 | 64 |
| Learning rate | 3e-2 | 3e-2 | 3e-2 | 1e-2 |
| Prompt length | 16 | 10 | 14 | 10 |
| do_mlm | False | False | True | True |
| Epoch | 1 | 1 | 1 | 1 |
| Valid steps | 125 | 125 | 125 | 125 |

| Supervised | BERT-base | BERT-large | RoBERTa-base | RoBERTa-large |
|---|---|---|---|---|
| Batch size | 256 | 256 | 512 | 512 |
| Learning rate | 1e-2 | 5e-3 | 1e-2 | 5e-3 |
| Prompt length | 12 | 12 | 10 | 10 |
| do_mlm | False | False | False | False |
| Epoch | 10 | 10 | 10 | 10 |
| Valid steps | 125 | 125 | 125 | 125 |
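One way to keep these settings in sync with the training command is to store them in a small lookup table and generate the flags from it. The sketch below does this for the unsupervised settings only. Several things here are assumptions rather than documented behavior: the helper `build_train_args` is hypothetical, the mapping of "Batch size" to --per_device_train_batch_size presumes a single device, "Prompt length" is mapped to --pre_seq_len and "Valid steps" to --eval_steps by analogy with the example command, and `--do_mlm` is an assumed flag name for the do_mlm row.

```python
# Unsupervised settings copied from the hyperparameter table, keyed by column.
UNSUP_HPARAMS = {
    "BERT-base":     {"batch_size": 256, "learning_rate": "3e-2", "pre_seq_len": 16, "do_mlm": False},
    "BERT-large":    {"batch_size": 256, "learning_rate": "3e-2", "pre_seq_len": 10, "do_mlm": False},
    "RoBERTa-base":  {"batch_size": 64,  "learning_rate": "3e-2", "pre_seq_len": 14, "do_mlm": True},
    "RoBERTa-large": {"batch_size": 64,  "learning_rate": "1e-2", "pre_seq_len": 10, "do_mlm": True},
}

def build_train_args(column, model_name_or_path):
    """Assemble a train.py argument list for one table column (hypothetical helper)."""
    hp = UNSUP_HPARAMS[column]
    args = [
        "python", "train.py",
        "--model_name_or_path", model_name_or_path,
        "--per_device_train_batch_size", str(hp["batch_size"]),  # assumes one device
        "--learning_rate", hp["learning_rate"],
        "--pre_seq_len", str(hp["pre_seq_len"]),
        "--num_train_epochs", "1",
        "--eval_steps", "125",
    ]
    if hp["do_mlm"]:
        args.append("--do_mlm")  # assumed flag name, not confirmed by this README
    return args

bert_args = build_train_args("BERT-base", "bert-base-uncased")
print(" ".join(bert_args))
```

The remaining flags from the example command (such as --train_file, --output_dir, --pooler_type, and --temp) would still need to be appended before running.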

Citation

Please cite our paper by:

@inproceedings{jiang-etal-2022-improved,
    title = "Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning",
    author = "Jiang, Yuxin  and
      Zhang, Linhan  and
      Wang, Wei",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.220",
    pages = "3021--3035",
}

@article{DBLP:journals/corr/abs-2203-06875,
  author       = {Yuxin Jiang and
                  Wei Wang},
  title        = {Deep Continuous Prompt for Contrastive Learning of Sentence Embeddings},
  journal      = {CoRR},
  volume       = {abs/2203.06875},
  year         = {2022},
  url          = {https://doi.org/10.48550/arXiv.2203.06875},
  doi          = {10.48550/ARXIV.2203.06875},
  eprinttype    = {arXiv},
  eprint       = {2203.06875},
  timestamp    = {Wed, 16 Mar 2022 16:41:29 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2203-06875.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}