This package fine-tunes language models to create clustering-friendly embeddings. It is based on the paper *Supervised clustering loss for clustering-friendly sentence embeddings: An application to intent clustering* (2023) by G. Barnabo, A. Uva, S. Pollastrini, C. Rubagotti, and D. Bernardi:
```bibtex
@inproceedings{Barnabo2023,
  author    = {Giorgio Barnabo and Antonio Uva and Sandro Pollastrini and Chiara Rubagotti and Davide Bernardi},
  title     = {Supervised clustering loss for clustering-friendly sentence embeddings: An application to intent clustering},
  year      = {2023},
  url       = {https://www.amazon.science/publications/supervised-clustering-loss-for-clustering-friendly-sentence-embeddings-an-application-to-intent-clustering},
  booktitle = {IJCNLP-AACL 2023},
}
```
```
p-lightning-template
| conf                    # contains Hydra config files
|   | data
|   | model
|   | train
|   | root.yaml           # Hydra root config file
| data                    # datasets should go here
| experiments             # where the models are stored
| src
|   | pl_data_modules.py  # base LightningDataModule
|   | pl_modules.py       # base LightningModule
|   | train.py            # main script for training the network
| README.md
| requirements.txt
| setup.sh                # environment setup script
```
The structure of the repository is deliberately simple and involves mainly four components: the Hydra config files (`conf`), the data directory (`data`), the experiments directory (`experiments`), and the source code (`src`).
To set up the Python interpreter we use conda: the script `setup.sh` creates a conda environment and installs PyTorch and the dependencies listed in `requirements.txt`.
To use this repository as a starting template for your projects, just click the green "Use this template" button at the top of this page. You can read more about using GitHub template repositories at the following link.
Q: When I run a script that uses a Hydra config, relative paths do not work. Why?
A: Whenever you run a script that uses a Hydra config, Hydra creates a new working directory for you (specified in the `root.yaml` file). Every relative path you use will start from it, which is why you get the `FileNotFoundError`. However, using a different working directory for each of your experiments has a couple of benefits, which you can read about in the Hydra documentation on the working directory. Hydra offers several workarounds for this problem; here we report the two that the authors of this repository use the most, but you can find the others at the link mentioned above:
You can use the `hydra.utils.to_absolute_path` function, which converts every path relative to your original working directory (`p-lightning-template` in this project) into an absolute path that is accessible from inside the new working directory.
Hydra provides a reference to the original working directory in your config files, accessible as `${hydra:runtime.cwd}`. So, for example, if your training dataset has the relative path `data/train.tsv`, you can convert it to a full path by prepending the Hydra variable: `${hydra:runtime.cwd}/data/train.tsv`.
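To illustrate the first workaround, the snippet below mimics what `hydra.utils.to_absolute_path` does conceptually (a minimal sketch: the real Hydra function also consults Hydra's internal state, and the paths used here are hypothetical):

```python
import os

def to_absolute_path_like(path: str, original_cwd: str) -> str:
    # Sketch of hydra.utils.to_absolute_path: resolve a relative path
    # against the ORIGINAL working directory instead of the per-run
    # working directory that Hydra switches into.
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(original_cwd, path))

# Hypothetical checkout location, for illustration only:
print(to_absolute_path_like("data/train.tsv", "/home/user/p-lightning-template"))
# -> /home/user/p-lightning-template/data/train.tsv
```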
Contributions are always more than welcome. The only thing to take into account when submitting a pull request is that we use the Black code formatter with a maximum line length of 120. More pragmatically, make sure to run `black -l 120` on the whole `src` directory before pushing your code.
This repository has been created with the idea of providing a simple skeleton from which you can start a PyTorch Lightning project. Instead of favoring customizability, we favored simplicity, and we intend this template as a base for building more specific templates tailored to the user's needs (for example by forking this one). However, there are several other repositories with different features that you can check out if interested. We will list two of them here:
You need to have conda installed. Please refer to the conda installation page. The miniconda version is sufficient.
After installing conda, run the following script:

```sh
sh setup.sh
```
Download a sentence-transformer model by selecting its name here: HuggingFace Sentence-Transformer. E.g., suppose you selected `paraphrase-multilingual-MiniLM-L12-v2`; then run:

```sh
python download_base_model.py <name-of-the-model>
```

The script will download the model into the folder `base_language_models`.
When we tested our model, we used the following four base sentence encoders in the `base_language_models` folder:

- `bert-base-multilingual-cased`
- `xlm-roberta-base`
- `sentence-transformers_all-mpnet-base-v2`
- `paraphrase-multilingual-mpnet-base-v2`

We suggest using `paraphrase-multilingual-mpnet-base-v2`, which gives good performance even without fine-tuning.
To fine-tune any of the four base sentence encoders, follow these steps:

1. Place your dataset in `supervised-intent-clustering/data/New_Fine_Tuning_Dataset`. Each file (train, dev, test) should be a CSV file with the following columns: `dataset,utterance_id,utterance_split,utterance_lang,utterance_text,utterance_intent`.
2. Update the config file `supervised-intent-clustering/conf/dataset_specific_hyperparams/new_fine_tuning_dataset.yaml`. Note that `intent_classes_per_batch * samples_per_class_per_batch` should not be greater than the number of examples in train, dev, or test; otherwise the error `IndexError: list index out of range` will be raised.
3. Launch the training:

```sh
PYTHONPATH=. python3 ./src/train.py dataset_specific_hyperparams=new_fine_tuning_dataset model.base_model_name=<the_base_model_you_want_to_fine_tune> model.training_objective=<the_training_objective_you_want_to_use>
```
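To make the expected dataset format concrete, the snippet below builds a toy CSV with the required columns and checks the batch-size constraint described above (all row values and hyper-parameter numbers are hypothetical examples, not taken from a real dataset):

```python
import csv
import io

# Column layout required for each of the train/dev/test CSV files.
COLUMNS = ["dataset", "utterance_id", "utterance_split",
           "utterance_lang", "utterance_text", "utterance_intent"]

# Hypothetical example rows, just to illustrate the layout.
rows = [
    ["my_dataset", "u1", "train", "en", "turn on the lights", "SmartHome"],
    ["my_dataset", "u2", "train", "en", "play some jazz", "Music"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerows(rows)

# Sanity check mirroring the constraint above: a batch must not
# require more examples than the split contains.
intent_classes_per_batch = 2        # hypothetical config value
samples_per_class_per_batch = 1     # hypothetical config value
num_train_examples = len(rows)
assert intent_classes_per_batch * samples_per_class_per_batch <= num_train_examples
```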
The base model can be one of:

- `bert-base-multilingual-cased`
- `xlm-roberta-base`
- `sentence-transformers_all-mpnet-base-v2`
- `paraphrase-multilingual-mpnet-base-v2`

We suggest `sentence-transformers_all-mpnet-base-v2` for English-only datasets and `paraphrase-multilingual-mpnet-base-v2` for multilingual ones. The training objective can be either the `supervised_learning` loss or the `triplet_margin_loss`. Default hyper-parameters should work just fine.

When training finishes due to early stopping, you will find your fine-tuned model in the corresponding folder under `supervised-intent-clustering/fine_tuned_language_models`. The produced sentence encoder can be used directly with the Hugging Face Sentence-Transformers library:
```python
from typing import List

from sentence_transformers import SentenceTransformer

def get_sentence_embeddings(list_of_sentences: List[str],
                            hf_model_name_or_path: str):
    model = SentenceTransformer(hf_model_name_or_path)
    sentence_embeddings = model.encode(
        list_of_sentences, batch_size=64, show_progress_bar=True)
    return sentence_embeddings
```
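Once you have the embeddings, you can feed them to any similarity-based or clustering routine. As a minimal, library-free sketch (using toy vectors, not real model outputs), pairwise cosine similarity can be computed like this:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings", for illustration only.
emb_1 = [1.0, 0.0, 0.0]
emb_2 = [0.0, 1.0, 0.0]
emb_3 = [1.0, 0.0, 0.0]

print(cosine_similarity(emb_1, emb_2))  # orthogonal vectors -> 0.0
print(cosine_similarity(emb_1, emb_3))  # identical vectors -> 1.0
```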
Experiment metrics are logged to `supervised-intent-clustering/experiment_metrics.txt`. In particular, you want the PRAUC after training to be higher than the PRAUC before training on the train, dev, and test sets. If you run multiple experiments, you can turn the `experiment_metrics.txt` file into a more readable `.csv` file by running the following command:
```sh
PYTHONPATH=. python3 ./src/training_log_post_processing.py
```

Other relevant files:

- `supervised-intent-clustering/conf/model/default_model.yaml`
- `supervised-intent-clustering/conf/train/default_train.yaml`
- `supervised-intent-clustering/conf/data/default_data.yaml`
- `supervised-intent-clustering/src/train.py`