Our Model "AlpaCaR" is pronounced as "/ˈælpəˈkɑːr/". The logo is generated by DALL·E 3.
conda create --name car python=3.8
conda activate car
pip install poetry
poetry install
Download the IQS or Comet model from the Huggingface Link and save it under /CaR/Ranking/lightning_logs/.
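If you prefer to script the download, a minimal sketch with huggingface_hub is below; the repo id is a placeholder, not the actual repository name, so substitute the id from the link above:

# Hypothetical download script; "org/iqs-model" is a placeholder repo id.
# Replace it with the repository id from the Huggingface Link above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="org/iqs-model",              # placeholder, not the real id
    local_dir="CaR/Ranking/lightning_logs",
)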
Default setting
python Ranking/split_IQS.py --batch_size=128
Using another instruction file
python Ranking/split_IQS.py --input='XX.json'
'XX.json' must follow the same format as 'alpaca_data.json'.
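For reference, 'alpaca_data.json' (from Stanford Alpaca) is a JSON list of records with instruction, input, and output fields, for example:

[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
  }
]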
Default setting
python Clustering/cluster.py
Using another instruction file with scores
python Clustering/cluster.py --input='XX.json'
'XX.json' must follow the same format as './data/ranking_IQS_data.json'.
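The exact schema is defined by './data/ranking_IQS_data.json'; a plausible sketch, assuming each record keeps the alpaca fields plus a quality score under a hypothetical "score" key (check the reference file for the real field names), looks like:

[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
    "score": 0.87
  }
]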
Instead of using the pretrained models, you can train your own model with the following command:
comet-train --cfg configs/models/{your_model_config}.yaml
IQS-specific YAML parameters
instruction_metric:
  class_path: comet.models.InstructionMetric
  init_args:
    nr_frozen_epochs: 0.3
    keep_embeddings_frozen: True
    optimizer: AdamW
    encoder_learning_rate: 1.0e-06
    learning_rate: 1.5e-05
    layerwise_decay: 0.95
    encoder_model: XLM-RoBERTa
    pretrained_model: xlm-roberta-large
    pool: avg
    layer: mix
    layer_transformation: sparsemax
    layer_norm: False
    loss: mse
    dropout: 0.1
    batch_size: 8
    train_data:
      - data/APE_score_train.csv
    validation_data:
      - data/APE_score_valid.csv
    hidden_sizes:
      - 2048
      - 1024
    activations: Tanh

trainer: ../trainer.yaml
early_stopping: ../early_stopping.yaml
model_checkpoint: ../model_checkpoint.yaml
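Once training finishes, the checkpoint can be scored with COMET's loader. The sketch below is illustrative only: it assumes the repo registers InstructionMetric with COMET's checkpoint loader, and the checkpoint path and input keys ("src"/"mt") follow the stock COMET interface; InstructionMetric may expect different fields, so check Ranking/split_IQS.py for the real usage.

# Sketch only: path and input keys are assumptions, not the repo's exact interface.
from comet import load_from_checkpoint

model = load_from_checkpoint("CaR/Ranking/lightning_logs/checkpoints/model.ckpt")
data = [
    {"src": "Give three tips for staying healthy.",
     "mt": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."}
]
scores = model.predict(data, batch_size=8, gpus=1)
print(scores)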
The IQS training data format can be found under /CaR/Ranking/data/expert-revised, and the Comet training data format under /CaR/Ranking/data/expert-revised-comet.
If you find our paper useful, please consider citing:
@article{ge2024clustering,
  title={Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation},
  author={Ge, Yuan and Liu, Yilun and Hu, Chi and Meng, Weibin and Tao, Shimin and Zhao, Xiaofeng and Ma, Hongxia and Zhang, Li and Yang, Hao and Xiao, Tong},
  journal={arXiv preprint arXiv:2402.18191},
  year={2024}
}