McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

Training Script for Sequence-level Tasks #53

Closed louieworth closed 3 months ago

louieworth commented 4 months ago

Thanks for your contribution to LLM2Vec.

  1. Have you released the training code for MTEB (sequence-level tasks)? (I am not sure whether the code for token-level and sequence-level tasks is the same.)

  2. Based on your modifications to decoder-only Transformers, i.e., Llama, can I jointly train this model with supervised fine-tuning (SFT) via maximum likelihood? Or is traditional SFT not possible when training with LLM2Vec, so that it must happen in a separate stage after LLM2Vec training?

2.1 Contrastive training approaches in other domains, e.g., CV, typically first perform unsupervised contrastive learning, then freeze the features and train a supervised linear classifier (a fully connected layer followed by softmax) when the downstream task is classification [1]. Will it be the same in LLM2Vec? (similar to #50)

[1] He, Kaiming, et al. "Momentum Contrast for Unsupervised Visual Representation Learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

vaibhavad commented 4 months ago

Thank you for your interest in our work.

  1. MTEB is an evaluation benchmark, not a training set. We train on sentence pairs from E5 data, as described in detail here. Our training code, configurations, and models are all available.
  2. Our modifications are at the architecture level; hence, like any other Hugging Face model, the model still outputs final hidden states for all input tokens. These final hidden states can be used for SFT in the same way as with decoder-only Transformers.

2.1 That is exactly how MTEB evaluates classification tasks. You can see an example in our repo, and also a related discussion about whether it is better to train just the classifier or the entire model (#28).
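For reference, here is a minimal sketch of that frozen-embeddings + linear-classifier setup: encode texts with LLM2Vec, then fit a classifier head on top. The checkpoint names follow our README examples, and the scikit-learn logistic regression head is just an illustration, not the exact MTEB evaluation code.

```python
# Sketch: frozen LLM2Vec embeddings + a linear classifier on top.
# Checkpoint names and the sklearn head are illustrative choices.
import torch
from sklearn.linear_model import LogisticRegression

from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

train_texts = ["the movie was great", "terrible service", "loved it", "never again"]
train_labels = [1, 0, 1, 0]

# Embeddings are computed once and kept frozen; only the classifier is trained.
train_embeddings = l2v.encode(train_texts).float().cpu().numpy()

clf = LogisticRegression(max_iter=1000)
clf.fit(train_embeddings, train_labels)

test_embeddings = l2v.encode(["what a fantastic experience"]).float().cpu().numpy()
print(clf.predict(test_embeddings))
```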

Let me know if you have any more questions.

louieworth commented 4 months ago

Thanks @vaibhavad

  1. Does LLM2Vec support other models, e.g., Pythia? I have checked, and it seems to support only the Llama and Mistral families.

  2. Just to confirm: if I want to train the model on my own dataset, the correct training script is experiments/run_supervised.py, and I only need to adjust "dataset_name": "E5"?

vaibhavad commented 4 months ago
  1. Currently not; we plan to extend support to many more models in the future. If you are interested in contributing, feel free to open a PR, and I'll be happy to assist you.

  2. You'll need to add another file in the datasets folder, which specifies the loading/saving/instruction logic, etc.; a rough sketch is below. If you are working on a fork, feel free to tag me and I'll be able to assist better there.
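For illustration only, here is a hypothetical sketch of what such a dataset file could look like, loosely modeled on the existing E5 loader. The import path, class names, constructor arguments, and field names are assumptions, so please check the actual interfaces under llm2vec/dataset before reusing any of this.

```python
# Hypothetical sketch of a custom dataset file for supervised training.
# The import path, class names, and field names below are assumptions loosely
# modeled on the repo's E5 loader; check llm2vec/dataset for the real interfaces.
import json

from llm2vec.dataset.dataset import Dataset, DataSample, TrainSample  # assumed path


class MyCustomData(Dataset):
    def __init__(self, file_path: str = "cache/my_data.jsonl", **kwargs):
        self.data = []
        self.load_data(file_path)

    def load_data(self, file_path: str):
        # Assumes one JSON object per line with "instruction", "query",
        # "positive", and (optionally) "negative" fields.
        with open(file_path) as f:
            for idx, line in enumerate(f):
                example = json.loads(line)
                self.data.append(
                    DataSample(
                        id_=idx,
                        query=f"{example['instruction']}; {example['query']}",
                        positive=example["positive"],
                        negative=example.get("negative", ""),
                    )
                )

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        # Supervised contrastive training expects (query, positive, negative) triplets.
        return TrainSample(
            texts=[sample.query, sample.positive, sample.negative], label=1.0
        )
```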

SGidentification commented 3 months ago

@louieworth, the citation for [1] is not included. Could you provide the reference for it? I am interested in reading the mentioned paper.

louieworth commented 3 months ago

I've updated the reference. Please check it.