UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Integration with `huggingface` trainer, or even direct `pytorch` training #2627

Closed: ydennisy closed this 4 months ago

ydennisy commented 5 months ago

Hello @tomaarsen

Firstly, sorry to ping you directly, but also a big thank you for your work, and that of the other contributors, on this project!

This is not the first time this has been asked, but I wanted to bring it back to your attention.

I feel sentence-transformers is an excellent library for inference and quick prototyping when you need embeddings, but as soon as any fine-tuning or model changes are needed, the API feels clunky, mainly because it is non-standard compared to more established tooling. So, in short, I would ideally like to be able to use the HF Trainer, and also a direct PyTorch training loop, to fine-tune and analyse models.

Is there any reason this is not something you feel would be very valuable?

Happy to elaborate on the reasons, but this is mainly to do with tracking metrics such as loss in tooling, for example W&B.

Thanks in advance! D

tomaarsen commented 5 months ago

Hello!

I have a great surprise for you: a v3 pre-release implementing essentially your proposed plan is already ready, just waiting on additional documentation before it's released. This will be the general training loop:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# 1. Load a model to finetune
model = SentenceTransformer("microsoft/mpnet-base")

# 2. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/all-nli", "pair")
train_dataset = dataset["train"]
eval_dataset = dataset["dev"]

# 3. Define a loss function
loss = MultipleNegativesRankingLoss(model)

# 4. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

# 5. Save the trained model
model.save("models/mpnet-base-all-nli")
```
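Afterwards, the finetuned model loads back like any other Sentence Transformers model. A minimal usage sketch, reusing the save path from step 5:

```python
from sentence_transformers import SentenceTransformer

# Load the finetuned model from the local save path used above
model = SentenceTransformer("models/mpnet-base-all-nli")

# Encode a few sentences and inspect the embedding shape
embeddings = model.encode(["A man is eating food.", "A man is riding a horse."])
print(embeddings.shape)  # (2, 768) for an mpnet-base backbone
```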

The new SentenceTransformerTrainer subclasses the HF Trainer, so training should feel very familiar if you know how that Trainer works; see #2449 for more info on the new training loop. So, yes, this new Trainer has direct integrations with W&B and TensorBoard. It also introduces training & evaluation loss logging, which has been missing.
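To give a rough idea of how those integrations hook in, here is a sketch continuing from the snippet above. It uses SentenceTransformerTrainingArguments, which mirror `transformers.TrainingArguments`; the hyperparameter values are illustrative, and exact argument names may vary slightly with your installed `transformers` version:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative values; any transformers.TrainingArguments option also works here
args = SentenceTransformerTrainingArguments(
    output_dir="models/mpnet-base-all-nli",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    eval_strategy="steps",  # compute & log an evaluation loss during training
    eval_steps=500,
    logging_steps=100,      # training loss is logged at this interval
    report_to="wandb",      # or "tensorboard", "none", ...
    run_name="mpnet-base-all-nli",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
```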

Additionally, this message has 3 advanced training scripts and this message has 2 more. Also, #2622 has a bunch more training scripts.

Here are some example models produced by these training scripts:


As for the

> a direct pytorch training loop to fine tune and analyse models.

I think I will leave this to the "advanced users", as some people prefer to train "their way". That will continue to be possible, albeit perhaps with some hacks. There are some challenges with the current API of a SentenceTransformer object that I can't change without pretty major repercussions in third-party applications that rely on Sentence Transformers.
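For anyone who does want to roll their own loop today, here is a rough sketch of what that can look like. This is my own assumption of a workable approach rather than an official recipe; it relies on `model.tokenize` and on the loss modules accepting tokenized features directly:

```python
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")
loss_fn = MultipleNegativesRankingLoss(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy (anchor, positive) pairs; in practice, iterate over a DataLoader
batch = [
    ("A man is eating food.", "A man eats something."),
    ("A woman is playing violin.", "A woman plays an instrument."),
]
anchors, positives = zip(*batch)

model.train()
# Tokenize each text column and move the tensors to the model's device
features = [
    {key: value.to(model.device) for key, value in model.tokenize(list(column)).items()}
    for column in (anchors, positives)
]
loss = loss_fn(features, labels=None)  # MNRL derives its in-batch labels itself
loss.backward()
optimizer.step()
optimizer.zero_grad()
```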

harry7171 commented 5 months ago

Hi @tomaarsen. This is great; I was going to raise an issue for the same thing.

Can you let me know when the version with SentenceTransformerTrainer will be released? I am planning to use it in a current ongoing workstream.

Thanks in advance

tomaarsen commented 5 months ago

Hello!

The current goal is to release in around 1.5-2 weeks. All that remains for now is some bugfixing & (re)writing documentation. The v3.0-pre-release branch already closely resembles what will eventually be released.

ydennisy commented 4 months ago

@tomaarsen I see the release has happened. This is amazing news, and a huge thank you to you and anyone else involved!

I think this issue can be closed :)

tomaarsen commented 4 months ago

Gladly! I hope you enjoy working with it 🤗