UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How can I save a fine-tuned cross-encoder to HF and then download it from HF? #2499

Open satyrmipt opened 7 months ago

satyrmipt commented 7 months ago

I'm looking for a way to share a fine-tuned cross-encoder with my teacher. The CrossEncoder class does not have a native push_to_hub() method, so I decided to use the general transformers approach:

from transformers import AutoModelForSequenceClassification
import torch

# read from disk, model was saved as ft_model.save("model/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")
cross_ft_model = AutoModelForSequenceClassification.from_pretrained("model\\crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")
# push to hub
cross_ft_model.push_to_hub("satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")

Now the model is available on HF. The commit info looked like this: CommitInfo(commit_url='https://huggingface.co/satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2/commit/d81fe317cb037940e09db256d8a0926e80c358e5', commit_message='Upload BertForSequenceClassification', commit_description='', oid='d81fe317cb037940e09db256d8a0926e80c358e5', pr_url=None, pr_revision=None, pr_num=None)

Then I decided to check that the model works:

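from sentence_transformers import CrossEncoder

# load the uploaded model from the Hub and score one sentence pair
cross_ft_model = CrossEncoder("satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")
cross_ft_model.predict([('SentenceTransformer is well-documented library','but saving crossencoder to HF is a bit tricky')])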

and got this error:

Traceback (most recent call last):

Cell In[18], line 1
    cross_ft_model = CrossEncoder("satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")

File ~\anaconda3\Lib\site-packages\sentence_transformers\cross_encoder\CrossEncoder.py:72 in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name, **tokenizer_args)

File ~\anaconda3\Lib\site-packages\transformers\models\auto\tokenization_auto.py:745 in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)

File ~\anaconda3\Lib\site-packages\transformers\tokenization_utils_base.py:1838 in from_pretrained
    raise EnvironmentError(

OSError: Can't load tokenizer for 'satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2' is the correct path to a directory containing all relevant files for a BertTokenizerFast tokenizer.

I compared the local model folder with the files uploaded to HF, and the uploaded ones don't include the tokenizer files. The uploaded model doesn't work on HF either. How can I correctly upload the model together with its tokenizer to HF and then use it from HF like model = CrossEncoder(path_to_hf)?
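The folder comparison described above can also be scripted. This is a minimal sketch, not part of the original report: the helper name `files_missing_on_hub` is made up for illustration, and it assumes the remote file list is fetched separately, for example with `huggingface_hub.list_repo_files(repo_id)`.

```python
from pathlib import Path


def files_missing_on_hub(local_dir, remote_files):
    """Return the names of files in local_dir that are absent from remote_files."""
    # Only top-level files are considered; a saved model folder is normally flat.
    local = {p.name for p in Path(local_dir).iterdir() if p.is_file()}
    return sorted(local - set(remote_files))


# Hypothetical usage (needs network access and the huggingface_hub package):
#   from huggingface_hub import list_repo_files
#   remote = list_repo_files("satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")
#   print(files_missing_on_hub("model/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2", remote))
```

If the tokenizer was never pushed, the result would be expected to list the tokenizer files from the local folder.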

tomaarsen commented 7 months ago

Hello!

Indeed, the CrossEncoder is currently missing a push_to_hub feature, apologies for that. This is how you can push your model:


from sentence_transformers import CrossEncoder

model = CrossEncoder("the/path/to/my/local/model")
# An example repo_id:
repo_id = "tomaarsen/my_cross_encoder"
model.model.push_to_hub(repo_id)
model.tokenizer.push_to_hub(repo_id)

And then you can load your model like so:

from sentence_transformers import CrossEncoder

repo_id = "tomaarsen/my_cross_encoder"
model = CrossEncoder(repo_id)

I will be adding proper push_to_hub functionality in the future.

satyrmipt commented 7 months ago

I managed to push the tokenizer to HF separately. The full code is a bit long:

from sentence_transformers import CrossEncoder
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# read local model
cross_ft_model = AutoModelForSequenceClassification.from_pretrained("model\\crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")

# push local model to hub (tokenizer would not be uploaded to HF)
cross_ft_model.push_to_hub("satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")

cross_ft_model_tokenizer = AutoTokenizer.from_pretrained("model\\crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")
# push tokenizer separately to hub
cross_ft_model_tokenizer.push_to_hub("satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")

# check if HF model works:
cross_ft_model = CrossEncoder("satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2")
print(cross_ft_model.predict([("Push model","to HuggingFace")]))

tomaarsen commented 7 months ago

That should indeed be equivalent. Looks good!
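Another option, not mentioned in this thread, is to upload the whole saved folder in one go so that no file is left behind. The sketch below assumes the `huggingface_hub` package is installed, that you are logged in (for example via `huggingface-cli login`), and that the local folder contains everything `CrossEncoder(...)` needs. The function name `upload_saved_cross_encoder` and the optional `api` parameter are illustrative choices, not part of any library.

```python
from pathlib import Path


def upload_saved_cross_encoder(local_dir, repo_id, api=None):
    """Upload every file in local_dir (weights, config, tokenizer) to repo_id."""
    folder = Path(local_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"No saved model directory at {folder}")
    if api is None:
        # Imported lazily so the function can also be called with a stand-in client.
        from huggingface_hub import HfApi
        api = HfApi()
    # Create the model repo if it does not exist yet, then upload the folder contents.
    api.create_repo(repo_id, exist_ok=True)
    return api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="model")


# Hypothetical usage:
#   upload_saved_cross_encoder(
#       "model/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2",
#       "satyroffrost/crerankingeval-30e-4000-ms-marco-MiniLM-L-6-v2",
#   )
```

Unlike the two separate push_to_hub calls, this does not reload the model or tokenizer; it simply copies the files that the save step wrote to disk.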

johneckberg commented 7 months ago

@tomaarsen

I'm happy to work on this if no one else has volunteered!

tomaarsen commented 7 months ago

That would be great!

bterrific2008 commented 2 months ago

It looks like this issue was resolved by #2524.

Is there any interest in adding a test case for CrossEncoder.push_to_hub? I don't think it needs to be anything as verbose as the SentenceTransformer.push_to_hub test case, since CrossEncoder.push_to_hub is a wrapper around PushToHubMixin.push_to_hub, which is itself tested in the HF transformers package.
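For scripts that may run against releases from both before and after #2524, a small helper can cover both cases. This is only a sketch: `push_cross_encoder` is a hypothetical name, and it assumes that an object without its own `push_to_hub` exposes `.model` and `.tokenizer` attributes, as in the workaround earlier in this thread.

```python
def push_cross_encoder(model, repo_id):
    """Push a CrossEncoder-like object to the Hub, using whichever API it offers."""
    if hasattr(model, "push_to_hub"):
        # Newer releases: the wrapper exposes push_to_hub itself.
        return model.push_to_hub(repo_id)
    # Older releases: push the underlying model and tokenizer separately.
    model.model.push_to_hub(repo_id)
    return model.tokenizer.push_to_hub(repo_id)


# Hypothetical usage:
#   from sentence_transformers import CrossEncoder
#   push_cross_encoder(CrossEncoder("the/path/to/my/local/model"), "tomaarsen/my_cross_encoder")
```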