UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

How to finetune bert-base-chinese model? #450

Closed ywl0911 closed 3 years ago

ywl0911 commented 3 years ago

Hello, thanks for your great work. I am new here. I have a Chinese similarity dataset similar to SNLI and I want to fine-tune the bert-base-chinese base model on it. How can I implement this? Could someone tell me?

nreimers commented 3 years ago

Have a look here: https://www.sbert.net/docs/training/overview.html

ywl0911 commented 3 years ago

Thank you. I have read this guide and still have some questions.

word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)

If the first parameter is "bert-base-chinese", will it automatically download the base model from huggingface? Since my network speed is slow, I downloaded bert-base-chinese from huggingface manually. There are four files:

bert-base-chinese-config.json bert-base-chinese-modelcard.json bert-base-chinese-vocab.txt bert-base-chinese-pytorch_model.bin

How can I load these files in my code if I want to fine-tune with bert-base-chinese?

nreimers commented 3 years ago

Like this:

word_embedding_model = models.Transformer('path/to/folder/with/your/files', max_seq_length=256)

Note that the huggingface AutoModel must be able to load these files when the from_pretrained() method is called. I think it requires different file names, such as config.json and pytorch_model.bin.
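
For illustration (the local folder path here is just a placeholder), once the files use the expected names, a plain transformers load should work:

# Assumed folder layout after renaming (hypothetical path):
#   path/to/folder/
#       config.json
#       vocab.txt
#       pytorch_model.bin
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('path/to/folder')  # reads vocab.txt (plus any tokenizer configs)
model = AutoModel.from_pretrained('path/to/folder')          # reads config.json + pytorch_model.bin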

ywl0911 commented 3 years ago

@nreimers Thanks for your fast replies.

Following your instruction, I renamed the bert-base-chinese files as follows:

config.json vocab.txt pytorch_model.bin

then put them in the model folder and loaded the model in my code as follows:

from sentence_transformers import SentenceTransformer, models, SentencesDataset, InputExample, losses
from torch.utils.data import DataLoader
from sentence_transformers import evaluation

word_embedding_model = models.Transformer('./model/bert-base-chinese', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_examples = [InputExample(texts=['你好', '你好啊'], label=0.8),
                  InputExample(texts=['昨天', '今天'], label=0.3)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)

train_loss = losses.CosineSimilarityLoss(model)

sentences1 = ['This list contains the first column', 'With your sentences', 'You want your model to evaluate on']
sentences2 = ['Sentences contains the other column', 'The evaluator matches sentences1[i] with sentences2[i]', 'Compute the cosine similarity and compares it to scores[i]']
scores = [0.3, 0.6, 0.2]
evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100, evaluator=evaluator, evaluation_steps=500)

I found the log says:

Didn't find file ./model/bert-base-chinese/added_tokens.json. We won't load it.
Didn't find file ./model/bert-base-chinese/special_tokens_map.json. We won't load it.
Didn't find file ./model/bert-base-chinese/tokenizer_config.json. We won't load it.
Didn't find file ./model/bert-base-chinese/tokenizer.json. We won't load it.
loading file ./model/bert-base-chinese/vocab.txt
loading file None
loading file None
loading file None
loading file None

My OS is CentOS 7.2. Could you tell me whether this log indicates that the bert-base-chinese model is loaded properly or not?
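
As a quick sanity check (continuing the snippet above, not something required by the docs), one can encode a sentence and look at the embedding shape; if the weights loaded, bert-base-chinese should yield 768-dimensional vectors:

# Continues the code above; checks that the local weights produce embeddings of the expected size
emb = model.encode(['今天天气不错'])
print(emb.shape)  # expected: (1, 768)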

Thanks a lot.

nreimers commented 3 years ago

You should also add the needed tokenizer config to that folder.

ywl0911 commented 3 years ago

@nreimers I am sorry, but the bert-base-chinese base model downloaded from https://huggingface.co/bert-base-chinese only contains the following five files:

bert-base-chinese-config.json bert-base-chinese-modelcard.json bert-base-chinese-vocab.txt bert-base-chinese-pytorch_model.bin bert-base-chinese-tf_model.h5

Where should I get the remaining needed files, such as added_tokens.json, special_tokens_map.json, tokenizer_config.json, and tokenizer.json?

My aim is to use my Chinese similarity dataset to fine-tune bert-base-chinese. Could you tell me how to load the bert-base-chinese model from files? Thanks a lot.

nreimers commented 3 years ago

The following code should store the necessary files for you:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

tokenizer.save_pretrained('models/bert-base-chinese')
model.save_pretrained('models/bert-base-chinese')

Then you can load it with

sentence_transformers.models.Transformer('models/bert-base-chinese')

ywl0911 commented 3 years ago

Thanks for your reply. Following your suggestion, I got 5 files through the transformers package, as follows:

config.json vocab.txt pytorch_model.bin special_tokens_map.json tokenizer_config.json

and the log still says:

Didn't find file ./model/bert-base-chinese/added_tokens.json. We won't load it.
Didn't find file ./model/bert-base-chinese/tokenizer.json. We won't load it.
loading file ./model/bert-base-chinese/vocab.txt
loading file None
loading file ./model/bert-base-chinese/special_tokens_map.json
loading file ./model/bert-base-chinese/tokenizer_config.json
loading file None

This log means the added_tokens.json and tokenizer.json are still missing.

I downloaded bert-base-nli-stsb-mean-tokens.zip from https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/, and I also didn't find the two files in the unzipped bert-base-nli-stsb-mean-tokens.zip.

So can I assume that the two missing files are unnecessary for loading the model from files?

It seems that we only need to put the three files config.json, vocab.txt, and pytorch_model.bin in the folder, and then both transformers.AutoTokenizer.from_pretrained("models/bert-base-chinese") and sentence_transformers.models.Transformer('models/bert-base-chinese') can load the model properly?
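
As a rough check (paths and sentences here are only illustrative), both loaders can be exercised directly:

# Illustrative check that the three-file folder is enough for both loaders
from transformers import AutoTokenizer
from sentence_transformers import models

tokenizer = AutoTokenizer.from_pretrained('models/bert-base-chinese')
print(tokenizer.tokenize('今天天气不错'))  # bert-base-chinese splits Chinese text roughly per character

word_embedding_model = models.Transformer('models/bert-base-chinese')
print(word_embedding_model.get_word_embedding_dimension())  # expected: 768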

nreimers commented 3 years ago

I think tokenizer.json is needed; you should get it when you store the AutoTokenizer to disk. But I am not sure.
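
If it does turn out to be required, one way to obtain it (a version-dependent sketch; tokenizer.json is only written by the fast, Rust-backed tokenizers) would be:

# Sketch: request a fast tokenizer explicitly; recent transformers versions
# write tokenizer.json when a fast tokenizer is saved
from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese', use_fast=True)
fast_tokenizer.save_pretrained('models/bert-base-chinese')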

ywl0911 commented 3 years ago

Hello nreimers~

I took a look at the official implementation of the tokenizer.save_pretrained() function here.

This function saves four files: tokenizer_config.json, special_tokens_map.json, vocab.txt, and added_tokens.json (if it exists).

ywl0911 commented 3 years ago

Hi nreimers. Thanks for your explanation, I will close this issue.

llllly26 commented 3 years ago

I guess you can use SentenceTransformer('bert-base-chinese') directly, because the docs for the model_name_or_path parameter of SentenceTransformer() say: "If that fails, tries to construct a model from Huggingface models repository with that name." I have tried it and got results without any error, so I think we can use it directly, but I am not sure.
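
A minimal sketch of that suggestion (the model name is passed straight through; sentence-transformers should then wrap the transformer with a default mean-pooling layer):

from sentence_transformers import SentenceTransformer

# Loads bert-base-chinese from the huggingface hub and adds default mean pooling
model = SentenceTransformer('bert-base-chinese')
embeddings = model.encode(['你好', '今天天气不错'])
print(embeddings.shape)  # expected: (2, 768)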