UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Custom Vocabulary #36

Closed mustfkeskin closed 4 years ago

mustfkeskin commented 4 years ago

Hello, my problem is based on product title similarity. The general vocabulary doesn't fit my case. How can I change the vocabulary?

Thanks

nreimers commented 4 years ago

Which model do you use?

BERT uses a fixed tokenizer; changing the vocab there is not really possible (it would require re-training the complete BERT model, which would take ages and has enormous GPU requirements).

But usually I think there is not really a need to create a new BERT model with a custom vocab. Maybe just fine-tune BERT on your data first before using it for sentence embeddings.

Otherwise, with the average word embedding models, you can just use any word embeddings that you like.
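(A minimal sketch of that fine-tuning step, assuming the Hugging Face transformers Trainer with masked language modeling; the checkpoint name, output directory and titles.txt file are placeholders, not something prescribed in this thread:)

from transformers import (BertForMaskedLM, BertTokenizerFast, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling,
                          LineByLineTextDataset)

# Start from a public BERT checkpoint; the tokenizer (and its vocab) stays fixed.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One product title per line in titles.txt (hypothetical file name).
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="titles.txt", block_size=64)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-adapted", num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()

# Save so the checkpoint can later be loaded as a sentence-transformers module.
trainer.save_model("bert-domain-adapted")
tokenizer.save_pretrained("bert-domain-adapted")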

nicolasesprit commented 4 years ago

Interesting reading from the BERT repo: "Don't expect crazy performance boosts by adjusting the vocab. We so far got 2-3% out of it. However, the whole fine-tuning on a domain corpus can have quite some good impact."

mustfkeskin commented 4 years ago

For example, I give 2 e-commerce titles and want to get a similarity score. I think my case is not suitable for general natural language modelling:

1) MSI Gaming GL73 9SD-216 Black Notebook 43.9 cm (17.3") 1920 x 1080 Pixels 2.6 GHz 9th Gen Intel Core i7 i7-9750H

2) MSI GP73 Leopard 8RF 17.3-Inch FHD Laptop - (Black) (Intel i7-8750H Processor, 16 GB RAM, 256 GB SSD, 1 TB HDD, GeForce GTX1070, 6GB GDDR5 Graphics, Windows 10 Home)

I found this repo for building a custom vocabulary --> bert-vocab-builder. How can I change vocab.txt for the examples/train-nli-bert.py case?

nreimers commented 4 years ago

As mentioned, when you train BERT from scratch, you would need gigabytes of these titles, and it would require weeks or months on a modern multi-GPU cluster until you reach good quality. Not sure if that is practical.

You can try to fine-tune an existing BERT model on your data.

If this does not work, I would rather train a traditional word2vec model on your data instead of trying to train BERT from scratch.
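(A minimal sketch of that word2vec alternative with gensim 4.x; the titles list below is a hypothetical stand-in for the real product-title corpus:)

from gensim.models import Word2Vec

# Hypothetical data: each title lower-cased and split into tokens.
titles = [
    "msi gaming gl73 9sd-216 black notebook 17.3 intel core i7-9750h".split(),
    "msi gp73 leopard 8rf 17.3-inch fhd laptop intel i7-8750h 16 gb ram".split(),
]

# Train word vectors directly on the domain corpus.
w2v = Word2Vec(sentences=titles, vector_size=300, window=5, min_count=1, workers=4)

# Export in plain word2vec text format so it can be reused elsewhere.
w2v.wv.save_word2vec_format("titles_word2vec.txt")

The exported text file could then serve as the word embeddings source for the average word embedding setup mentioned earlier in the thread.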

mustfkeskin commented 4 years ago

Thanks to everyone, I'm trying to fine-tune BERT on STS.
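(For reference, a minimal sketch of what such STS-style fine-tuning looks like with the sentence-transformers training API from around that time; the model name, pairs and scores are hypothetical placeholders:)

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-nli-mean-tokens")

# Hypothetical labelled pairs: similarity scores normalised to [0, 1].
train_examples = [
    InputExample(texts=["MSI Gaming GL73 9SD-216 Notebook", "MSI GP73 Leopard 8RF Laptop"], label=0.8),
    InputExample(texts=["MSI Gaming GL73 9SD-216 Notebook", "Logitech MX Master 3 Mouse"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Regression on the cosine similarity of the two sentence embeddings.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("bert-sts-titles")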

shaktisd commented 4 years ago

Hi @nreimers, as per your recommendation above, I fine-tuned my BERT model on my domain-specific text using BERT's run_pretraining.py script. How do I now use the fine-tuned model in sentence-transformers? Could you please help me with the few lines of code required to load a custom fine-tuned BERT model in sentence-transformers?

nreimers commented 4 years ago

Hi @shaktisd, loading your model is quite easy:

from sentence_transformers import SentenceTransformer, models

# Load the fine-tuned BERT checkpoint saved with the Hugging Face save functions
word_embedding_model = models.BERT('path/to/your/model/stored/using/huggingface/functions')

# Mean pooling over the token embeddings gives a fixed-size sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
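(As a usage sketch that is not part of the original reply: with the model loaded, scoring two titles could look like this, using the cosine similarity helper that ships with the library.)

from sentence_transformers import util

titles = [
    'MSI Gaming GL73 9SD-216 Black Notebook 17.3", Intel Core i7-9750H',
    'MSI GP73 Leopard 8RF 17.3-Inch FHD Laptop, Intel i7-8750H, 16 GB RAM',
]
embeddings = model.encode(titles, convert_to_tensor=True)

# Cosine similarity between the two title embeddings.
score = util.pytorch_cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {score.item():.3f}")
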
bjayaraman29 commented 1 year ago

@mustfkeskin, did you find any solution? I am also in the same scenario, where I need to find a similarity score between 2 sentences, but my data doesn't work with a general pre-trained model.

mustfkeskin commented 1 year ago

I approached this problem in a different way, without transformers. I couldn't find a solution within this repo, @bjayaraman29.

bjayaraman29 commented 1 year ago

Hi @mustfkeskin, could you please tell me how you proceeded? At least a high-level context would be useful for me.

mustfkeskin commented 1 year ago

You can start with character-based Siamese network training. If that doesn't work, you can move on to more complex methods.
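(Purely as an illustration of that starting point, not the exact setup used: a character-level Siamese encoder in PyTorch might look roughly like this, with all layer sizes and the similarity objective chosen as hypothetical defaults.)

import torch
import torch.nn as nn
import torch.nn.functional as F

class CharSiameseEncoder(nn.Module):
    """Shared character-level encoder applied to both titles of a pair."""

    def __init__(self, vocab_size=128, emb_dim=32, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                      # char_ids: (batch, seq_len)
        emb = self.embedding(char_ids)                # (batch, seq_len, emb_dim)
        _, h = self.encoder(emb)                      # h: (2, batch, hidden_dim)
        return torch.cat([h[0], h[1]], dim=-1)        # (batch, 2 * hidden_dim)

def encode_title(title, max_len=120):
    """Map a title to ASCII character ids, padded/truncated to max_len (0 = pad)."""
    ids = [min(ord(c), 127) for c in title.lower()[:max_len]]
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids)

encoder = CharSiameseEncoder()
a = encode_title("MSI Gaming GL73 9SD-216 Black Notebook").unsqueeze(0)
b = encode_title("MSI GP73 Leopard 8RF 17.3-Inch FHD Laptop").unsqueeze(0)

# Cosine similarity between the two shared-encoder representations;
# during training this score would be pushed towards a 0/1 match label.
score = F.cosine_similarity(encoder(a), encoder(b))
print(score.item())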