UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Fine Tuning for Non-English: Dataset for Clustering Task #38

Closed zarmeen92 closed 4 years ago

zarmeen92 commented 5 years ago

Hi, I have a few questions related to the models:

  1. Is bert-base-nli-mean-tokens trained on an English-only dataset? I have used this model to get embeddings for Urdu-language sentences. It does produce embeddings, but they are of low quality.

  2. I want to train a sentence transformer for the Urdu language. The intended task is clustering. What type of dataset do you suggest for fine-tuning if I train my model using the multilingual BERT model?

nreimers commented 5 years ago

The existing models use English BERT, so for any other language they will not produce anything meaningful.

For Urdu you would need a suitable dataset in Urdu. Sadly, I don't know which datasets exist for Urdu. But maybe you can construct a triplet dataset from Wikipedia sections?

zarmeen92 commented 5 years ago

Thanks @nreimers, can you please elaborate on what you mean by a triplet dataset?

nreimers commented 5 years ago

Hi @zarmeen92 See section 4.4 in this paper: https://arxiv.org/abs/1908.10084

Also see the paper by Dor et al. (2018) on how they generate triplets from Wikipedia.
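
For illustration, a minimal sketch of how such triplets could be built, assuming you already have articles split into sections of sentences (the data structure and function name here are hypothetical, not part of this repository):

```python
# Illustrative sketch in the style of Dor et al. (2018): two sentences from
# the same section form an (anchor, positive) pair; a sentence from another
# section of the same article serves as the negative.
# `articles` is a hypothetical dict: {article_title: {section_title: [sentence, ...]}}
import random

def build_triplets(articles, num_triplets=10000):
    triplets = []
    titles = list(articles)
    while len(triplets) < num_triplets:
        title = random.choice(titles)
        # Only consider sections with at least two sentences
        sections = [s for s in articles[title].values() if len(s) >= 2]
        if len(sections) < 2:
            continue
        pos_section, neg_section = random.sample(sections, 2)
        anchor, positive = random.sample(pos_section, 2)
        negative = random.choice(neg_section)
        triplets.append((anchor, positive, negative))
    return triplets
```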

General information about triplet loss: https://towardsdatascience.com/siamese-network-triplet-loss-b4ca82c1aec8
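
A minimal training sketch with the triplet loss from this repository, modeled on the training_*.py examples (exact class names and signatures may differ between versions; `triplets` is the hypothetical list from the sketch above):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, losses, models
from sentence_transformers.readers import InputExample

# Multilingual BERT with mean pooling, as discussed above
word_embedding_model = models.BERT('bert-base-multilingual-cased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# `triplets`: list of (anchor, positive, negative) sentence strings
train_examples = [InputExample(guid=str(i), texts=[a, p, n], label=0)
                  for i, (a, p, n) in enumerate(triplets)]
train_data = SentencesDataset(train_examples, model=model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```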

Best Nils Reimers

RusBulgakov commented 5 years ago

Hi, @nreimers !

I want to load a pretrained model from the BERT main page, but there is no modules.json (and no .bin file).

What should I do to load the default BERT models?

Best Ruslan Bulgakov

nreimers commented 5 years ago

Hi @RusBulgakov You can load the models from HuggingFace pytorch-transformers (v1.1.0): https://github.com/huggingface/transformers

You specify the model name (from HuggingFace), and it will be downloaded and stored in a local cache:

```python
word_embedding_model = models.BERT('bert-base-multilingual-cased')
```

You can find the list of available models here: https://huggingface.co/transformers/pretrained_models.html

RusBulgakov commented 5 years ago


Oh, thank you! I will try that.

What about the DeepPavlov project's pre-trained models? Is there a way to load those?

Best Ruslan

nreimers commented 5 years ago

Hi @RusBulgakov Sadly, I don't know the mentioned repository.

You can load any BERT model that was created with the HuggingFace transformers repository.

In the HuggingFace repository you will also find a tutorial and scripts to convert BERT TensorFlow models to PyTorch. So if the DeepPavlov models were created in TensorFlow, you must first convert them to PyTorch; then you can use them with this repository.
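
For reference, a sketch of that conversion using helpers shipped with the transformers package (paths are placeholders, TensorFlow must be installed, and the conversion script in the transformers repository is the authoritative version):

```python
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholder paths to a TensorFlow BERT checkpoint (e.g. from DeepPavlov)
config = BertConfig.from_json_file('tf_model/bert_config.json')
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, 'tf_model/bert_model.ckpt')

# Save in the PyTorch format expected by HuggingFace / this repository
torch.save(model.state_dict(), 'converted/pytorch_model.bin')
```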

Best regards Nils Reimers

jacarrasco commented 4 years ago

Hi,

The multilingual model has some issues when it is imported.

  1. It does not have a file named "modules.json", so I had to change it manually.
  2. After changing the name of the file I got the following error:

```
File "/home/user/.virtualenvs/news_linker/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 74, in __init__
    module_class = import_from_string(module_config['type'])
TypeError: string indices must be integers
```

Has anybody successfully integrated it?

Thank you

nreimers commented 4 years ago

Hi @jacarrasco With

```python
embedder = SentenceTransformer('bert-base-multilingual-cased')
```

you can only load models that have been created with SentenceTransformer.

To load a model from HuggingFace Transformers, you have to use the following code:

```python
# Use BERT for mapping tokens to embeddings
word_embedding_model = models.BERT('bert-base-multilingual-cased')

# Apply mean pooling to get one fixed-size sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```

Now you can use the multilingual BERT model.
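
Since the original question in this thread was about clustering, here is a short example of feeding the resulting embeddings into scikit-learn's KMeans (the sentences and cluster count are placeholders):

```python
from sklearn.cluster import KMeans

# Placeholder corpus; in practice, e.g., your Urdu sentences
sentences = ['A sentence about sports.',
             'Another sentence about sports.',
             'A sentence about politics.']
embeddings = model.encode(sentences)  # `model` built as above

kmeans = KMeans(n_clusters=2, random_state=0)
print(kmeans.fit_predict(embeddings))
```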

But note that BERT out-of-the-box usually does not yield the best sentence embeddings. Instead, it is recommended to fine-tune the model for your task.

For how to fine-tune models, see the training_*.py scripts in the examples/ folder.

Best Nils Reimers

nreimers commented 4 years ago

I just released a model for multiple languages: https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/multilingual-models.md
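
A quick usage sketch, assuming the model name listed on that page at the time (check the linked page for the current list of multilingual models):

```python
from sentence_transformers import SentenceTransformer

# Model name as listed on the multilingual-models page at the time
model = SentenceTransformer('distiluse-base-multilingual-cased')
embeddings = model.encode(['This is an English sentence.',
                           'Das ist ein deutscher Satz.'])
```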