The existing models use English BERT, so for any other language they will not produce anything meaningful.
For Urdu you would need a suitable dataset in Urdu. Sadly, I don't know which datasets exist for Urdu, but maybe you can construct a triplet dataset from Wikipedia sections?
Thanks @nreimers, can you please elaborate on what you mean by a triplet dataset?
Hi @zarmeen92 See section 4.4. in this paper: https://arxiv.org/abs/1908.10084
Also see the paper by Dor et al., 2018, on how they generate triplets from Wikipedia.
General information about triplet loss: https://towardsdatascience.com/siamese-network-triplet-loss-b4ca82c1aec8
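In case a concrete sketch helps, the construction could look roughly like this. The section data, names, and sampling strategy below are made up for illustration, and the exact API (models.BERT vs. the newer models.Transformer, or whether a SentencesDataset wrapper is needed around the examples) depends on your sentence-transformers version:
import random
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models
# Assumed input: a dict mapping each Wikipedia section title to its sentences
sections = {
    'section_a': ['sentence a1 ...', 'sentence a2 ...', 'sentence a3 ...'],
    'section_b': ['sentence b1 ...', 'sentence b2 ...'],
}
# Anchor and positive come from the same section, the negative from a different section
triplets = []
titles = list(sections.keys())
for title, sents in sections.items():
    other_titles = [t for t in titles if t != title]
    for anchor, positive in zip(sents, sents[1:]):
        negative = random.choice(sections[random.choice(other_titles)])
        triplets.append(InputExample(texts=[anchor, positive, negative]))
# Multilingual BERT word embeddings + mean pooling as the sentence encoder
word_embedding_model = models.BERT('bert-base-multilingual-cased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Fine-tune with triplet loss
train_dataloader = DataLoader(triplets, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)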
Best Nils Reimers
Hi, @nreimers !
I want to load a pretrained model from the BERT main page, but there is no modules.json.
What should I do to load the default BERT models?
Best Ruslan Bulgakov
Hi @RusBulgakov You can load the model from HuggingFace pytorch-transformers (v1.1.0): https://github.com/huggingface/transformers
You specify the model name (from HuggingFace), and it will be downloaded and stored in the local cache:
word_embedding_model = models.BERT('bert-base-multilingual-cased')
You can find the list of available models here: https://huggingface.co/transformers/pretrained_models.html
Oh, thank you! I will try that.
What about the DeepPavlov project's pre-trained models? Is there a way to load those?
Best Ruslan
Hi @RusBulgakov I sadly don't know the mentioned repository.
You can load any model that was created with HuggingFace BERT (transformers repository).
In the HuggingFace repo you will also find a tutorial and scripts to convert BERT TensorFlow models to PyTorch. So if the DeepPavlov models were created in TensorFlow, you must first convert them to PyTorch; then you can use them with this repository.
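For reference, the conversion is roughly the following (the script name and flags differ slightly between transformers versions, and the paths are placeholders):
python convert_bert_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path /path/to/tf_model.ckpt --bert_config_file /path/to/bert_config.json --pytorch_dump_path /path/to/pytorch_model.bin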
Best regards Nils Reimers
Hi,
The multilingual model has some issues when it is imported.
Has anybody successfully integrated it?
Thank you
Hi @jacarrasco With
embedder = SentenceTransformer('bert-base-multilingual-cased')
you can only load models that have been created with SentenceTransformer.
To load a model from HuggingFace Transformers, you have to use the following code:
from sentence_transformers import SentenceTransformer, models

# Use BERT for mapping tokens to embeddings
word_embedding_model = models.BERT('bert-base-multilingual-cased')

# Apply mean pooling to get one fixed-sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
Now, you can use the multi-lingual model from BERT.
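For example (the sentences are just placeholders):
sentences = ['This is an example sentence', 'Das ist ein Beispielsatz']
embeddings = model.encode(sentences)
# each sentence is mapped to one fixed-size vector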
But note, usually BERT out-of-the-box does not yield the best sentence embeddings. Instead, it is recommended to fine-tune the model for your task.
For how to fine-tune models, see the training_*.py scripts in the examples/ folder.
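A minimal outline of such a fine-tuning run, following the pattern of those scripts (the pairs and scores below are invented, and older sentence-transformers versions wrap the examples in a SentencesDataset first):
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

train_examples = [
    InputExample(texts=['A man is eating food.', 'A man is eating a meal.'], label=0.9),
    InputExample(texts=['A man is eating food.', 'The girl is carrying a baby.'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)  # regression on similarity scores
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)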
Best Nils Reimers
I just released a model for multiple languages: https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/multilingual-models.md
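Loading and using it follows the usual pattern (check the linked page for the current model names; 'distiluse-base-multilingual-cased' is one of them at the time of writing):
embedder = SentenceTransformer('distiluse-base-multilingual-cased')
embeddings = embedder.encode(['This is an example sentence', 'Das ist ein Beispielsatz'])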
Hi, I have a few questions related to the models.
Is bert-base-nli-mean-tokens trained on an English-only dataset? I have used this model to get embeddings for Urdu-language sentences. It does produce sentence embeddings, but they are of low quality.
I want to train a sentence transformer for the Urdu language. The intended task is clustering. Which type of dataset would you suggest for fine-tuning, if I train my model on the multilingual BERT model?