UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.21k stars 2.47k forks

TSDAE for Indonesian Language #1558

Open 99sbr opened 2 years ago

99sbr commented 2 years ago

Hello,

I want to understand what changes I would have to make in order to pretrain TSDAE on Bahasa Indonesia. Will the tokenization and token-deletion noise work for other languages, even if I use a multilingual backbone?

ScottishFold007 commented 2 years ago

You can change the tokenizer (word splitter) to one that suits your language; if the language is not delimited by words, you may need to split by characters instead.

sbrvrm99-zz commented 2 years ago

I am training the model for 20 epochs on 10 million sentences, with weight decay 0.01 and a 3e-5 learning rate. Is this acceptable, given that I am training on Bahasa Indonesia with IndoBERT as the backbone? Can you suggest any improvements?

Also, how do I know when my model converges? Can I print the loss?

nreimers commented 2 years ago

Sounds good. If you modify the fit function, you can also print the loss.

99sbr commented 2 years ago

I have not changed the fit function; those parameters are arguments to the fit function. What changes do I need to make in order to print the loss?
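One way to get the loss without editing `fit()` itself is to wrap the loss module so every forward call logs its value, then pass the wrapper to `fit()` in place of the original loss. This is a sketch, not an official sentence-transformers API; the class name `LossLogger` and the `log_every` parameter are my own.

```python
import torch
from torch import nn

class LossLogger(nn.Module):
    """Hypothetical wrapper that prints the training loss every
    `log_every` forward calls, without modifying fit() itself."""

    def __init__(self, loss_module: nn.Module, log_every: int = 100):
        super().__init__()
        self.loss_module = loss_module
        self.log_every = log_every
        self.step = 0

    def forward(self, *args, **kwargs):
        loss = self.loss_module(*args, **kwargs)
        self.step += 1
        if self.step % self.log_every == 0:
            print(f"step {self.step}: loss = {loss.item():.4f}")
        return loss

# Usage sketch (assumes the usual TSDAE setup from the docs):
# train_loss = LossLogger(losses.DenoisingAutoEncoderLoss(model, ...), log_every=500)
# model.fit(train_objectives=[(train_dataloader, train_loss)], ...)
```

Watching the smoothed loss flatten out over time is then a practical convergence signal.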