Open 99sbr opened 2 years ago
You can change the tokenizer (word splitter) to one that suits your language; otherwise the text may be split not into words but into individual characters.
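To make this concrete: TSDAE's noise step deletes a fraction of tokens from each sentence, and the tokenizer used for that step is what may need swapping for a new language. Below is a minimal, self-contained sketch of such a deletion-noise function with a pluggable tokenizer; the function name, the 0.6 deletion ratio, and the whitespace default are illustrative assumptions, not the library's exact implementation:

```python
import random

def delete_noise(text, del_ratio=0.6, tokenize=str.split,
                 detokenize=" ".join, rng=None):
    """Randomly delete a fraction of tokens; always keep at least one.

    `tokenize` / `detokenize` are pluggable so a language-specific word
    splitter (e.g. one suited to Indonesian) can replace the default
    whitespace split, avoiding character-level cuts.
    """
    rng = rng or random
    tokens = tokenize(text)
    if not tokens:
        return text
    kept = [t for t in tokens if rng.random() > del_ratio]
    if not kept:  # never emit an empty "noisy" sentence
        kept = [rng.choice(tokens)]
    return detokenize(kept)

# Seeded RNG so the noise is reproducible for inspection
rng = random.Random(0)
print(delete_noise("saya sedang melatih model bahasa", rng=rng))
```

A custom `tokenize` for Indonesian (or any whitespace-delimited language) can simply be a better word splitter; for languages without spaces you would pass a segmenter instead.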
I am training the model for 20 epochs on 10 million sentences, with weight decay 0.01 and a learning rate of 3e-5. Is this acceptable given that I am training on Bahasa Indonesia with IndoBERT as the backbone? Any improvements you can suggest here?
Also, how do I know when my model has converged? Can I print the loss?
Sounds good. When you change the fit function, you can also print the loss
I have not changed the fit function; these parameters are arguments of the fit function. What changes do I need to make in order to print the loss?
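One workaround that avoids editing `fit()` itself is to wrap the loss object so every forward call records its value. This is a hedged, framework-agnostic sketch (the `LoggingLoss` name and interface are my own, not part of sentence-transformers); the same idea would apply to `DenoisingAutoEncoderLoss` by wrapping its forward method:

```python
class LoggingLoss:
    """Wrap any callable loss so each computed value is recorded.

    Hypothetical helper: in sentence-transformers the same pattern
    could wrap DenoisingAutoEncoderLoss; here the wrapped loss is a
    plain callable for illustration.
    """
    def __init__(self, loss_fn, log_every=100):
        self.loss_fn = loss_fn
        self.log_every = log_every
        self.history = []  # every loss value seen so far

    def __call__(self, *args, **kwargs):
        value = self.loss_fn(*args, **kwargs)
        self.history.append(float(value))
        if len(self.history) % self.log_every == 0:
            window = self.history[-self.log_every:]
            print(f"step {len(self.history)}: "
                  f"mean loss {sum(window) / len(window):.4f}")
        return value

# Illustrative usage with a dummy loss function
loss = LoggingLoss(lambda pred, target: abs(pred - target), log_every=2)
loss(3.0, 1.0)   # records 2.0
loss(2.0, 1.5)   # records 0.5 and prints the 2-step mean
```

Watching the logged running mean flatten out over steps is a simple, practical convergence signal when the training API does not expose per-step loss directly.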
Hello,
I want to understand what changes I would have to make in order to pretrain TSDAE on the Bahasa (Indonesian) language. Will the tokenization and deletion of tokens work for other languages even if I use a 'multilingual' backbone?