Closed Parvez2017 closed 3 years ago
I've been working with an MS student who's been doing French. It works okay just fine-tuning on French data. Korean probably wouldn't work that well, though; you'd need to train a new model from scratch for that or something similar.
mBART might be more up your alley for this.
This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.
Hi @stephenroller :) I want to ask about Turkish. I want to fine-tune on Turkish data; could you suggest which model is most suitable for this purpose? Thanks for your time.
My student and I had success with French using the BlenderBot 90M model, but that was chosen partly for resource-constraint reasons. I would suggest giving BlenderBot 400M or larger a shot, or otherwise try `-m hugging_face/gpt2`. `-m bart` could also work well. We looked into porting mBART but never got around to it; the main thing is that we need someone to add SentencePiece tokenization to ParlAI.
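For reference, a fine-tuning run with one of the suggested models might look like the sketch below. The task name and file paths are placeholders, and flags can differ between ParlAI versions, so check `parlai train_model --help` before copying this:

```shell
# Hedged sketch: fine-tuning BART in ParlAI on your own dialogue data.
# data/turkish_dialogues.txt and the model-file path are assumptions;
# fromfile:parlaiformat expects data in ParlAI's text format.
parlai train_model \
  -m bart \
  -t fromfile:parlaiformat --fromfile-datapath data/turkish_dialogues.txt \
  -mf /tmp/bart_turkish/model \
  --batchsize 8 --fp16 true
```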
Okay, thanks a lot for your response. I got it, and I'm going to try Turkish by adding a Turkish tokenizer :) @stephenroller
@stephenroller I want to use a Hugging Face tokenizer, which is used like this:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
```

Where can I place this? Could you guide me please :) If I just add a `tr_tokenize` function at https://github.com/facebookresearch/ParlAI/blob/d191b8063a8b8500737f73b658188d393cc96211/parlai/core/dict.py#L391-L402 like this, will it work? For example:

```python
def tr_tokenize(self, text):
    self.tr_tokenizer = tr_tokenizer
    return self.tr_tokenizer.tokenize(text)
```
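For context, ParlAI's `DictionaryAgent` resolves the `--dict-tokenizer <name>` option to a method called `<name>_tokenize` on the agent, so adding a method with that naming convention is the right idea. Below is a minimal stand-alone sketch of that dispatch pattern (not actual ParlAI code); the whitespace split is a stand-in for the real Hugging Face tokenizer, which you would load once in `__init__` rather than referencing an undefined global:

```python
# Sketch of DictionaryAgent-style tokenizer dispatch. "tr" and the
# whitespace stand-in are assumptions; in ParlAI you would add a
# tr_tokenize method to parlai/core/dict.py and load the real
# Hugging Face tokenizer in __init__.


class MiniDict:
    def __init__(self, tokenizer_name):
        # ParlAI resolves --dict-tokenizer similarly: look up a method
        # named "<name>_tokenize" on the dictionary agent.
        self.tokenizer_fun = getattr(self, tokenizer_name + "_tokenize")

    def tr_tokenize(self, text):
        # Stand-in for:
        #   from transformers import AutoTokenizer
        #   self.tr_tokenizer = AutoTokenizer.from_pretrained(
        #       "dbmdz/bert-base-turkish-cased")
        #   return self.tr_tokenizer.tokenize(text)
        return text.split()

    def tokenize(self, text):
        return self.tokenizer_fun(text)


d = MiniDict("tr")
print(d.tokenize("merhaba dünya"))  # ['merhaba', 'dünya']
```

The key detail is initializing the tokenizer once up front, since `tr_tokenize` runs on every example.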
Hi @stephenroller, can you suggest the steps needed to train a BlenderBot in other languages such as Vietnamese, Japanese, etc.? Or can you suggest other agents for this task?
Ref, https://github.com/facebookresearch/ParlAI/issues/2830#issuecomment-656860542
I have a Vietnamese dataset and I want to learn about BlenderBot.
Hi again @stephenroller. When I tried to fine-tune the BART model on Turkish Wikipedia data, it gave an error like:

```
RuntimeError: CUDA error: device-side assert triggered
```
That usually happens when you have an out-of-bounds error on one of your embedding inputs. It could be either the position embeddings (your truncate arguments are too long) or the token embeddings (something is wrong with the dict).
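Those two failure modes can be checked offline before the batch ever reaches the GPU. A minimal sketch (the function name and all the concrete numbers here are illustrative, not ParlAI internals):

```python
# Sketch: sanity-check the usual causes of a device-side assert:
# a token id outside the embedding table, or an input longer than
# the number of position embeddings.

def check_batch(token_ids, vocab_size, n_positions):
    """Return a list of human-readable problems found in one input."""
    problems = []
    bad = [t for t in token_ids if t < 0 or t >= vocab_size]
    if bad:
        problems.append(f"out-of-vocab ids {bad} (vocab_size={vocab_size})")
    if len(token_ids) > n_positions:
        problems.append(
            f"sequence length {len(token_ids)} exceeds {n_positions} "
            f"position embeddings; lower the truncate arguments"
        )
    return problems

# One id past the end of a hypothetical 50264-entry vocab:
print(check_batch([5, 12, 50265], vocab_size=50264, n_positions=1024))
# ['out-of-vocab ids [50265] (vocab_size=50264)']
```

If the out-of-vocab case fires after swapping tokenizers, the dictionary and the model embeddings are almost certainly out of sync.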
Okay, I am going to check it again.
Use this to open other questions or issues, and provide context here. Is there any work going on to use BlenderBot for dialogue generation in other languages like Korean and Japanese?
I am trying to follow the BlenderBot approach to train a model on Korean. Can I retrain BlenderBot from scratch? Is there any implementation of BlenderBot in another language?
Please help me out.