facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Using BlenderBot for other languages #3294

Closed Parvez2017 closed 3 years ago

Parvez2017 commented 3 years ago

Is there any work going on to use BlenderBot for dialogue generation in other languages, such as Korean and Japanese?

I am trying to apply the BlenderBot training strategy to Korean. Can I retrain BlenderBot from scratch?

Are there any implementation ideas for BlenderBot in other languages?

Please help me out.

stephenroller commented 3 years ago

I've been working with an MS student who's been doing French. It works reasonably well to just fine-tune on French data. Korean probably wouldn't work as well, though; you'd need to train a new model from scratch for that.

mBART might be more up your alley for this.

github-actions[bot] commented 3 years ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

Hilal-Urun commented 3 years ago

Hi @stephenroller :) I want to ask about the Turkish language. I want to fine-tune on Turkish data; could you suggest which model is most suitable for this purpose? Thanks for your time.

stephenroller commented 3 years ago

My student and I had success with French using the BlenderBot 90M model, but that was chosen partly for resource-constraint reasons. I would suggest giving BlenderBot 400M or larger a shot, or otherwise trying `-m hugging_face/gpt2`.
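As a rough sketch, a fine-tuning run starting from the 90M model might look like the command below. This is an assumption based on ParlAI's general `train_model` CLI, not a command given in this thread; the task name `my_turkish_teacher`, the output path, and the hyperparameters are placeholders you would replace with your own.

```shell
# Hypothetical fine-tuning command; task name, paths, and flags are
# placeholders, not values confirmed in this thread.
parlai train_model \
    -m transformer/generator \
    --init-model zoo:blender/blender_90M/model \
    --dict-file zoo:blender/blender_90M/model.dict \
    -t my_turkish_teacher \
    --model-file /tmp/turkish_blender_90M
```

Initializing from the zoo checkpoint only helps to the extent the pretrained BPE dictionary covers the target language; for Turkish that coverage may be poor, which is why the tokenizer discussion below matters.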

stephenroller commented 3 years ago

`-m bart` could also work well. We looked into porting mBART but never got around to it. The main thing is that we'd need someone to add SentencePiece tokenization to ParlAI.

Hilal-Urun commented 3 years ago

Okay, thanks a lot for your response. I got it, and I'm going to try Turkish by adding a Turkish tokenizer :) @stephenroller

Hilal-Urun commented 3 years ago

@stephenroller I want to use a Hugging Face tokenizer, which is used like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
```

Where can I place this? Could you guide me please :) If I just add a `tr_tokenize` function to https://github.com/facebookresearch/ParlAI/blob/d191b8063a8b8500737f73b658188d393cc96211/parlai/core/dict.py#L391-L402 like this, will it work? For example:

```python
def tr_tokenize(self, text):
    self.tr_tokenizer = tr_tokenizer
    return self.tr_tokenizer.tokenize(text)
```
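That is roughly the right shape: ParlAI's `DictionaryAgent` dispatches to a method named `<name>_tokenize` based on the `--dict-tokenizer` flag. A minimal self-contained illustration of the pattern is below; the stub class stands in for the real `transformers.AutoTokenizer` (which the real code would load via `from_pretrained("dbmdz/bert-base-turkish-cased")`), and the tokenizer should be built once in `__init__` rather than on every call.

```python
class StubTurkishTokenizer:
    """Stand-in for transformers.AutoTokenizer (assumption: real code
    would load dbmdz/bert-base-turkish-cased and do WordPiece)."""

    def tokenize(self, text):
        # A whitespace split stands in for real subword tokenization.
        return text.lower().split()


class MyDictionaryAgent:
    """Sketch of the hook: DictionaryAgent-style dispatch expects a
    method named '<tokenizer>_tokenize'."""

    def __init__(self):
        # Build the tokenizer once, not on every tokenize() call.
        self.tr_tokenizer = StubTurkishTokenizer()

    def tr_tokenize(self, text):
        return self.tr_tokenizer.tokenize(text)


agent = MyDictionaryAgent()
print(agent.tr_tokenize("Merhaba dünya"))  # → ['merhaba', 'dünya']
```

Note that swapping the tokenizer also changes the id space, so the dictionary file and the model's embedding table must be rebuilt to match, which is relevant to the CUDA error discussed later in this thread.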

lh0x00 commented 3 years ago

Hi @stephenroller, can you suggest the steps needed to train a BlenderBot in other languages such as Vietnamese, Japanese, etc.? Or can you suggest other agents for this task?

lh0x00 commented 3 years ago

Ref: https://github.com/facebookresearch/ParlAI/issues/2830#issuecomment-656860542. I have a Vietnamese dataset and I want to learn about BlenderBot.

Hilal-Urun commented 3 years ago

Hi again @stephenroller. When I tried to fine-tune the BART model on Turkish Wikipedia data, it failed with `RuntimeError: CUDA error: device-side assert triggered`.

stephenroller commented 3 years ago

That usually happens when you have an out-of-bounds index into one of your embedding tables. It could be either the position embeddings (your truncate arguments are too long) or the token embeddings (something is wrong with the dict).
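One way to narrow this down before involving the GPU is to validate the batch on CPU first: check that every token id fits inside the token embedding table and that no sequence exceeds the position embedding length. A framework-free sketch (the function name and the vocab/position numbers in the example are made up for illustration):

```python
def check_batch(token_id_batch, vocab_size, max_positions):
    """Raise a readable error instead of a CUDA device-side assert.

    token_id_batch: list of lists of ints (one inner list per example).
    vocab_size:     size of the token embedding table.
    max_positions:  size of the position embedding table.
    """
    for i, ids in enumerate(token_id_batch):
        if len(ids) > max_positions:
            raise ValueError(
                f"example {i}: length {len(ids)} exceeds "
                f"max positions {max_positions}"
            )
        bad = [t for t in ids if not 0 <= t < vocab_size]
        if bad:
            raise ValueError(f"example {i}: token ids out of range: {bad}")


# Made-up sizes: a 32000-id vocab and 512 positions.
check_batch([[5, 17, 31999]], vocab_size=32000, max_positions=512)  # passes
```

If this check fails on ids produced by a newly swapped-in tokenizer, the dictionary and the model's embedding matrix are out of sync, which matches the "something is wrong with the dict" case above.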

Hilal-Urun commented 3 years ago

Okay, I'm going to check it again.