artitw / text2text

Text2Text: Crosslingual NLP/G toolkit
https://discord.gg/eHaaUuWpTc

Question generation for non-English languages #5

Open secsrexion opened 4 years ago

secsrexion commented 4 years ago

Hello, I hope you are doing fine. First, thank you for your contributions on question generation; I have a question, if I may. I'm trying to build a question generation system for a non-English language, and I was planning to use UniLM (the multilingual MiniLM version) because BERT is not really built for text generation. Since you have experience with this, how would you suggest going about it, and am I on the right path?

Thank you in advance for your help!

artitw commented 4 years ago

What language are you considering? Try looking up Cross-Lingual Natural Language Inference (XNLI) and Cross-Lingual Question Answering (MLQA) to fine-tune MiniLM. If you require something different, consider procuring your own dataset for fine-tuning.
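For reference, both datasets can be pulled down with the Hugging Face datasets library; the sketch below is only a hypothetical illustration (the Arabic configurations and config names are assumptions for the example, not part of text2text):

from datasets import load_dataset

# XNLI: cross-lingual NLI pairs; "ar" selects the Arabic configuration (example only)
xnli = load_dataset("xnli", "ar")
# MLQA: cross-lingual QA; config names follow mlqa.<context_lang>.<question_lang>
mlqa = load_dataset("mlqa", "mlqa.ar.ar")

print(xnli["train"][0])
print(mlqa["test"][0])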

secsrexion commented 4 years ago

Hello, I'm just having some trouble creating the top layer for seq2seq generation. If you could explain in some detail how I can create it on my own, that would be great.

artitw commented 4 years ago

Why do you have to create the top layer? Are you using the available MiniLM code?

secsrexion commented 4 years ago

I want to understand it and rebuild it on my own. Right now the seq2seq code of MiniLM is not adapted to the multilingual version, and if the multilingual model gives some good results I may pre-train another version for this language.

Thanks for your replies, I really appreciate your help.

artitw commented 4 years ago

If you’re looking to customize the question generation component, take a look at https://github.com/artitw/text2text/blob/db07ee9814d0e360774e62b5d697736e7f7d6715/text2text/pytorch_pretrained_bert/modeling.py#L2107

However, what aspects of your multilingual approach would require adaptation?
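For readers who just want to see the shape of such a "top layer", here is a rough, hypothetical sketch that warm-starts a seq2seq model from a multilingual BERT encoder and decoder using Hugging Face Transformers; it is not the code linked above, only an illustration of the same idea:

from transformers import BertTokenizer, EncoderDecoderModel

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# Tie a BERT encoder to a BERT decoder; cross-attention layers are added to the decoder
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased")
model.config.decoder_start_token_id = tok.cls_token_id
model.config.pad_token_id = tok.pad_token_id

# After fine-tuning on (passage, question) pairs, generation would look like this:
inputs = tok("Some passage to ask a question about.", return_tensors="pt")
out = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask, max_length=32)
print(tok.decode(out[0], skip_special_tokens=True))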

secsrexion commented 4 years ago

The main problem is that the multilingual version is not as good as the native one, and NLG is a data-hungry task, as you know.

artitw commented 4 years ago

My impression is that you could use more training data for what you’re trying to achieve. Am I missing something?

thusithaC commented 4 years ago

Hi @secsrexion, how is your progress with non-English question generation? We are also interested in the Chinese QG task and are wondering how much work we might have to put in to adapt the code provided by artitw.

BTW great work and thanks @artitw for sharing the code! Have you published your work anywhere?

secsrexion commented 4 years ago

Hello @thusithaC, I had to stop development. I was using a machine-translated version of SQuAD and discovered later that it was low quality. Now I'm trying to gather a good dataset to continue.

Generally, from what I found, using a multilingual version of UniLM is a bad choice due to the lack of a rich, reliable dataset in the training process; I was getting about 80% of each phrase marked as UNK by the tokenizer. I didn't test it for Chinese, so I hope you will find better results, but if you want a good piece of advice: we need to find or rebuild the training code of UniLM to create a native version of the language model.
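A quick way to reproduce the kind of tokenizer coverage problem described above is to tokenize sample text and count how often the unknown token appears; a minimal sketch with Hugging Face Transformers, using multilingual BERT's vocabulary as a stand-in for whichever checkpoint is being evaluated:

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")  # stand-in vocabulary
sample = "هذه جملة عربية قصيرة للتجربة."  # a short Arabic test sentence (example only)
tokens = tok.tokenize(sample)
unk_ratio = tokens.count(tok.unk_token) / max(len(tokens), 1)
print(tokens)
print(f"UNK ratio: {unk_ratio:.0%}")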

secsrexion commented 4 years ago

@artitw I'm sorry for my late answer.

I was facing trouble with the multilingual version and with the quality of the dataset.

Now I'm trying to develop a reliable dataset for Arabic question/answer pairs, and I'm searching for a way to train a new native version of UniLM. Any ideas?

thusithaC commented 4 years ago

@secsrexion Thanks for the reply. I saw your post on the UniLM GitHub as well :) I wonder whether the quality issue you face is because the multilingual model is based on "MiniLM", i.e. the smaller model, whereas this codebase is based on the full English UniLM model, which is vastly superior?

secsrexion commented 4 years ago

Hi, I think it's because of the small amount of Arabic training data they used in the first place.

jacampo commented 3 years ago

Hi, is it possible to use the model in Spanish? If not, how could I train the program?

artitw commented 3 years ago

@thusithaC @secsrexion @jacampo I am looking into making a multilingual model to see if and how it can be done. As @secsrexion pointed out, the low amounts of data need to be addressed. I will keep you all updated.

thusithaC commented 3 years ago

Awesome! Thanks.

artitw commented 3 years ago

An alternative solution if you need something immediately is to translate to English and then use this model.
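As a rough illustration of that workaround, one could run a machine-translation step first and then pass the English text to the question generator. The sketch below is only an assumption-laden example: the MarianMT Spanish-to-English checkpoint is used purely for illustration, and the Questioner interface is the one shown at the end of this thread:

from transformers import MarianMTModel, MarianTokenizer
from text2text import Questioner

mt_name = "Helsinki-NLP/opus-mt-es-en"   # example translation checkpoint
mt_tok = MarianTokenizer.from_pretrained(mt_name)
mt_model = MarianMTModel.from_pretrained(mt_name)

spanish = ["La fotosíntesis convierte la luz solar en energía química."]
batch = mt_tok(spanish, return_tensors="pt", padding=True)
english = mt_tok.batch_decode(mt_model.generate(**batch), skip_special_tokens=True)

# src_lang='en' is an assumption; see the usage example at the bottom of this thread
print(Questioner().predict(english, src_lang='en'))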

jacampo commented 3 years ago

@artitw OK, thanks. The translation could be interesting too. But what if I had a dataset and a BERT model in another language? Should the structure of the code be the same, or is there any other difference between languages?

artitw commented 3 years ago

@jacampo which BERT model are you referring to? If it uses WordPiece tokenization, I cannot think of any differences in the code used.

jacampo commented 3 years ago

I found one in Spanish: https://github.com/dccuchile/beto But I don't know if it can be done with it.

You use BertForSeq2SeqDecoder, right? What is the difference between that and BertModel or BertForPreTraining? Sorry for bothering you with so many questions.

Edit: OK, I see you start with bert-base-cased, so my question is resolved. It is a lot of information at once; can you recommend a simple guide to understanding the models and how to use them?
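To check artitw's WordPiece condition for BETO, one can simply load its tokenizer and inspect it; a minimal sketch, assuming dccuchile/bert-base-spanish-wwm-cased is the published BETO checkpoint:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
print(tok.__class__.__name__)   # a BERT-style tokenizer class, i.e. WordPiece vocabulary
print(tok.tokenize("¿Dónde nació Gabriel García Márquez?"))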

artitw commented 3 years ago

@jacampo glad you figured it out. Yes, it is indeed confusing. It sounds like you would find a fine-tuning guide useful. I can think about how that might be done. In the meantime, if you find something that works please share back here.

jacampo commented 3 years ago

@artitw Thanks, I'll let you know.

artitw commented 3 years ago

Anyone want to work together on this? I’ve started an approach to multilingual question generation and summarization but not had enough time to run experiments. I could provide some guidance for anyone interested in collaborating, as long as the work is contributed back to open source here. The approach would be based on cross-lingual models as I describe here: https://www.youtube.com/watch?v=caZLVcJqsqo

artitw commented 3 years ago

Multilingual question generation is now available. Check out the latest version:

from text2text import Questioner
qr = Questioner()
qr.predict(["很喜欢陈慧琳唱歌。"], src_lang='zh')
[('我喜欢做什么?', '唱歌')]
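For reference, the input sentence means "(I) really like Kelly Chen's singing," and the returned (question, answer) pair translates roughly to ("What do I like to do?", "singing").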