asahi417 / lm-question-generation

Multilingual/multidomain question generation datasets, models, and a Python library for question generation.
https://www.autoqg.net
MIT License

Adding new languages #20

Open. asahi417 opened this issue 10 months ago

asahi417 commented 10 months ago

Here's a thread for adding more languages to lmqg as well as https://autoqg.net/. If you would like to contribute, please comment here with a potential QA dataset we can use to train a QAG model for that language. We need at least 10k QA pairs for model training.

e.g., Language: Turkish, Dataset: https://github.com/TQuad/turkish-nlp-qa-dataset, Size: 8,308

asahi417 commented 10 months ago

Language: Bengali, Dataset: https://huggingface.co/datasets/csebuetnlp/squad_bn, Size: 127,771/2,502/2,504
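
For reference, here is a quick way to confirm the split sizes reported above; a minimal sketch, assuming the dataset loads with the Hugging Face datasets library under its default configuration:

from datasets import load_dataset

# load the Bengali SQuAD dataset from the Hugging Face Hub
dataset = load_dataset("csebuetnlp/squad_bn")

# print the number of QA pairs in each split (train / validation / test)
for split_name, split in dataset.items():
    print(split_name, len(split))
# expected: roughly 127,771 / 2,502 / 2,504, comfortably above the 10k-pair threshold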

asahi417 commented 10 months ago

Language: Chinese, Dataset: https://github.com/junzeng-pluto/ChineseSquad

asahi417 commented 10 months ago

Language: Chinese, Dataset: https://github.com/junzeng-pluto/ChineseSquad

Chinese QAG is now available on https://autoqg.net/ and in lmqg! With lmqg, you can use it as below:

from lmqg import TransformersQG

# load the question-answer generation model for Chinese
model = TransformersQG(language="zh")

# Chinese input passage (about bacterial transformation) to generate QA pairs from
context = "与转导或结合不同，转化依赖于大量的细菌基因产物，这些基因产物专门相互作用来完成这个复杂的过程，因此转化显然是细菌对DNA转移的适应。为了使细菌结合、吸收供体DNA并将其重组为自己的染色体，它必须首先进入一种称为能力的特殊生理状态（见自然能力）。在枯草芽孢杆菌中，大约40个基因是培养能力所必需的。枯草芽孢杆菌转化过程中转移的DNA长度可以在染色体的三分之一到整个染色体之间。转化在细菌物种中似乎很常见，到目前为止，已知至少有60种物种具有自然转化能力。自然界能力的发展通常与应激性环境条件有关，似乎是一种促进受体细胞DNA损伤修复的适应。"

# generate question-answer pairs from the passage
qa_pairs = model.generate_qa(context)
print(qa_pairs)
# [('在染色体中发现的DNA长度是多少?', '枯草芽孢杆菌转化过程中转移的DNA长度可以在染色体的三分之一到整个染色体之间。')]
pawanGithub10 commented 9 months ago

Language: Hindi, Dataset: https://github.com/google-deepmind/xquad/blob/master/xquad.hi.json. Please tell me in detail what activities need to be done to contribute.

asahi417 commented 9 months ago

Language: Hindi, Dataset: https://github.com/google-deepmind/xquad/blob/master/xquad.hi.json. Please tell me in detail what activities need to be done to contribute.

This is too small: I checked the dataset and there are 1,190 QA pairs in total. Ideally, there should be around 10k pairs, as we are going to train relatively small models (~300M parameters).
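
For context, the 1,190 figure can be reproduced by counting QA pairs in the SQuAD-format file; a minimal sketch, assuming the raw-file URL below mirrors the blob link above:

import json
import urllib.request

# fetch the XQuAD Hindi file (SQuAD format) and parse it
url = "https://raw.githubusercontent.com/google-deepmind/xquad/master/xquad.hi.json"
with urllib.request.urlopen(url) as response:
    squad = json.load(response)

# count QA pairs across all articles and paragraphs
n_pairs = sum(
    len(paragraph["qas"])
    for article in squad["data"]
    for paragraph in article["paragraphs"]
)
print(n_pairs)  # 1190, well below the ~10k pairs needed to train a QAG model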