asahi417 / lm-question-generation

Multilingual, multidomain question generation datasets, models, and a Python library for question generation.
https://www.autoqg.net
MIT License
313 stars, 30 forks

Do you have a plan to support Chinese? #11

Closed (yanshuaibupt closed this issue 10 months ago)

asahi417 commented 1 year ago

Hi, we are eager to extend the set of languages we support. As long as a good QA dataset exists for a language, we can easily train a QAG model on it. For Chinese, however, we haven't found a good QA dataset so far, and that is the current bottleneck.

AntonOfTheWoods commented 1 year ago

@yanshuaibupt, I am also very interested in doing this for Chinese, so if you would like to collaborate on it, get in touch!

Antlerkeke commented 11 months ago

Maybe you can use this: https://huggingface.co/datasets/lijingxin/squad_zen

asahi417 commented 10 months ago

Thanks for sharing the resource! I'll have a look at the dataset and start training soon if it's in the same format as the other datasets!

asahi417 commented 10 months ago

We are training QAG models on the Chinese SQuAD, and they should be ready by the end of this week!

asahi417 commented 10 months ago

FYI, these are the datasets converted into our QG and QAG formats:
https://huggingface.co/datasets/lmqg/qag_zhquad
https://huggingface.co/datasets/lmqg/qg_zhquad
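For readers curious what converting a SQuAD-style dataset into a QG format might involve, here is a minimal sketch. It is not the project's actual conversion script: the helper name `squad_to_qg` is hypothetical, the output field names and the `<hl>` answer marker are assumptions that should be checked against the dataset cards, and the sample question is purely illustrative.

```python
from typing import Dict, List

HL = "<hl>"  # assumed highlight marker; verify against the dataset card


def squad_to_qg(record: Dict) -> List[Dict[str, str]]:
    """Convert one SQuAD-style record into QG-style rows (hypothetical helper)."""
    context = record["context"]
    answers = record["answers"]
    rows = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        if context[start:end] != text:
            continue  # skip answers whose offsets do not match the context
        rows.append({
            "question": record["question"],
            "answer": text,
            "paragraph": context,
            # Paragraph with the answer span wrapped in highlight markers.
            "paragraph_answer": f"{context[:start]}{HL} {text} {HL}{context[end:]}",
        })
    return rows


# Illustrative record; the question is made up for this sketch.
context = "在枯草芽孢杆菌中,大约40个基因是培养能力所必需的。"
answer = "大约40个基因"
example = {
    "context": context,
    "question": "枯草芽孢杆菌中培养能力需要多少个基因?",
    "answers": {"text": [answer], "answer_start": [context.index(answer)]},
}
rows = squad_to_qg(example)
print(rows[0]["paragraph_answer"])
```

Rows whose answer offsets do not line up with the context are skipped rather than repaired, which keeps the sketch short; a real pipeline would probably log or fix them.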

asahi417 commented 10 months ago

Hi, I'll close this thread and open another one specifically for new language requests, so please follow that one instead.

https://github.com/asahi417/lm-question-generation/issues/20

Meanwhile, Chinese QAG is now available on https://autoqg.net/ and in lmqg! With lmqg, you can use it as shown below.

```python
from lmqg import TransformersQG

# Load the Chinese answer-extraction (model_ae) and question-generation (model) checkpoints.
model = TransformersQG(language="zh", model_ae="lmqg/mt5-small-zhquad-ae", model="lmqg/mt5-small-zhquad-qg")
context = "与转导或结合不同,转化依赖于大量的细菌基因产物,这些基因产物专门相互作用来完成这个复杂的过程,因此转化显然是细菌对DNA转移的适应。为了使细菌结合、吸收供体DNA并将其重组为自己的染色体,它必须首先进入一种称为能力的特殊生理状态(见自然能力)。在枯草芽孢杆菌中,大约40个基因是培养能力所必需的。枯草芽孢杆菌转化过程中转移的DNA长度可以在染色体的三分之一到整个染色体之间。转化在细菌物种中似乎很常见,到目前为止,已知至少有60种物种具有自然转化能力。自然界能力的发展通常与应激性环境条件有关,似乎是一种促进受体细胞DNA损伤修复的适应。"
# Generate a list of (question, answer) tuples from the context.
qa_pairs = model.generate_qa(context)
print(qa_pairs)
# [('在染色体中发现的DNA长度是多少?', '枯草芽孢杆菌转化过程中转移的DNA长度可以在染色体的三分之一到整个染色体之间。')]
```
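Since the output above is a list of (question, answer) tuples, it may be handy to store the pairs as JSON Lines for later review. The following is a rough sketch that uses only the standard library; `qa_pairs_to_jsonl` is a hypothetical helper, not part of lmqg, and the sample input is copied from the output shown in this thread, with a deliberate duplicate added.

```python
import json
from typing import Iterable, List, Tuple


def qa_pairs_to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (question, answer) tuples as JSON Lines, dropping empties and duplicates."""
    seen = set()
    lines: List[str] = []
    for question, answer in pairs:
        question, answer = question.strip(), answer.strip()
        if not question or not answer or (question, answer) in seen:
            continue
        seen.add((question, answer))
        # ensure_ascii=False keeps the Chinese text readable in the output.
        lines.append(json.dumps({"question": question, "answer": answer}, ensure_ascii=False))
    return "\n".join(lines)


# Sample pairs in the shape shown above; the second entry is a deliberate duplicate.
pairs = [
    ("在染色体中发现的DNA长度是多少?", "枯草芽孢杆菌转化过程中转移的DNA长度可以在染色体的三分之一到整个染色体之间。"),
    ("在染色体中发现的DNA长度是多少?", "枯草芽孢杆菌转化过程中转移的DNA长度可以在染色体的三分之一到整个染色体之间。"),
]
jsonl = qa_pairs_to_jsonl(pairs)
print(jsonl)
```

Deduplicating on the exact (question, answer) pair is a deliberately simple choice; near-duplicate questions would need a fuzzier comparison.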

We're training a few more models that could improve the quality, and will announce them once they are ready.