asahi417 / lm-question-generation

Multilingual/multidomain question generation datasets, models, and python library for question generation.
https://www.autoqg.net
MIT License

Generated questions truncated with models of the mT0 and mT5 families #7

Closed: alainloisel closed this issue 1 year ago

alainloisel commented 1 year ago

While testing the French models, both on Hugging Face and on my own machine, I found that many of them always cut the generated text off mid-sentence, at around 70 to 80 characters. I have seen this with at least the following models: lmqg/mt5-small-frquad-qg-ae, lmqg/mt5-small-frquad-qg, and lmqg/mt5-small-frquad-qag. It also happens for German with mt5-small-dequad-qg, and I ran into the same problem with the mT0 models.

I wonder whether this is the reason why the reported performance of these models is low...

As an example :

generate question: Le dessus des ailes a une couleur de fond noir opaque. Les ailes antérieures et postérieures sont traversées par une large bande médiane bleu turquoise semi-hyaline qui va de la zone tornale de l'aile postérieure à la zone apicale de l'aile antérieure. Cette bande est plus large en son milieu, plus ou moins verdâtre et maculaire à l'aile antérieure, et la partie de la bande qui traverse les espaces 6, 7 et 8 de l'aile postérieure est blanchâtre. L'aile postérieure comporte par ailleurs une série de minces lunules submarginales bleues

Answer: Quelle est la couleur de la bande des ailes antérieures et postérieures (not completed)

The output stops at around 80 characters.
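For anyone who wants to check this locally, here is a minimal sketch in Python. It rests on an assumption that nothing in this thread confirms: that the cut-off comes from a cap on the number of generated tokens rather than from the model itself. The helper names `looks_truncated` and `generate_question` are hypothetical, and the input is passed to the model exactly as in the example above, with the `generate question:` prefix. The only library calls used are the standard `transformers` ones (`AutoTokenizer`, `AutoModelForSeq2SeqLM`, `generate` with `max_new_tokens`).

```python
def looks_truncated(question: str) -> bool:
    """Rough heuristic: a complete question should end with a question mark."""
    return not question.rstrip().endswith("?")


def generate_question(text: str,
                      model_name: str = "lmqg/mt5-small-frquad-qg",
                      max_new_tokens: int = 128) -> str:
    """Generate with an explicit output budget (downloads the model on first use).

    Assumption: the truncation is caused by a generation length cap.
    If the output is still cut off with a large budget, the cause lies elsewhere.
    """
    # Imported lazily so the heuristic above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    # max_new_tokens bounds only the generated tokens, not the input length.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Running `generate_question` on the same input with a small and a large `max_new_tokens`, then passing both results to `looks_truncated`, should show whether the length budget is responsible. The question-mark check is only a heuristic and will miss questions that end in another way.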

asahi417 commented 1 year ago

I had overlooked this issue, but it looks like a pretty interesting finding. We now have a few more variants of those multilingual models, so I wonder whether this still holds for the new models.

As for the German models, we are aware of their poor quality. It might be due to the small number of training instances in the German QA dataset compared to the other languages.