cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0
616 stars, 191 forks

How to use SQuAD for a Chinese (closed-domain) QA task #308

Open weinixuehao opened 4 years ago

weinixuehao commented 4 years ago

I have three questions. First: Can I directly use SQuAD for a Chinese (closed-domain) QA task?

Second: If the first is not possible, is the best solution to use run_squad.py to fine-tune a BERT model on a Chinese dataset in the same format as SQuAD?

PS: Where can I find a Chinese dataset in the same format as SQuAD, if I end up using the second solution?

andrelmfarias commented 4 years ago

Hi, answering your questions:

First: Can I directly use SQuAD for a Chinese (closed-domain) QA task?

I don't really understand the question. SQuAD is a QA dataset in English, so you would need a Chinese version of a QA dataset. Maybe you are asking whether you can use BERT for Chinese? If so, you should try BERT multilingual, although I am not sure about its performance.

Second: If the first is not possible, is the best solution to use run_squad.py to fine-tune a BERT model on a Chinese dataset in the same format as SQuAD?

To fine-tune a BERT model on a Chinese dataset, I advise you to use the run_squad.py example in the Hugging Face repository with the bert-base-multilingual-(un)cased version.
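For reference, here is a minimal sketch of what a SQuAD-style training file could look like for Chinese text, assuming the SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas` → `answers`). The helper name `to_squad_example`, the `qa_id` value, and the sample sentence are hypothetical and only there for illustration; they are not taken from cdQA or from any real dataset.

```python
import json


def to_squad_example(title, context, question, answer_text, qa_id):
    """Build one SQuAD v1.1-style article entry (hypothetical helper)."""
    # answer_start is a character offset into the context, not a token index.
    answer_start = context.find(answer_text)
    if answer_start == -1:
        raise ValueError("answer text must appear verbatim in the context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": [
                            {"text": answer_text, "answer_start": answer_start}
                        ],
                    }
                ],
            }
        ],
    }


# Illustrative sample data, not taken from any real dataset.
context = "北京是中华人民共和国的首都。"
example = to_squad_example(
    title="北京",
    context=context,
    question="中华人民共和国的首都是哪里？",
    answer_text="北京",
    qa_id="sample-0001",
)

dataset = {"version": "1.1", "data": [example]}

# ensure_ascii=False keeps the Chinese characters readable in the output.
squad_json = json.dumps(dataset, ensure_ascii=False, indent=2)
```

Since the answer offset is counted in characters, it works the same way for Chinese text as for English, provided the answer appears verbatim in the context.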

PS: Where can I find a Chinese dataset in the same format as SQuAD, if I end up using the second solution?

Unfortunately, I don't have an answer to this question 😞

fmikaelian commented 4 years ago

Hi @weinixuehao

You can use cdQA in Chinese, but it requires some additional work. The idea is to:

Then you should be able to do closed-domain QA on your own Chinese documents.

weinixuehao commented 4 years ago

@andrelmfarias @fmikaelian Thanks for answering my questions! This is what I need.

weinixuehao commented 4 years ago

Hi @fmikaelian, the SQuAD dataset (around 30 MB) is much smaller than the DuReader dataset (around 1~2 GB per file). Do I need to convert the entire DuReader dataset to a SQuAD-like format for training? It may take a long time to convert and train.