How to use SQuAD for a Chinese (closed-domain) QA task

cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.

https://cdqa-suite.github.io/cdQA-website/

Apache License 2.0


How to use SQuAD for a Chinese (closed-domain) QA task #308

Open weinixuehao opened 4 years ago

weinixuehao commented 4 years ago

I have three questions. First: Can I directly use SQuAD for a Chinese (closed-domain) QA task?

Second: If the first option is not possible, is the best solution to use run_squad.py to fine-tune a BERT model on a Chinese dataset that has the same format as SQuAD?

PS: If I end up using the second solution, where can I find a Chinese dataset in the same format as SQuAD?

andrelmfarias commented 4 years ago

Hi, answering your questions:

First: Can I directly use SQuAD for a Chinese (closed-domain) QA task?

I don't really understand the question. SQuAD is a QA dataset in English, so you would need a Chinese QA dataset. Perhaps you are asking whether you can use BERT for Chinese? If so, you should try multilingual BERT, though I am not sure about its performance.

Second: If the first option is not possible, is the best solution to use run_squad.py to fine-tune a BERT model on a Chinese dataset that has the same format as SQuAD?

To fine-tune a BERT model on a Chinese dataset, I advise you to use the run_squad.py example in Hugging Face's repository with the bert-base-multilingual-(un)cased model.

PS: If I end up using the second solution, where can I find a Chinese dataset in the same format as SQuAD?

Unfortunately, I don't have an answer to this question 😞

fmikaelian commented 4 years ago

Hi @weinixuehao

You can use cdQA in Chinese, but it requires some additional work. The idea is to:

1. Find a SQuAD-like dataset in Chinese. It should have the same JSON schema as SQuAD. For example, you could use the DuReader QA dataset released by Baidu, but you might need to convert it to SQuAD format.
2. Use our notebook to train the reader on your Chinese SQuAD-like dataset. You should instantiate the BERT classes with the Chinese pre-trained language model bert-base-chinese, then fine-tune on your Chinese SQuAD-like dataset.
3. Once your reader is built, couple it with a retriever that is adapted to the Chinese language (Chinese tokenizer, Chinese stopwords, etc.).
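
For the dataset conversion step, here is a minimal sketch of what "the same JSON schema as SQuAD" means in practice. It assumes you have already extracted your Chinese data into simple records; the `to_squad` helper and the record keys (`id`, `title`, `context`, `question`, `answer`) are hypothetical names chosen for illustration and are not DuReader's actual schema, so a real converter would first need to map DuReader's fields onto them.

```python
import json


def to_squad(records, version="1.1"):
    """Group flat QA records into a SQuAD v1.1-style dict.

    `records` is a hypothetical intermediate format: an iterable of dicts
    with the keys "id", "title", "context", "question" and "answer".
    """
    articles = {}  # title -> {context -> list of qas}
    for rec in records:
        # SQuAD v1.1 answers are spans of the context, located by a
        # character offset. Records whose answer is not found are skipped.
        start = rec["context"].find(rec["answer"])
        if start == -1:
            continue
        paragraphs = articles.setdefault(rec["title"], {})
        qas = paragraphs.setdefault(rec["context"], [])
        qas.append({
            "id": str(rec["id"]),
            "question": rec["question"],
            "answers": [{"text": rec["answer"], "answer_start": start}],
        })

    data = [
        {
            "title": title,
            "paragraphs": [
                {"context": context, "qas": qas}
                for context, qas in paragraphs.items()
            ],
        }
        for title, paragraphs in articles.items()
    ]
    return {"version": version, "data": data}


# Toy records (made up for illustration only).
context = "北京是中华人民共和国的首都。"
records = [
    {"id": 1, "title": "北京", "context": context,
     "question": "中国的首都是哪里？", "answer": "北京"},
    {"id": 2, "title": "北京", "context": context,
     "question": "北京是哪个国家的首都？", "answer": "中华人民共和国"},
    # This answer does not occur in the context, so the record is skipped.
    {"id": 3, "title": "北京", "context": context,
     "question": "中国最大的城市是哪里？", "answer": "上海"},
]

squad = to_squad(records)
print(json.dumps(squad, ensure_ascii=False, indent=2))
```

Skipping records whose answer is not a literal span of the context is a simplification; free-form answers would need some span-matching heuristic before they fit this schema.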

Then you should be able to do closed-domain QA on your own Chinese documents.
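
To illustrate why the retriever needs a Chinese-aware tokenizer: Chinese text has no spaces between words, so whitespace tokenization yields almost no usable terms. The sketch below is a plain-Python TF-IDF ranker over character bigrams, which is one simple stand-in for a proper word segmenter. It is not cdQA's retriever API; every function name here (`char_ngrams`, `build_index`, `retrieve`) is hypothetical, and the documents are made-up examples.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs):
    """Return (idf, vectors): smoothed IDF weights and TF-IDF doc vectors."""
    doc_counts = [Counter(char_ngrams(d)) for d in docs]
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in counts.items()}
               for counts in doc_counts]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, idf, vectors, top_k=1):
    """Return the top_k documents most similar to the query."""
    counts = Counter(char_ngrams(query))
    # Query terms never seen in the corpus carry no weight and are dropped.
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = sorted(
        ((cosine(q_vec, vec), i) for i, vec in enumerate(vectors)),
        reverse=True,
    )
    return [docs[i] for _, i in scored[:top_k]]


docs = [
    "北京是中华人民共和国的首都。",
    "长江是中国最长的河流。",
    "熊猫主要生活在四川的山区。",
]
idf, vectors = build_index(docs)
best = retrieve("中国最长的河流是什么？", docs, idf, vectors)
print(best[0])
```

Character bigrams avoid any external dependency; a word segmenter plus a Chinese stopword list would likely give cleaner terms, at the cost of an extra library.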

weinixuehao commented 4 years ago

@andrelmfarias @fmikaelian Thanks for answering my questions! This is what I need.

weinixuehao commented 4 years ago

Hi @fmikaelian, the SQuAD dataset (around 30 MB) is much smaller than the DuReader dataset (around 1-2 GB per file). Do I need to convert the entire DuReader dataset to a SQuAD-like dataset for training? Converting and training on all of it may take a long time.