facebookresearch / DrQA

Reading Wikipedia to Answer Open-Domain Questions

how can we remove embedding layer ... #16

Closed SeekPoint closed 7 years ago

SeekPoint commented 7 years ago

For example, suppose we want to train a character-level model in an Asian language like Chinese.

ajfisch commented 7 years ago

You don't need to remove the embedding layer to train a character model. You can simply switch your tokenizer in preprocessing to split by character. Kim et al. still find that character embeddings (rather than sparse coding) are helpful (https://arxiv.org/abs/1508.06615).
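As a rough illustration of "split by character", here is a minimal character tokenizer sketch. The function name and behavior (dropping whitespace) are assumptions for illustration, not part of DrQA's tokenizer API:

```python
# Hypothetical character-level tokenizer: splits a string into single
# characters and drops whitespace. Illustrative only, not DrQA's API.
def char_tokenize(text):
    return [ch for ch in text if not ch.isspace()]

# Each CJK character becomes one token:
# char_tokenize("机器阅读") -> ["机", "器", "阅", "读"]
```

In DrQA's pipeline such a function would play the role a word tokenizer plays during preprocessing; everything downstream then treats characters as tokens.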

The bottom line, however, is that if you want to change the architecture of the DocReader, you can always subclass it, use a different network architecture than the one defined in rnn_reader.py, or provide an entirely new class to the pipeline.

SeekPoint commented 7 years ago

"Simply switch your tokenizer in preprocessing to split by character" is easy, but should I provide an embedding file, like a "Chinese version" of glove.840b.300d? And how do I make one?

ajfisch commented 7 years ago

fasttext.cc has Chinese embeddings. We do not provide Chinese training corpora here, though. DrQA only supports English.