Closed — SeekPoint closed this issue 7 years ago
You don't need to remove the embedding layer to train a character model. You can simply switch your tokenizer in preprocessing to split by character. Kim et al. also find that character embeddings (rather than sparse coding) are helpful (https://arxiv.org/abs/1508.06615).
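To illustrate the idea, here is a minimal sketch of character-level splitting in preprocessing. This is not DrQA's actual tokenizer API — just the concept that each character becomes its own token, which then gets its own row in the embedding matrix:

```python
# Sketch only: split text into characters instead of words.
# Not DrQA's tokenizer interface; the names here are illustrative.
def char_tokenize(text):
    """Return each non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]

tokens = char_tokenize("character model")
# Every character in `tokens` would be looked up in a character
# embedding table, just as words are looked up in a word table.
```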
The bottom line, however, is that if you want to change the architecture of the DocReader, you can always subclass it, use a different network architecture than the one defined in rnn_reader.py, provide an entirely new class to the pipeline, etc.
"simply switch your tokenizer in preprocessing to split by character " is easy, but should I provide the embedding file like a "Chinese version" glove.840b.300d, and how to make it?
fasttext.cc has Chinese embeddings. We do not provide Chinese training corpora here though. DrQA only supports English.
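For reference, the pretrained fastText vectors are distributed in a plain-text `.vec` format: a header line with the vocabulary size and dimension, then one line per token with its values. A minimal parser might look like this (a sketch, not DrQA's own loader):

```python
import io

def load_vec(file_obj):
    """Parse fastText .vec text format: first line is 'count dim',
    then one 'token v1 v2 ... v_dim' line per vector."""
    header = file_obj.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

# Tiny in-memory example in the same format:
sample = io.StringIO("2 3\n你 0.1 0.2 0.3\n好 0.4 0.5 0.6\n")
vecs, dim = load_vec(sample)
```

Note that GloVe-style text files have no header line, so if a loader expects GloVe format, the fastText header may need to be skipped — worth checking against the loader code you are using.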
For example, suppose we want to train the character model on an Asian language like Chinese.