laifuchicago closed this issue 4 years ago
Ok so just to clarify:
Retriever: If the issue is really about retrieval with Elasticsearch, you might want to use a different tokenization when indexing the documents there. Tokenization is the foundation for using BM25 effectively during retrieval. I am not an expert in Thai, but maybe this helps:
Reader: In an earlier version of your issue you also included some training code for the reader. How many question-answer pairs do you have in your "squad536_th_fix.json", and how did you create it? Generally, taking the multilingual xlm-roberta model is a good approach. Here are a few directions that could be worth exploring:
@laifuchicago any update on this? Did you manage to resolve the issue?
To author: Yes, I used the ICU_tokenizer plugin and that solved the issue. Thank you for your help. Jonathan Sung
To author: Currently, our team is trying to run inference in Thai. In the retrieval step, however, Elasticsearch cannot retrieve Thai documents for Wh-questions; it returns nothing. We have tried Korean, Russian, and Chinese, and ES can retrieve those; only Thai fails. Do you have any idea? Thank you.
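For anyone landing here later, the ICU tokenizer fix mentioned above might look roughly like the sketch below. It only builds the index definition as a plain dict and does not talk to a cluster. It assumes the `analysis-icu` plugin is installed on the Elasticsearch nodes and a typeless mapping (Elasticsearch 7 or later). The analyzer name `thai_icu`, the field name `text`, and the helper `build_icu_index_body` are made up for illustration and are not taken from this thread.

```python
import json


def build_icu_index_body(text_field: str = "text") -> dict:
    """Build an index definition whose text field is tokenized with ICU.

    Assumptions (not from the thread): the analyzer is called "thai_icu"
    and the document text lives in `text_field`.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Custom analyzer using the tokenizer provided by the
                    # analysis-icu plugin, which segments Thai text into
                    # words even though Thai has no spaces between them.
                    "thai_icu": {
                        "type": "custom",
                        "tokenizer": "icu_tokenizer",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # Use the same analyzer at index and query time so that
                # BM25 matches on word-level tokens.
                text_field: {"type": "text", "analyzer": "thai_icu"}
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(build_icu_index_body(), indent=2, ensure_ascii=False))
```

The printed JSON could be used as the request body when creating the index. Documents already indexed with a different analyzer would need to be reindexed for the new tokenization to take effect.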