Elasticsearch can not retrieve Thai language

laifuchicago commented 4 years ago

To author: Our team is currently trying to run inference on Thai text. In the retrieval step, however, Elasticsearch returns nothing for Thai wh-questions. We have also tried Korean, Russian, and Chinese, and Elasticsearch retrieves those without problems; only Thai fails. Do you have any idea what might cause this? Thank you.

tholor commented 4 years ago

Ok so just to clarify:

you are using the ElasticsearchRetriever?!
you index Thai documents to Elasticsearch
when you call retriever.retrieve(question=) you don't get any results?

Is the reader working well (e.g. on single passages without retriever)?

Retriever

If the issue really is about retrieval with Elasticsearch, you might want to use a different tokenizer when indexing the documents there. Tokenization is the foundation for using BM25 effectively during retrieval. I am not an expert in Thai, but maybe this helps:
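To illustrate why tokenization matters here: Thai is written without spaces between words, so an analyzer that does not segment Thai into words can leave a short question and a longer document with no terms in common, and term-based scoring such as BM25 then has nothing to match. The snippet below is only a toy sketch of that effect. The example sentence, the three-word vocabulary, and the greedy longest-match segmenter are all assumptions made for illustration; this is not how Elasticsearch or any of its plugins segments text internally.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as a stand-in for an analyzer without Thai word segmentation."""
    return text.split()


def longest_match_tokens(text, vocab):
    """Toy greedy longest-match segmenter over a tiny, hand-written vocabulary (illustrative only)."""
    words = sorted((w for w in vocab if w), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for w in words:
            if text.startswith(w, i):
                tokens.append(w)
                i += len(w)
                break
        else:
            # Unknown character: emit it as its own token and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical example data: "cat eats fish" as the document, "cat" as the query.
doc = "แมวกินปลา"
query = "แมว"
vocab = {"แมว", "กิน", "ปลา"}

# Without word segmentation the whole sentence is one term, so nothing overlaps.
overlap_whitespace = set(whitespace_tokens(query)) & set(whitespace_tokens(doc))

# With word segmentation the query term appears among the document terms.
overlap_segmented = set(longest_match_tokens(query, vocab)) & set(longest_match_tokens(doc, vocab))

print("whitespace overlap:", overlap_whitespace)
print("segmented overlap:", overlap_segmented)
```

With whitespace splitting the overlap is empty, while the segmented version shares the query word with the document, which is the situation a Thai-aware tokenizer is meant to produce at index and query time.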

Reader

An earlier version of your issue also included some training code for the reader. How many question-answer pairs do you have in your "squad536_th_fix.json", and how did you create it? Generally, using the multilingual xlm-roberta model is a good approach. Here are a few directions that could be worth exploring:

If your Thai dataset is small, it can be beneficial to first train xlm-roberta on the English SQuAD dataset and then continue on your Thai dataset.
You can also try some public datasets (e.g. TyDi QA from Google, which includes ~11k examples for Thai): https://ai.googleblog.com/2020/02/tydi-qa-multilingual-question-answering.html
You say that your model doesn't return any answer. What eval metrics do you see during training? It could also make sense to lower the confidence threshold so that more "text answers" are returned, via FARMReader(no_ans_boost=-100)

tholor commented 4 years ago

@laifuchicago any update on this? Did you manage to resolve the issue?

laifuchicago commented 4 years ago

To author: Yes, I used the ICU_tokenizer plugin and that solved the issue. Thank you for your help.
Jonathan Sung
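For readers hitting the same problem, here is a minimal sketch of what an index definition using the ICU tokenizer might look like. It assumes the analysis-icu plugin is installed on the Elasticsearch node (bin/elasticsearch-plugin install analysis-icu) and that the index uses the typeless mapping format of Elasticsearch 7.x. The analyzer name "thai_icu" and the field name "text" are hypothetical; the thread does not show the settings that were actually used.

```python
import json


def build_icu_index_body(text_field="text"):
    """Build an index body whose text field is analyzed with the ICU tokenizer.

    The analyzer name "thai_icu" and the default field name are illustrative
    assumptions, not values taken from the thread.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "thai_icu": {
                        "type": "custom",
                        # Provided by the analysis-icu plugin.
                        "tokenizer": "icu_tokenizer",
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                text_field: {"type": "text", "analyzer": "thai_icu"}
            }
        },
    }


# Print the JSON body that would be sent when creating the index.
print(json.dumps(build_icu_index_body(), indent=2, ensure_ascii=False))
```

The body would be sent when creating the index (PUT on the index name), before any Thai documents are indexed, since the analyzer of an existing text field cannot be changed in place and documents already indexed would need to be reindexed.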

deepset-ai / haystack

Elasticsearch can not retrieve Thai language #120