deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Elasticsearch can not retrieve Thai language #120

Closed laifuchicago closed 4 years ago

laifuchicago commented 4 years ago

To author: Our team is currently trying to run inference in Thai. In the retrieval step, however, Elasticsearch cannot retrieve Thai documents for Wh-questions; it returns nothing. We have tried Korean, Russian, and Chinese, and ES retrieves those fine; only Thai fails. Do you have any idea? Thank you.

tholor commented 4 years ago

Ok so just to clarify:

Retriever: If the issue is really about retrieval using Elasticsearch, you might want to use a different tokenization when indexing the documents there. Tokenization is the foundation for using BM25 effectively during retrieval. I am not an expert in Thai, but maybe this helps:

Reader: In an earlier version of your issue you also included some training code for the reader. How many question-answer pairs do you have in your "squad536_th_fix.json", and how did you create it? Generally, using the multilingual xlm-roberta model is a good approach. Here are a few directions that could be worth exploring:

tholor commented 4 years ago

@laifuchicago any update on this? Did you manage to resolve the issue?

laifuchicago commented 4 years ago

To author: Yes, I used the ICU_tokenizer plugin and that solved the issue. Thank you for your help. Jonathan Sung
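
For readers landing here with the same problem, the fix mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact configuration used in this thread: it assumes the Elasticsearch `analysis-icu` plugin is installed (it provides `icu_tokenizer`), and the analyzer name `thai_icu`, the index name `document`, and the field name `text` are hypothetical placeholders. The code only builds the index definition as a Python dict and prints it as JSON; sending it to the cluster is left to whatever client you use.

```python
import json


def build_icu_index_body(text_field="text"):
    """Build an index definition whose text field is analyzed with icu_tokenizer.

    Assumptions (not from the thread): the analysis-icu plugin is installed,
    and `text_field` is the field holding the document text.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Hypothetical analyzer name; any name works.
                    "thai_icu": {
                        "type": "custom",
                        # Tokenizer provided by the analysis-icu plugin.
                        "tokenizer": "icu_tokenizer",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # Index and search this field with the ICU-based analyzer.
                text_field: {"type": "text", "analyzer": "thai_icu"}
            }
        },
    }


if __name__ == "__main__":
    body = build_icu_index_body()
    # The JSON below could be used as the request body when creating the
    # index (for example, PUT /document), before any documents are indexed.
    print(json.dumps(body, indent=2, ensure_ascii=False))
```

The analyzer has to be in place before documents are indexed, so an existing index would need to be recreated and its documents re-indexed for the new tokenization to take effect.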