danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Other languages than English #664

Open gius opened 8 months ago

gius commented 8 months ago

Is it possible to change the query and answer language for a particular Danswer instance?

Would it be possible to adapt to the user's query language?

Weves commented 8 months ago

Hey @gius! Yes, it should be possible. There are two main steps:

  1. Pick a new embedding model that handles your language of choice (see https://github.com/danswer-ai/danswer/blob/main/backend/danswer/search/search_nlp_models.py#L32 for where this is defined). If any documents are already indexed, you will also have to re-index all of them with the new model.
  2. Disable the re-ranking performed by the cross-encoder ensemble, since it will probably be hard to find a multilingual replacement for these models. You can do that by adding SKIP_RERANKING=true to your .env file.

Let me know if that makes sense and/or you have any follow-up questions!
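
For readers following along, here is a minimal sketch of how the two settings above could drive a search pipeline. It is not Danswer's actual code: `load_search_settings`, `search`, and the `retrieve`/`rerank` callables are hypothetical names used only for illustration, and the default of re-ranking being enabled is an assumption. Only the variable names SKIP_RERANKING and DOCUMENT_ENCODER_MODEL come from this thread.

```python
import os


def load_search_settings(env=None):
    """Read the two settings discussed above from an environment mapping.

    Hypothetical helper: defaults here are assumptions, not Danswer's.
    """
    env = os.environ if env is None else env
    return {
        # Name of the embedding model used to index and query documents.
        "embedding_model": env.get("DOCUMENT_ENCODER_MODEL", ""),
        # Only the literal string "true" (case-insensitive) disables re-ranking.
        "skip_reranking": env.get("SKIP_RERANKING", "false").strip().lower() == "true",
    }


def search(query, retrieve, rerank, settings):
    """Retrieve candidates, then re-rank them unless re-ranking is skipped."""
    candidates = retrieve(query)
    if settings["skip_reranking"]:
        # Keep the order produced by the embedding-based retrieval.
        return candidates
    return rerank(query, candidates)
```

With SKIP_RERANKING=true, results keep the order produced by the embedding model alone, which is why that model has to handle the target language well.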

alex-feel commented 8 months ago

Hi! There can be several scenarios pertaining to the language of the content and the queries:

  1. Monolingual Content with Multilingual Queries:

    • All content is in English, but the query is in a different language. In this case, it makes sense to translate the query to English and then proceed with the usual operations as if the query was initially in English.
    • It's essential to instruct the underlying model to translate the response back to the language of the query.
    • The translation can be handled either by the model itself or through external APIs such as Google Translate or DeepL, among others.
  2. Multilingual Content:

    • The content is multilingual, say, available in English and another language. Here, it's logical to translate the query into each language of the content and run one retrieval pass per language. This should depend on the settings, since it would incur higher costs, but it's definitely needed by some, like me :).
    • After synthesizing the response, instruct the underlying model to translate the answer back to the query's language.
    • Everything not mentioned or described here follows the logic of the first point.
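
The two scenarios above can be sketched roughly as follows. This is only an illustration under stated assumptions: `answer_multilingual` and the `translate`, `retrieve`, and `generate` callables are hypothetical placeholders (for a translation API or LLM call, the document search, and the answering model respectively), and the query language is assumed to be known already. Scenario 1 is simply the case where `content_langs` contains only "en".

```python
from typing import Callable, List, Sequence


def answer_multilingual(
    query: str,
    query_lang: str,
    content_langs: Sequence[str],
    translate: Callable[[str, str], str],        # (text, target_lang) -> text
    retrieve: Callable[[str, str], List[str]],   # (query, lang) -> passages
    generate: Callable[[str, List[str], str], str],  # (query, passages, answer_lang) -> answer
) -> str:
    """Run one retrieval pass per content language, then answer in the query's language."""
    passages: List[str] = []
    for lang in content_langs:
        # Translate the query only when the content language differs from it.
        localized = query if lang == query_lang else translate(query, lang)
        for passage in retrieve(localized, lang):
            if passage not in passages:  # drop duplicates, keep first-seen order
                passages.append(passage)
    # The answering model is instructed to respond in the query's language.
    return generate(query, passages, query_lang)
```

Each extra content language adds one translation call and one retrieval pass, which is where the higher cost mentioned above comes from.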

These enhancements would significantly broaden Danswer's utility across diverse user bases and content landscapes, making it a more globally adaptable solution. By addressing the multilingual challenges head-on, we can ensure that Danswer remains competitive and inclusive in handling information retrieval and question answering tasks across different language domains.

gius commented 8 months ago

Thank you, guys, I am exploring the solution! I will have to dig a little deeper into Danswer's architecture to better understand the pipeline (e.g., to adjust the AI thoughts section).

If someone else needs it, my env file looks like this:

DOCUMENT_ENCODER_MODEL=sentence-transformers/all-MiniLM-L12-v2
SKIP_RERANKING=true

m0wer commented 2 weeks ago

This is now covered by the documentation: https://docs.danswer.dev/multilingual_setup