Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.
https://useanything.com
MIT License

[FEAT]: Option to convert LanceDB output to Big5 text encoding before it is sent to the LLM #1673

Open csb47jk opened 3 weeks ago

csb47jk commented 3 weeks ago

Description

I uploaded Traditional Chinese files to the built-in LanceDB. Retrieval found a passage in the attached file that likely contains the answer, but the LLM's response did not actually provide it.

  1. I confirmed that the data is present in LanceDB, for example under \storage\lancedb\security_2.lance\data. For Chinese-language file content, it may be more correct to convert the data in LanceDB to Big5 before giving it to the LLM.

Regarding the text encoding issue, would it be possible to provide a fix, or an encoding option, that is applied to the retrieved results before they are passed to the LLM?
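To make the encoding question concrete, here is a minimal Python sketch of what converting between UTF-8 and Big5 does to a piece of Traditional Chinese text. The sample string and the helper names (`to_big5_bytes`, `decode_best_effort`) are hypothetical and only for illustration. This does not reflect how AnythingLLM or LanceDB store or pass text internally; it only shows that the bytes differ between encodings while the decoded string stays the same.

```python
# Hypothetical sample text; not taken from the files in this issue.
text = "資訊安全政策"

utf8_bytes = text.encode("utf-8")
big5_bytes = text.encode("big5")

# The byte sequences differ, but both decode back to the same Unicode string.
assert utf8_bytes != big5_bytes
assert utf8_bytes.decode("utf-8") == big5_bytes.decode("big5") == text


def to_big5_bytes(s: str) -> bytes:
    """Encode to Big5; characters Big5 cannot represent become '?'."""
    return s.encode("big5", errors="replace")


def decode_best_effort(raw: bytes, encodings=("utf-8", "big5")) -> str:
    """Try each candidate encoding in order and return the first clean decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


print(decode_best_effort(big5_bytes) == decode_best_effort(utf8_bytes))
```

If a check like `decode_best_effort` shows that the source files are Big5 on disk, decoding them correctly at upload time may matter more than re-encoding the retrieved text afterwards, since a wrongly decoded file would already be garbled before it is embedded.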

timothycarambat commented 3 weeks ago

Interesting, what were the scores on the citations? If the documents and the prompt both use the same encoding and the same embedder, then I wonder what difference it would make in retrieval.
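For context on what a citation score can tell you, here is a rough sketch that assumes the score is a cosine-style similarity between the prompt embedding and each chunk embedding. That is an assumption; this thread does not say how the score is computed. The vectors are made-up toy values, not real embedder output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values).
query = [0.2, 0.1, 0.7]
chunk_relevant = [0.25, 0.05, 0.65]
chunk_unrelated = [0.9, 0.1, 0.0]

print(cosine_similarity(query, chunk_relevant))
print(cosine_similarity(query, chunk_unrelated))
```

Under that assumption, low scores on the chunk that actually contains the answer would point at the embedder, while high scores with a poor final answer would point at what happens after retrieval.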

Has pre-processing the data into Big5 encoding been shown to improve results for non-Latin charsets?

Also, what embedder are you using? The default is all-MiniLM-L6-v2, which IMO has done quite poorly on non-Latin text.