Open csb47jk opened 3 weeks ago
Interesting, what were the scores on the citations? If the documents and the prompt both use the same encoding and the same embedder, then I wonder what the difference in retrieval would be.
Has pre-processing the data into Big5 encoding been proven to improve results on non-Latin character sets?
Also, what embedder are you using? The default is all-MiniLM-L6-v2, which IMO has done quite poorly on non-Latin texts.
Description
I uploaded Traditional Chinese files to the built-in LanceDB. Retrieval found the attached file as a citation containing a possible answer, but the LLM's response did not include that answer.
Regarding the text encoding issue: is it possible to provide a fix, or encoding options, that are applied before the retrieved results are passed to the LLM?
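
To make the encoding question concrete, here is a minimal Python sketch of normalizing file bytes to a Unicode string before they are embedded. It is not taken from the project's code: the helper name `decode_to_text` and the candidate list (UTF-8 first, then Big5) are assumptions for illustration only.

```python
def decode_to_text(raw: bytes, encodings=("utf-8", "big5")) -> str:
    """Hypothetical helper: decode `raw` with the first candidate encoding that succeeds."""
    for name in encodings:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid byte sequences.
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode input with any of: {encodings}")


# The same Traditional Chinese text stored in two different encodings.
sample = "中文"
from_big5 = decode_to_text(sample.encode("big5"))
from_utf8 = decode_to_text(sample.encode("utf-8"))

# After decoding, both inputs are the same Unicode string.
print(from_big5 == from_utf8)
```

UTF-8 is tried first because its decoder rejects most byte sequences produced by other encodings, whereas a Big5 decoder is more permissive. If documents and prompts are both decoded this way, the embedder sees identical text regardless of the original file encoding, which would point toward the embedder rather than the encoding as the source of any remaining retrieval difference.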