Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

citation not encoding as UTF8? #298

Closed jamsnrihk closed 11 months ago

jamsnrihk commented 1 year ago

I tried to view a citation, but the citation window does not seem to display the text as UTF-8. It shows very strange characters compared with the original (the original documents are in Chinese). [Screenshot: cap7]
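For readers unfamiliar with how an encoding mismatch produces "strange characters", here is a minimal sketch using only the Python standard library. It assumes one possible cause, UTF-8 bytes being decoded with the wrong codec. The thread does not confirm that this is what happened in AnythingLLM, and the sample string is purely illustrative.

```python
# Illustrative only: shows how UTF-8 bytes read with the wrong codec
# turn Chinese text into unreadable characters (mojibake).
original = "公文写作"  # hypothetical sample text, not taken from the issue

utf8_bytes = original.encode("utf-8")    # each of these characters is 3 bytes
garbled = utf8_bytes.decode("latin-1")   # wrong codec: one character per byte

# The damage is reversible as long as no bytes were lost along the way.
recovered = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(recovered)
```

If the citation text looked like `garbled` here, the fix would be on the decoding side; if it looks like romanized text instead, the problem is more likely in the PDF text extraction itself.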

timothycarambat commented 1 year ago

Looks like PyMuPDF is translating the Mandarin to Pinyin? That's kind of bizarre. We don't have any translation service, so it's weird that it would use the Pinyin over the native Mandarin. Obviously this breaks all the similarity search and LLM responses.

Do you have an example file in all mandarin we can replicate and test against?

jamsnrihk commented 1 year ago

Maybe the "cjk" fonts need to be added in PyMuPDF? For example:

>>> import fitz
>>> font = fitz.Font("cjk")
>>> font.name
'Droid Sans Fallback Regular'
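One way to narrow this down is to check whether the text coming out of the PDF extractor still contains Chinese characters at all. The helper below is a hypothetical diagnostic, not part of AnythingLLM or PyMuPDF. It uses only the standard library and counts characters in the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), which is an assumption that ignores the rarer extension blocks.

```python
def cjk_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are CJK ideographs.

    Hypothetical helper for diagnosing extraction output. Only the basic
    CJK Unified Ideographs block (U+4E00..U+9FFF) is counted.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


# Native Chinese text scores high; romanized or garbled text scores near zero.
print(cjk_ratio("公文 写作"))
print(cjk_ratio("gong wen xie zuo"))
```

In practice the input would be whatever the extractor returned for a page (with PyMuPDF, for example, the string from `page.get_text()`). A ratio near zero for a document known to be Chinese would point at the extraction step rather than at how the citation window renders text.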

jamsnrihk commented 1 year ago

Any updates for this issue?

sweetcard commented 11 months ago

Wait for updates.😄

timothycarambat commented 11 months ago

Can you send me the file that was used (or a similar one)? As you can imagine, I don't have many Mandarin PDFs lying around.

sweetcard commented 11 months ago

You can try this file:

https://www.csb.gov.hk/tc_chi/publications_stat/publication/files/off_correspondence_3ed.pdf

jamsnrihk commented 10 months ago

Please kindly check attached sample file.

Best Regards James


timothycarambat commented 10 months ago

@jamsnrihk I just tested this exact file. The latest version of AnythingLLM now works with it and shows the filename, snippet, and context in the document's written language.

jamsnrihk commented 10 months ago

Thanks, I'll check it out!
