Closed jamsnrihk closed 11 months ago
It looks like PyMuPDF is converting the Mandarin text to Pinyin? That's kind of bizarre. We don't have any translation service, so it's weird that it would use Pinyin over the native Mandarin. Obviously this breaks all the similarity search and LLM responses.
Do you have an example file written entirely in Mandarin that we can replicate and test against?
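For anyone trying to reproduce this, a quick way to tell whether extracted text came out as native characters or as romanized Latin text is to measure the share of CJK ideographs among its letters. This is only a minimal sketch: the helper names `is_cjk`, `cjk_ratio`, and `looks_romanized`, and the 0.3 threshold, are assumptions made up for illustration and are not part of AnythingLLM or PyMuPDF.

```python
def is_cjk(ch: str) -> bool:
    """Return True if ch falls in one of the common CJK ideograph blocks."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF          # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF       # CJK Extension A
        or 0xF900 <= cp <= 0xFAFF       # CJK Compatibility Ideographs
        or 0x20000 <= cp <= 0x2A6DF     # CJK Extension B
    )


def cjk_ratio(text: str) -> float:
    """Fraction of alphabetic characters in text that are CJK ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_cjk(ch) for ch in letters) / len(letters)


def looks_romanized(text: str, threshold: float = 0.3) -> bool:
    """Heuristic (assumed threshold): True if too few letters are CJK."""
    return cjk_ratio(text) < threshold
```

Running `looks_romanized` on the text extracted from a Mandarin PDF should return `False` if the native characters survived extraction, and `True` if they were replaced by Pinyin.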
Maybe we need to add the "cjk" font in PyMuPDF? For example:
    import fitz
    font = fitz.Font("cjk")
    font.name  # 'Droid Sans Fallback Regular'
Any updates for this issue?
Waiting for updates. 😄
Can you send me the file that was used (or one similar)? As you can imagine, I don't have many Mandarin PDFs lying around.
You can try this file:
https://www.csb.gov.hk/tc_chi/publications_stat/publication/files/off_correspondence_3ed.pdf
Please kindly check attached sample file.
Best Regards James
@jamsnrihk I just tested this exact file, and the latest version of AnythingLLM now works with it: it shows the filename, snippet, and context in the document's written language.
Thanks, I'll check it out!
I tried to view the citation, but the window does not seem to decode the text as UTF-8: it shows very strange characters compared with the original (the original document is in Chinese).
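If the strange characters are caused by UTF-8 bytes being decoded with a single-byte codec (this thread does not confirm that this is the cause), the effect can be reproduced in a few lines. The sketch below is illustrative only: the function names are hypothetical, and `latin-1` is assumed as the wrong codec because it maps every byte and so makes the round trip lossless.

```python
def garble(text: str, wrong_codec: str = "latin-1") -> str:
    """Simulate mojibake: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def try_repair(garbled: str, wrong_codec: str = "latin-1") -> str:
    """Reverse the mistake if possible; otherwise return the input unchanged."""
    try:
        return garbled.encode(wrong_codec).decode("utf-8")
    except UnicodeError:
        # Text was not produced by this particular mis-decoding.
        return garbled


original = "中文"
broken = garble(original)      # each 3-byte character becomes 3 Latin-1 characters
fixed = try_repair(broken)     # recovers the original string
```

If citations show this kind of pattern (roughly three odd Latin characters per Chinese character), the text was probably stored correctly and is only being decoded with the wrong charset when displayed.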