Farzad-R / LLM-Zero-to-Hundred

This repository contains different LLM chatbot projects (RAG, LLM agents, etc.) and well-known techniques for training and fine tuning LLMs.

'latin-1' codec can't encode character #15

Closed: AnthoSocofer closed this issue 5 months ago

AnthoSocofer commented 5 months ago

Hello,

When chatting with the bot, I often encounter this error:

      File "C:\Repos\AI_project\Demo\demo_2024_05_02\RAG_GPT_OpenAI\src\utils\chatbot.py", line 60, in respond
        retrieved_content = ChatBot.clean_references(docs)
      File "C:\Repos\AI_project\Demo\demo_2024_05_02\RAG_GPT_OpenAI\src\utils\chatbot.py", line 117, in clean_references
        content = content.encode('latin1').decode('utf-8', 'ignore')
    UnicodeEncodeError: 'latin-1' codec can't encode character '\uf0b7' in position 383: ordinal not in range(256)

Do you happen to know how to solve this issue?

Farzad-R commented 5 months ago

Hi, I designed the clean_references function to handle typical English text. I call it after retrieving the text from the vectordb and before passing that text to the GPT model. I am not sure what type of text you are retrieving from the vectordb, but whatever it is, the function cannot handle it. A quick and easy fix would be:

  1. Print out the retrieved content from your documents
  2. Give ChatGPT the retrieved text (including all the odd symbols in it), the error you pasted above, and the clean_references function, and ask it to update the function so that it cleans the text properly.
  3. Test the function. Iterate and update it if necessary.
  4. Use the new function in the project. The whole idea is to adjust that function to the type of documents that you want to use in the project.
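
Step 1 above can be narrowed down with a small helper. This is only a sketch: `find_non_latin1` and the sample string are hypothetical names made up for illustration and are not part of the project. It lists every character in a retrieved chunk that the 'latin-1' codec cannot encode, which is exactly the kind of character that triggers the error in this issue.

```python
import unicodedata


def find_non_latin1(text: str) -> list[tuple[int, str, str]]:
    """Return (position, code point, Unicode name) for each character Latin-1 cannot encode."""
    problems = []
    for pos, ch in enumerate(text):
        # Latin-1 only covers code points 0 through 255.
        if ord(ch) > 0xFF:
            # Private-use characters such as '\uf0b7' have no official name.
            name = unicodedata.name(ch, "<unnamed>")
            problems.append((pos, f"U+{ord(ch):04X}", name))
    return problems


# Hypothetical retrieved text containing the character from the traceback.
sample = "Safety rules:\n\uf0b7 Wear a helmet"
for pos, code, name in find_non_latin1(sample):
    print(pos, code, name)
```

Running this over the printed content shows which characters to handle before updating clean_references.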
AnthoSocofer commented 5 months ago

This modification worked well for me:

        # Replace incorrect unicode characters with correct ones
        try:
            # Attempt to encode to latin-1, then decode back to utf-8
            content = content.encode('latin1').decode('utf-8', 'ignore')
        except UnicodeEncodeError:
            # Fallback for text latin-1 cannot encode: this round trip
            # returns the text effectively unchanged instead of raising
            content = content.encode('unicode_escape').decode('unicode_escape')

Thanks, Farzad! By the way, great work. This project is quite inspiring.
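
For anyone landing here later, the modification above can be sketched as a standalone function. This is a minimal sketch under assumptions, not the project's actual clean_references: `fix_mojibake` and `strip_private_use` are hypothetical names, and dropping private-use characters is an optional extra step that was not discussed in this thread.

```python
import unicodedata


def fix_mojibake(content: str) -> str:
    """Repair UTF-8 text that was mis-decoded as Latin-1; leave other text untouched."""
    try:
        return content.encode("latin1").decode("utf-8", "ignore")
    except UnicodeEncodeError:
        # The text contains characters outside Latin-1 (e.g. '\uf0b7'),
        # so the round trip is impossible; return the text unchanged.
        return content


def strip_private_use(content: str) -> str:
    """Optional step (an assumption): drop private-use characters such as '\uf0b7'."""
    return "".join(ch for ch in content if unicodedata.category(ch) != "Co")


print(fix_mojibake("cafÃ©"))                          # mis-decoded text is repaired
print(strip_private_use(fix_mojibake("\uf0b7 item")))  # private-use bullet is dropped
```

One caveat with the encode/decode round trip itself: because of the 'ignore' flag, text that is already correct but contains accented Latin-1 characters (for example "café") loses those characters, so it is worth testing on your own documents.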