eosphoros-ai / DB-GPT

AI Native Data App Development framework with AWEL (Agentic Workflow Expression Language) and Agents
http://docs.dbgpt.cn
MIT License

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 48: illegal multibyte sequence #145

Closed: LKk8563 closed this issue 1 year ago

LKk8563 commented 1 year ago

    PS D:\AI_DB_GPT\DB-GPT-main\tools> python .\knowlege_init.py
    2023-06-02 10:34:50,102 INFO sqlalchemy.engine.Engine SELECT DATABASE()
    2023-06-02 10:34:50,102 INFO sqlalchemy.engine.Engine [raw sql] {}
    2023-06-02 10:34:50,104 INFO sqlalchemy.engine.Engine SELECT @@sql_mode
    2023-06-02 10:34:50,104 INFO sqlalchemy.engine.Engine [raw sql] {}
    2023-06-02 10:34:50,105 INFO sqlalchemy.engine.Engine SELECT @@lower_case_table_names
    2023-06-02 10:34:50,105 INFO sqlalchemy.engine.Engine [raw sql] {}
    {'vector_store_name': 'default'}
    No sentence-transformers model found with name D:\AI_DB_GPT\DB-GPT-main\models\text2vec-large-chinese. Creating a new one with MEAN pooling.
    Traceback (most recent call last):
      File "D:\AI_DB_GPT\DB-GPT-main\tools\knowlege_init.py", line 59, in <module>
        vector_store = kv.knowledge_persist(file_path=DATASETS_DIR, append_mode=append_mode)
      File "D:\AI_DB_GPT\DB-GPT-main\tools\knowlege_init.py", line 35, in knowledge_persist
        vector_store = kv.knowledge_persist_initialization(append_mode)
      File "D:\AI_DB_GPT\DB-GPT-main\pilot\source_embedding\knowledge_embedding.py", line 71, in knowledge_persist_initialization
        documents = self._load_knownlege(self.file_path)
      File "D:\AI_DB_GPT\DB-GPT-main\pilot\source_embedding\knowledge_embedding.py", line 83, in _load_knownlege
        docs = self._load_file(filename)
      File "D:\AI_DB_GPT\DB-GPT-main\pilot\source_embedding\knowledge_embedding.py", line 100, in _load_file
        docs = loader.load_and_split(text_splitter)
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\document_loaders\base.py", line 25, in load_and_split
        docs = self.load()
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\document_loaders\text.py", line 18, in load
        text = f.read()
    UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 48: illegal multibyte sequence

Q: Why does this kind of error occur? Do we need to fix the encoding of each file one by one?
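
The last frame of the traceback shows why this happens: langchain's TextLoader calls open(self.file_path, encoding=self.encoding) with the encoding left unset, so on a Chinese-locale Windows machine Python falls back to the GBK codec even though the knowledge documents are usually UTF-8. You do not have to re-encode every file; passing the codec to the loader is enough. A minimal sketch, assuming the documents are UTF-8 (the file path below is illustrative, and autodetect_encoding only exists on newer langchain releases):

    # Minimal sketch, not the DB-GPT fix itself: load one knowledge file with an
    # explicit encoding instead of the Windows locale default (GBK).
    from langchain.document_loaders import TextLoader

    # Illustrative path; point this at a file under your DATASETS_DIR.
    loader = TextLoader(r"D:\AI_DB_GPT\DB-GPT-main\pilot\datasets\example.md", encoding="utf-8")
    docs = loader.load()  # no longer raises UnicodeDecodeError from the 'gbk' codec

    # Newer langchain releases can also sniff the codec per file:
    # loader = TextLoader(path, autodetect_encoding=True)

In DB-GPT the analogous change would be to pass the same encoding where _load_file in pilot/source_embedding/knowledge_embedding.py constructs its loaders.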

csunny commented 1 year ago

@Aries-ckt

csunny commented 1 year ago

Thanks again. Please try the latest version, v0.4.2.
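
For anyone stuck on an older version: there is no need to convert every document one by one. A quick scan of the knowledge directory shows which files are not valid UTF-8, and only those need re-encoding (or an explicit codec). A minimal sketch; the directory path and extension handling are illustrative, not DB-GPT defaults:

    # Minimal sketch: report knowledge files that are not valid UTF-8, so only those
    # need attention before running knowlege_init.py again.
    import os

    DATASET_DIR = r"D:\AI_DB_GPT\DB-GPT-main\pilot\datasets"  # illustrative path

    for root, _, files in os.walk(DATASET_DIR):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as f:
                    f.read().decode("utf-8")  # raises UnicodeDecodeError on bad bytes
            except UnicodeDecodeError as exc:
                print(f"not UTF-8: {path} ({exc})")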