AI Native Data App Development framework with AWEL (Agentic Workflow Expression Language) and Agents
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 48: illegal multibyte sequence #145
Closed
LKk8563 closed 1 year ago
PS D:\AI_DB_GPT\DB-GPT-main\tools> python .\knowlege_init.py
2023-06-02 10:34:50,102 INFO sqlalchemy.engine.Engine SELECT DATABASE()
2023-06-02 10:34:50,102 INFO sqlalchemy.engine.Engine [raw sql] {}
2023-06-02 10:34:50,104 INFO sqlalchemy.engine.Engine SELECT @@sql_mode
2023-06-02 10:34:50,104 INFO sqlalchemy.engine.Engine [raw sql] {}
2023-06-02 10:34:50,105 INFO sqlalchemy.engine.Engine SELECT @@lower_case_table_names
2023-06-02 10:34:50,105 INFO sqlalchemy.engine.Engine [raw sql] {}
{'vector_store_name': 'default'}
No sentence-transformers model found with name D:\AI_DB_GPT\DB-GPT-main\models\text2vec-large-chinese. Creating a new one with MEAN pooling.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\AI_DB_GPT\DB-GPT-main\tools\knowlege_init.py:59 in │
│ │
│ 56 │ vector_store_config = {"vector_store_name": vector_name} │
│ 57 │ print(vector_store_config) │
│ 58 │ kv = LocalKnowledgeInit(vector_store_config=vector_store_config) │
│ ❱ 59 │ vector_store = kv.knowledge_persist(file_path=DATASETS_DIR, append_mode=append_mode) │
│ 60 │ print("your knowledge embedding success...") │
│ 61 │
│ │
│ D:\AI_DB_GPT\DB-GPT-main\tools\knowlege_init.py:35 in knowledge_persist │
│ │
│ 32 │ │ │ model_name=LLM_MODEL_CONFIG["text2vec"], │
│ 33 │ │ │ vector_store_config=self.vector_store_config, │
│ 34 │ │ ) │
│ ❱ 35 │ │ vector_store = kv.knowledge_persist_initialization(append_mode) │
│ 36 │ │ return vector_store │
│ 37 │ │
│ 38 │ def query(self, q): │
│ │
│ D:\AI_DB_GPT\DB-GPT-main\pilot\source_embedding\knowledge_embedding.py:71 in │
│ knowledge_persist_initialization │
│ │
│ 68 │ │ return self.knowledge_embedding_client.similar_search(text, topk) │
│ 69 │ │
│ 70 │ def knowledge_persist_initialization(self, append_mode): │
│ ❱ 71 │ │ documents = self._load_knownlege(self.file_path) │
│ 72 │ │ self.vector_client = VectorStoreConnector( │
│ 73 │ │ │ CFG.VECTOR_STORE_TYPE, self.vector_store_config │
│ 74 │ │ ) │
│ │
│ D:\AI_DB_GPT\DB-GPT-main\pilot\source_embedding\knowledge_embedding.py:83 in _load_knownlege │
│ │
│ 80 │ │ for root, _, files in os.walk(path, topdown=False): │
│ 81 │ │ │ for file in files: │
│ 82 │ │ │ │ filename = os.path.join(root, file) │
│ ❱ 83 │ │ │ │ docs = self._load_file(filename) │
│ 84 │ │ │ │ new_docs = [] │
│ 85 │ │ │ │ for doc in docs: │
│ 86 │ │ │ │ │ doc.metadata = { │
│ │
│ D:\AI_DB_GPT\DB-GPT-main\pilot\source_embedding\knowledge_embedding.py:100 in _load_file │
│ │
│ 97 │ │ │ text_splitter = CHNDocumentSplitter( │
│ 98 │ │ │ │ pdf=True, sentence_size=KNOWLEDGE_CHUNK_SPLIT_SIZE │
│ 99 │ │ │ ) │
│ ❱ 100 │ │ │ docs = loader.load_and_split(text_splitter) │
│ 101 │ │ │ i = 0 │
│ 102 │ │ │ for d in docs: │
│ 103 │ │ │ │ content = markdown.markdown(d.page_content) │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\docum │
│ ent_loaders\base.py:25 in load_and_split │
│ │
│ 22 │ │ │ _text_splitter: TextSplitter = RecursiveCharacterTextSplitter() │
│ 23 │ │ else: │
│ 24 │ │ │ _text_splitter = text_splitter │
│ ❱ 25 │ │ docs = self.load() │
│ 26 │ │ return _text_splitter.split_documents(docs) │
│ 27 │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\docum │
│ ent_loaders\text.py:18 in load │
│ │
│ 15 │ def load(self) -> List[Document]: │
│ 16 │ │ """Load from file path.""" │
│ 17 │ │ with open(self.file_path, encoding=self.encoding) as f: │
│ ❱ 18 │ │ │ text = f.read() │
│ 19 │ │ metadata = {"source": self.file_path} │
│ 20 │ │ return [Document(page_content=text, metadata=metadata)] │
│ 21 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 48: illegal multibyte sequence
Q: Why does this error occur? Do we need to fix the encoding of every knowledge file one by one?
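Note (not an official answer from the maintainers): the last traceback frame shows langchain's TextLoader calling `open(self.file_path, encoding=self.encoding)` with `encoding=None`, so Python falls back to the OS locale codec, which is gbk on a Chinese-locale Windows. If a knowledge file is actually saved as UTF-8 (or contains bytes such as 0x80 that gbk cannot decode), reading fails with exactly this UnicodeDecodeError. Instead of re-saving every file by hand, one workaround is to batch-convert the dataset directory to UTF-8; below is a minimal sketch, assuming the dataset lives under `pilot\datasets` and that plain-text extensions are the ones to convert (both are assumptions, not taken from the repo).

```python
# Sketch only (not from the DB-GPT repo): re-encode knowledge files to UTF-8 so a
# loader that falls back to the OS default codec (gbk here) can read them.
# DATASETS_DIR and the extension filter are assumptions for illustration.
import os

DATASETS_DIR = r"D:\AI_DB_GPT\DB-GPT-main\pilot\datasets"  # assumed dataset location
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5")         # gb18030 is a superset of gbk

def reencode_to_utf8(path: str) -> None:
    """Try a few likely codecs, then rewrite the file as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        print(f"skipped (encoding not recognized): {path}")
        return
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

if __name__ == "__main__":
    for root, _, files in os.walk(DATASETS_DIR):
        for name in files:
            if name.lower().endswith((".txt", ".md", ".csv")):
                reencode_to_utf8(os.path.join(root, name))
```

Alternatively, since TextLoader accepts an explicit encoding, editing `_load_file` in pilot/source_embedding/knowledge_embedding.py to construct the loader with `encoding="utf-8"` should avoid the locale fallback entirely, provided the knowledge files really are UTF-8.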