chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly langchain-ChatGLM): a local-knowledge-based RAG and Agent application built with Langchain and LLMs such as ChatGLM, Qwen, and Llama.
Apache License 2.0

Question about uploading files to the knowledge base and vectorizing them #3308

Closed: 302658980 closed this issue 5 months ago

302658980 commented 7 months ago

2024-03-13 18:32:46,781 - utils.py[line:286] - INFO: RapidOCRPDFLoader used for /root/autodl-tmp/Langchain-Chatchat/knowledge_base/doc-1701875989560164354/content/526062814771613696-百度推广签约合同(2).pdf
RapidOCRPDFLoader context page index: 9: 100%|██████████| 10/10 [03:37<00:00, 21.76s/it]
2024-03-13 18:36:39,231 - _http.py[line:271] - WARNING: '(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /None/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f217869f280>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 86bc88a8-6696-42df-b11a-a472db0dc4b1)')' thrown while requesting HEAD https://huggingface.co/None/resolve/main/tokenizer_config.json
2024-03-13 18:36:49,288 - _http.py[line:271] - WARNING: '(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /None/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f217869f760>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 2d7df5d5-f601-4210-83c5-2764758afb2c)')' thrown while requesting HEAD https://huggingface.co/None/resolve/main/config.json
We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like None is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
Document split example: page_content='《百度推广服务协议》\n\n0447763\n\n合同编号:\n\n深圳大拿智能设备有限公司\n\n方:\n\n方:百度国际科技(深圳)有限公司\n\n法定代表人:\n\n法定代表人:崔珊珊\n\n联系地址:\n\n联系地址:深圳市南山区粤海街道滨海社区海天一\n\n路8号百度国际大厦西塔楼3层\n\n联系人:\n\n联系人:\n\n电话:\n\n电话:售前:4008060018、售后:4009200000\n\n电子邮件:\n\n电子邮箱:zxhelp@sh.baidu.com\n\n开户行:\n\n开户行:招商银行深圳分行创维大厦支行\n\n账号:' metadata={'source': '/root/autodl-tmp/Langchain-Chatchat/knowledge_base/doc-1701875989560164354/content/526062814771613696-百度推广签约合同(2).pdf'}
2024-03-13 18:36:49,295 - faiss_cache.py[line:80] - INFO: loading vector store in 'doc-1701875989560164354/vector_store/bge-large-zh' from disk.
2024-03-13 18:36:49,453 - SentenceTransformer.py[line:66] - INFO: Load pretrained SentenceTransformer: /root/model/bge-large-zh
2024-03-13 18:36:53,622 - loader.py[line:54] - INFO: Loading faiss with AVX2 support.
2024-03-13 18:36:53,654 - loader.py[line:56] - INFO: Successfully loaded faiss with AVX2 support.
Batches: 100%|██████████| 4/4 [00:01<00:00, 2.56it/s]
2024-03-13 18:36:56,038 - faiss_cache.py[line:24] - INFO: Vector store ('doc-1701875989560164354', 'bge-large-zh') saved to disk

Do the errors in the middle of the log affect anything?

dosubot[bot] commented 7 months ago

Hey @302658980! Great to see you again. I hope everything has been going well!

Based on the logs you provided, the error occurs when the connection to 'https://huggingface.co' times out while trying to load files. This could be due to network issues or a problem on the 'huggingface.co' side. The error does affect the vectorization process, because it needs to load a pretrained language model and tokenizer from 'huggingface.co'.

In Langchain-Chatchat, the Hugging Face Transformers library is used to load and run pretrained language models, which are usually hosted on 'huggingface.co'. When a model is loaded and there is no cached copy of the model files locally, the library tries to download them from 'huggingface.co'. In your case the connection to 'huggingface.co' failed, so loading the model failed and the vectorization process could not proceed.
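If the model files already exist on the machine (the later log lines show the embedding model being loaded from /root/model/bge-large-zh), you can also tell the Hugging Face libraries not to contact the Hub at all. The following is a minimal sketch, assuming the local directory already contains config.json and the tokenizer files; the environment variables and the local_files_only flag are standard transformers/huggingface_hub options, not Langchain-Chatchat-specific settings:

import os

# Force transformers and huggingface_hub to use only locally available files,
# so no HEAD requests are sent to huggingface.co at load time.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"

from transformers import AutoModel, AutoTokenizer

# Load from a local directory instead of a hub repo id (path taken from the log above).
local_model_dir = "/root/model/bge-large-zh"
tokenizer = AutoTokenizer.from_pretrained(local_model_dir, local_files_only=True)
model = AutoModel.from_pretrained(local_model_dir, local_files_only=True)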

One way to solve this is to make sure your server can reach 'https://huggingface.co'. If your network environment is restricted, you can pre-download the required model files in an environment with full internet access and then move them to your server. The steps are as follows; a script-based alternative is sketched after the list.

  1. In an environment with full internet access, run the same model-loading code that your application uses. For example, if you are using the 'bert-base-chinese' model, you can run:

from transformers import BertModel

# Downloads the model files on first use and caches them in the local Hugging Face cache directory.
model = BertModel.from_pretrained('bert-base-chinese')

This downloads the model files and saves them to the local cache directory (by default '~/.cache/huggingface/transformers').

  2. Locate the model files in the cache directory and copy them to the same location on your server.

  3. Run your code on the server; it should now load the model from the local cache without needing to reach 'https://huggingface.co'.
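As an alternative to copying the cache directory by hand, the model can be downloaded into a plain folder on a machine with internet access and the folder then transferred to the server. This is a minimal sketch using huggingface_hub; the repo id 'BAAI/bge-large-zh' and the target path match the embedding model seen in the log, but should be adapted to whatever model your configuration actually uses:

from huggingface_hub import snapshot_download

# Download all files of the embedding model repo into an ordinary directory
# (not the symlinked cache layout), so the folder can simply be copied to the
# offline server afterwards.
local_dir = snapshot_download(
    repo_id="BAAI/bge-large-zh",
    local_dir="/root/model/bge-large-zh",
)
print(f"Model files downloaded to: {local_dir}")

Once the folder is in place on the server, point the embedding model path in your Langchain-Chatchat model configuration (for example the MODEL_PATH entry in configs/model_config.py, if that matches your version) at that directory so the SentenceTransformer is loaded from disk.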

If the problem persists, more detailed log output may be needed for further diagnosis.


github-actions[bot] commented 6 months ago

This issue has been marked as stale because it has had no activity for more than 30 days.