GaiZhenbiao / ChuanhuChatGPT

GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
https://huggingface.co/spaces/JohnSmith9982/ChuanhuChatGPT
GNU General Public License v3.0

[Local deployment]: Building the index fails after uploading a file #561

Closed GwendolynKoh closed 1 year ago

GwendolynKoh commented 1 year ago

Is there an existing report and answer for this?

Is this a question about proxy configuration?

Error description

(screenshots of the error)

Steps to reproduce

1. It took me a whole day, but I finally figured out how to delete llama_index==0.5.5 from requirements.txt, then ran pip install and upgraded everything, which got rid of the langchain problem covered in the FAQ for llama_index. (screenshot)
2. Satisfied, I finally tried to get some sleep at 5 a.m. After waking up, the first thing I did was try to open the door to this new world. It did not open.

Error log

--- Logging error ---
Traceback (most recent call last):
  File "F:\ChuanhuChatGPT-main\modules\llama_func.py", line 53, in get_documents
    pdftext = parse_pdf(filepath, two_column).text
  File "F:\ChuanhuChatGPT-main\modules\pdf_func.py", line 87, in parse_pdf
    title, user_info, first_page = get_title_with_cropped_page(pdf.pages[0])
  File "F:\ChuanhuChatGPT-main\modules\pdf_func.py", line 64, in get_title_with_cropped_page
    user_info = [i["text"] for i in extract_words(first_page.within_bbox((x0,title_bottom,x1,top)))]
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\pdfplumber\page.py", line 334, in within_bbox
    return CroppedPage(
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\pdfplumber\page.py", line 464, in __init__
    test_proposed_bbox(self.bbox, parent_page.bbox)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\pdfplumber\page.py", line 428, in test_proposed_bbox
    bbox_area = utils.calculate_area(bbox)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\pdfplumber\utils\geometry.py", line 70, in calculate_area
    raise ValueError(f"{bbox} has a negative width or height.")
ValueError: (0, 719.41908, 595.276, 0) has a negative width or height.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:\ChuanhuChatGPT-main\modules\llama_func.py", line 117, in construct_index
    documents = get_documents(file_src)
  File "F:\ChuanhuChatGPT-main\modules\llama_func.py", line 59, in get_documents
    pdftext += page.extract_text()
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_cmap.py", line 194, in parse_to_unicode
    cm = prepare_cm(ft)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_cmap.py", line 207, in prepare_cm
    tu = ft["/ToUnicode"]
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\generic\_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\generic\_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_reader.py", line 1269, in get_object
    retval = self._encryption.decrypt_object(
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_encryption.py", line 761, in decrypt_object
    return cf.decrypt_object(obj)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_encryption.py", line 185, in decrypt_object
    obj._data = self.stmCrypt.decrypt(obj._data)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\PyPDF2\_encryption.py", line 147, in decrypt
    raise DependencyError("PyCryptodome is required for AES algorithm")
PyPDF2.errors.DependencyError: PyCryptodome is required for AES algorithm

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:\anaconda3\envs\chatgpt\lib\logging\__init__.py", line 1100, in emit
    msg = self.format(record)
  File "F:\anaconda3\envs\chatgpt\lib\logging\__init__.py", line 943, in format
    return fmt.format(record)
  File "F:\anaconda3\envs\chatgpt\lib\logging\__init__.py", line 678, in format
    record.message = record.getMessage()
  File "F:\anaconda3\envs\chatgpt\lib\logging\__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "F:\anaconda3\envs\chatgpt\lib\threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "F:\anaconda3\envs\chatgpt\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\gradio\utils.py", line 490, in async_iteration
    return next(iterator)
  File "F:\ChuanhuChatGPT-main\modules\chat_func.py", line 289, in predict
    index = construct_index(openai_api_key, file_src=files)
  File "F:\ChuanhuChatGPT-main\modules\llama_func.py", line 131, in construct_index
    logging.error("索引构建失败!", e)
Message: '索引构建失败!'
Arguments: (DependencyError('PyCryptodome is required for AES algorithm'),)
PyCryptodome is required for AES algorithm
2023-04-07 12:13:09,084 [INFO] [chat_func.py:291] 索引构建完成,获取回答中……
F:\anaconda3\envs\chatgpt\lib\site-packages\langchain\llms\openai.py:608: UserWarning: You are trying to use a chat model. This way of initializing it is no longer supported. Instead, please use: `from langchain.chat_models import ChatOpenAI`
  warnings.warn(
Traceback (most recent call last):
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\gradio\routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\gradio\blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\gradio\blocks.py", line 929, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread    return await future
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "F:\anaconda3\envs\chatgpt\lib\site-packages\gradio\utils.py", line 490, in async_iteration
    return next(iterator)
  File "F:\ChuanhuChatGPT-main\modules\chat_func.py", line 298, in predict
    query_object = GPTVectorStoreIndexQuery(index.index_struct, service_context=service_context, similarity_top_k=5, vector_store=index._vector_store, docstore=index._docstore)
AttributeError: 'NoneType' object has no attribute 'index_struct'

Runtime environment

- OS: Windows 11
- Browser: Firefox
- Gradio version: 3.24.1
- Python version: 3.10.10

Additional notes

The exact steps were:
1. Run python .\ChuanhuChatbot.py
2. Copy and paste the API key
3. Upload the file to be indexed
4. Type "你好"
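A note on the log above: the "--- Logging error ---" at the top is only a side effect of logging.error("索引构建失败!", e) passing the exception as an extra positional argument without a %s placeholder. The real failure is the PyPDF2.errors.DependencyError: PyCryptodome is required for AES algorithm further down, which suggests the uploaded PDF is AES-encrypted and PyPDF2 cannot decrypt it without the pycryptodome package. A minimal check for that dependency (just a sketch, run inside the same conda environment as the log) would be:

# Sketch: check whether pycryptodome (imported as "Crypto") is available,
# since PyPDF2 needs it to decrypt AES-encrypted PDFs.
try:
    from Crypto.Cipher import AES  # provided by the pycryptodome package
    print("pycryptodome is available; PyPDF2 can decrypt AES-encrypted PDFs.")
except ImportError:
    print("pycryptodome is missing; install it with: pip install pycryptodome")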

1205129045x commented 1 year ago

Has this been solved? I'm now getting this:

Traceback (most recent call last):
  File "C:\Users\12051\AppData\Roaming\Python\Python38\site-packages\gradio\routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\12051\AppData\Roaming\Python\Python38\site-packages\gradio\blocks.py", line 1069, in process_api
    result = await self.call_function(
  File "C:\Users\12051\AppData\Roaming\Python\Python38\site-packages\gradio\blocks.py", line 892, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\12051\AppData\Roaming\Python\Python38\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\12051\AppData\Roaming\Python\Python38\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\12051\AppData\Roaming\Python\Python38\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\12051\AppData\Roaming\Python\Python38\site-packages\gradio\utils.py", line 549, in async_iteration
    return next(iterator)
  File "C:\Users\12051\Desktop\ChuanhuChatGPT\modules\chat_func.py", line 270, in predict
    from llama_index.indices.vector_store.base_query import GPTVectorStoreIndexQuery
ModuleNotFoundError: No module named 'llama_index.indices.vector_store.base_query'
GwendolynKoh commented 1 year ago

Has this been solved? I'm now getting this:

ModuleNotFoundError: No module named 'llama_index.indices.vector_store.base_query'

Your problem is spelled out explicitly in the FAQ, so I won't repeat it here. It is not the same issue at all; please don't derail this thread.

GwendolynKoh commented 1 year ago

(screenshot) Uploading an image produces nothing but garbled output; I suspect the problem is in the part shown in the screenshot above? After uploading an image, this is all I get to see:

F:\anaconda3\envs\chatgpt\lib\site-packages\langchain\llms\openai.py:608: UserWarning: You are trying to use a chat model. This way of initializing it is no longer supported. Instead, please use: from langchain.chat_models import ChatOpenAI

GaiZhenbiao commented 1 year ago

Are you using the latest code? This error should come from llama_index 0.5.5; the latest code has already been adapted for it.
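A quick way to confirm which llama_index release the environment actually has installed (a sketch; the package name is the one pinned in requirements.txt) is:

# Sketch: print the installed llama_index version to check whether the
# environment is still on the 0.5.5 release blamed above.
from importlib.metadata import PackageNotFoundError, version

try:
    print("llama_index version:", version("llama_index"))
except PackageNotFoundError:
    print("llama_index is not installed in this environment.")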

GwendolynKoh commented 1 year ago

Are you using the latest code? This error should come from llama_index 0.5.5; the latest code has already been adapted for it.

(screenshots)

I'm absolutely certain it's the latest code; even the Hugging Face Space gives the same error... there's no solution at this point.

GwendolynKoh commented 1 year ago

(screenshot) Still not working.

GwendolynKoh commented 1 year ago

(screenshot) That's ChatGPT's answer, but the problem is that my PDF is just a normal A4 paper....

GwendolynKoh commented 1 year ago

(screenshot) After switching to a PDF containing only a single PowerPoint slide, it works. Could you explain what exactly the PDF restrictions are, since I want to import long, text-heavy papers? Uploading images still fails, though. (screenshots)

Keldos-Li commented 1 year ago

About PDFs:

About images:

GwendolynKoh commented 1 year ago

Could you tell me what exactly the PDF requirements are? At this point I'm practically praying for it to read a paper PDF, and it has never worked, no matter how long or short..... Help!!!

Keldos-Li commented 1 year ago

Try some sibling projects first... there are quite a few similar open-source ones. See whether they work; if they do and we don't, the problem is probably on our side.

zhangchen116 commented 1 year ago

Try a small modification to the get_title_with_cropped_page function in modules/pdf_func.py:

def get_title_with_cropped_page(first_page):
    title = []  # collected title words
    x0, top, x1, bottom = first_page.bbox  # full page bounding box

    for word in extract_words(first_page):
        word = SimpleNamespace(**word)

        if word.size >= 14:
            title.append(word.text)
            title_bottom = word.bottom  # kept from the original version, no longer used for cropping
        elif word.text == "Abstract":  # locate the "Abstract" heading
            title_top = word.top  # kept from the original version, no longer used for cropping

    # Pass the full-page bbox so within_bbox never receives a negative-height box
    # (within_bbox: fully included; crop: partially included)
    user_info = [i["text"] for i in extract_words(first_page.within_bbox((x0, top, x1, bottom)))]
    return title, user_info, first_page.within_bbox((x0, top, x1, bottom))
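A quick smoke test for the patched function (a sketch: the file name is a placeholder, and it assumes you run it from the repo root so modules/pdf_func.py is importable) could be:

# Sketch: open the previously failing PDF with pdfplumber and run the
# patched get_title_with_cropped_page on its first page.
import pdfplumber
from modules.pdf_func import get_title_with_cropped_page

with pdfplumber.open("paper.pdf") as pdf:  # placeholder file name
    title, user_info, cropped = get_title_with_cropped_page(pdf.pages[0])
    print("title words:", title)
    print("returned page bbox:", cropped.bbox)

Since the full-page bbox always has positive height, within_bbox can no longer raise the "negative width or height" ValueError from the log above, at the cost of no longer trimming the title/author block from the first page.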
ca1123 commented 1 year ago

Right, some PDFs actually do contain text, but it complains that it can't find the root and then just dies. What is the industry-standard approach for extracting text from a PDF?
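For context, this repo's own extraction path is pdfplumber first (modules/pdf_func.py), falling back to PyPDF2's page.extract_text() in modules/llama_func.py. A minimal standalone sketch of the pdfplumber part, with a placeholder file name, looks like this:

# Sketch: page-by-page plain-text extraction with pdfplumber,
# the same library modules/pdf_func.py already uses.
import pdfplumber

def extract_pdf_text(filepath):
    pages = []
    with pdfplumber.open(filepath) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or "")  # extract_text may return None/empty
    return "\n".join(pages)

print(extract_pdf_text("paper.pdf"))  # placeholder file name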