QwenLM / Qwen-Agent

Agent framework and applications built upon Qwen>=2.0, featuring Function Calling, Code Interpreter, RAG, and Chrome extension.
https://pypi.org/project/qwen-agent/

After running assistant_rag.py, uploaded Word files, text files, and web links are all not recognized. Am I doing something wrong? #249

Open wuybo opened 3 months ago

wuybo commented 3 months ago


from qwen_agent.agents import Assistant
from qwen_agent.llm import get_chat_model
from qwen_agent.gui import WebUI

import llm_public  # the poster's own helper module

llm = llm_public.llm_openai()  # use a local model through the OpenAI interface


def test():
    bot = Assistant(llm=llm)

    # One user turn that carries both a text question and a file URL.
    messages = [{'role': 'user', 'content': [{'text': 'If someone chokes to death while eating an egg, can they sue the chicken farmer?'}, {'file': 'https://jkwwt.acftu.org/jkwwtzcfg/202203/P020220325353963962863.pdf'}]}]
    for rsp in bot.run(messages):
        print(rsp)


def app_gui():
    # Define the agent
    bot = Assistant(llm=llm,
                    name='Assistant',
                    description='Answers using RAG retrieval. Supported file types: PDF/Word/PPT/TXT/HTML.')
    chatbot_config = {
        'prompt.suggestions': [
            {
                'text': 'What is the first sentence of Chapter 2?'
            },
        ]
    }
    WebUI(bot, chatbot_config=chatbot_config).run()


if __name__ == '__main__':
    # test()
    app_gui()

JianxinMa commented 3 months ago

Could you paste the log from the command-line terminal? We'll check whether it shows any errors.

wuybo commented 3 months ago

D:\pytho\Qwen-Agent-main\Scripts\python.exe D:\Backup\Downloads\Qwen-Agent-main\examples\assistant_rag.py
Running on local URL: http://127.0.0.1:7861

To create a public link, set share=True in launch().
2024-07-04 16:44:13,588 - simple_doc_parser.py - 324 - INFO - Read parsed C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf from cache.
2024-07-04 16:44:13,588 - doc_parser.py - 114 - INFO - Start chunking C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf (开发者权利声明1.pdf)...
2024-07-04 16:44:13,589 - doc_parser.py - 132 - INFO - Finished chunking C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf (开发者权利声明1.pdf). Time spent: 0.0010001659393310547 seconds.
2024-07-04 16:44:47,523 - utils.py - 69 - ERROR - Traceback (most recent call last):
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 225, in get_file_type
    content = read_text_from_file(path)
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 186, in read_text_from_file
    file_content = file.read()
  File "D:\Programs\python3.10\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 47: invalid start byte

2024-07-04 16:44:47,562 - simple_doc_parser.py - 324 - INFO - Read parsed C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf from cache.
2024-07-04 16:44:47,563 - doc_parser.py - 114 - INFO - Start chunking C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf (开发者权利声明1.pdf)...
2024-07-04 16:44:47,563 - doc_parser.py - 132 - INFO - Finished chunking C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf (开发者权利声明1.pdf). Time spent: 0.0 seconds.
2024-07-04 16:46:35,746 - utils.py - 69 - ERROR - Traceback (most recent call last):
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 225, in get_file_type
    content = read_text_from_file(path)
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 186, in read_text_from_file
    file_content = file.read()
  File "D:\Programs\python3.10\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 47: invalid start byte

2024-07-04 16:46:35,748 - utils.py - 69 - ERROR - Traceback (most recent call last):
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 225, in get_file_type
    content = read_text_from_file(path)
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 186, in read_text_from_file
    file_content = file.read()
  File "D:\Programs\python3.10\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

2024-07-04 16:46:35,787 - simple_doc_parser.py - 324 - INFO - Read parsed C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf from cache.
2024-07-04 16:46:35,787 - doc_parser.py - 114 - INFO - Start chunking C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf (开发者权利声明1.pdf)...
2024-07-04 16:46:35,787 - doc_parser.py - 132 - INFO - Finished chunking C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf (开发者权利声明1.pdf). Time spent: 0.0 seconds.
2024-07-04 16:50:10,261 - utils.py - 69 - ERROR - Traceback (most recent call last):
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 225, in get_file_type
    content = read_text_from_file(path)
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 186, in read_text_from_file
    file_content = file.read()
  File "D:\Programs\python3.10\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 47: invalid start byte

2024-07-04 16:50:10,262 - utils.py - 69 - ERROR - Traceback (most recent call last):
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 225, in get_file_type
    content = read_text_from_file(path)
  File "D:\pytho\Qwen-Agent-main\lib\site-packages\qwen_agent\utils\utils.py", line 186, in read_text_from_file
    file_content = file.read()
  File "D:\Programs\python3.10\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

2024-07-04 16:50:15,078 - split_query.py - 82 - INFO - Extracted info from query: {"information": ["https://www.gov.cn/xinwen/2020-06/01/content_5516649.htm

wuybo commented 3 months ago

2024-07-04 17:23:06,002 - simple_doc_parser.py - 326 - INFO - Start parsing C:\Users\Administrator\AppData\Local\Temp\gradio\d8d0bc75266a5fc0dc442eb81b70bbabe1301cde\民法典.pdf...
2024-07-04 17:23:19,483 - simple_doc_parser.py - 365 - INFO - Finished parsing C:\Users\Administrator\AppData\Local\Temp\gradio\d8d0bc75266a5fc0dc442eb81b70bbabe1301cde\民法典.pdf. Time spent: 13.480265617370605 seconds.
2024-07-04 17:23:19,541 - doc_parser.py - 114 - INFO - Start chunking C:\Users\Administrator\AppData\Local\Temp\gradio\d8d0bc75266a5fc0dc442eb81b70bbabe1301cde\民法典.pdf (民法典.pdf)...
2024-07-04 17:23:19,596 - doc_parser.py - 132 - INFO - Finished chunking C:\Users\Administrator\AppData\Local\Temp\gradio\d8d0bc75266a5fc0dc442eb81b70bbabe1301cde\民法典.pdf (民法典.pdf). Time spent: 0.05436825752258301 seconds.

JianxinMa commented 3 months ago

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 47: invalid start byte

It looks like the file is not UTF-8 encoded. It may be a GBK-encoded Chinese document on the Windows platform. I'll see whether I can reproduce and fix this.
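To illustrate the diagnosis above, here is a minimal sketch in plain Python (it does not use Qwen-Agent) showing how this class of error arises. The helper name `try_utf8` and the sample string are made up for illustration.

```python
# Hypothetical helper for illustration only; it is not part of Qwen-Agent.
def try_utf8(data: bytes) -> str:
    """Return 'ok' if data is valid UTF-8, else describe the first failure."""
    try:
        data.decode('utf-8')
        return 'ok'
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the decoder rejected.
        return f'byte 0x{data[exc.start]:02x} at position {exc.start}: {exc.reason}'


sample = '中文文档'
utf8_bytes = sample.encode('utf-8')  # what a UTF-8 text file holds on disk
gbk_bytes = sample.encode('gbk')     # what a GBK text file holds on disk

# Signature found at the start of legacy binary Office files (.doc/.xls/.ppt).
ole2_header = bytes.fromhex('d0cf11e0a1b11ae1')

print(try_utf8(utf8_bytes))      # valid UTF-8
print(try_utf8(gbk_bytes))       # GBK bytes are rejected by the UTF-8 decoder
print(try_utf8(ole2_header))     # binary data is rejected as well
print(gbk_bytes.decode('gbk'))   # the same bytes decode fine as GBK
```

The second error in the log (0xd0 at position 0) is at least consistent with a binary file such as a legacy .doc being read as text, since those files begin with the bytes D0 CF 11 E0. The thread does not confirm which upload triggered it, so treat that as a guess.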

wuybo commented 3 months ago

This PDF uploads without any error, but it still isn't read successfully. Is there a limit on file size? PDF download link: https://jkwwt.acftu.org/jkwwtzcfg/202203/P020220325353963962863.pdf

JianxinMa commented 3 months ago

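For some reason I can't reproduce this problem on my Windows machine.

Even so, I have added handling for non-UTF-8 files (for example GBK) on the main branch. If you're interested, try pulling and installing the latest main branch and see whether it works.

Related commit: https://github.com/QwenLM/Qwen-Agent/commit/d9a37753f6dc86bbc33dd316a86b6fd1e4290e5c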
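For readers who want a feel for what handling non-UTF-8 files can look like, below is a minimal sketch of an encoding fallback. It is not the code from the commit above. The function name `read_text_with_fallback` and the candidate encoding list are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

# Assumed candidate list; the encodings tried by the real fix are not stated in this thread.
CANDIDATE_ENCODINGS = ('utf-8', 'gbk')


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS) -> str:
    """Decode a text file by trying each candidate encoding in order."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: never raise, mark undecodable bytes with U+FFFD instead.
    return data.decode('utf-8', errors='replace')


# Usage: write a GBK-encoded sample file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'gbk_sample.txt'
    sample_path.write_bytes('第二章 总则'.encode('gbk'))
    recovered = read_text_with_fallback(sample_path)

print(recovered)
```

Trying UTF-8 first matters because UTF-8 is strict and rejects most GBK text, whereas GBK will often decode bytes from other encodings without raising and silently produce wrong characters. A detection library may be a better choice when many encodings are possible.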

wuybo commented 3 months ago

image

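This is still the assistant_rag.py example. At first I thought the cached files under C:\Users\Administrator\AppData\Local\Temp\gradio were the cause, so I deleted the files in that directory and uploaded a small PDF again, but the result is the same. I don't know whether anyone else has run into this. I pulled the latest version of Qwen-Agent.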

Log:
D:\pytho\Qwen-Agent-main\Scripts\python.exe D:\Backup\Downloads\Qwen-Agent-main\examples\assistant_rag.py
Running on local URL: http://127.0.0.1:7860

Thanks for being a Gradio user! If you have questions or feedback, please join our Discord server and chat with us: https://discord.gg/feTf9x3ZSB

To create a public link, set share=True in launch().
2024-07-05 16:11:53,558 - simple_doc_parser.py - 324 - INFO - Read parsed C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf from cache.
2024-07-05 16:11:53,558 - doc_parser.py - 114 - INFO - Start chunking C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf (开发者权利声明1.pdf)...
2024-07-05 16:11:53,558 - doc_parser.py - 132 - INFO - Finished chunking C:\Users\Administrator\AppData\Local\Temp\gradio\88c59b31ab04a1b2ddb9c941679bf6e787fc093b\开发者权利声明1.pdf (开发者权利声明1.pdf). Time spent: 0.0 seconds.

开发者权利声明(1).pdf

This is the PDF file I uploaded. I hope you can reproduce and fix the problem.

JianxinMa commented 3 months ago

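This appears to be a different bug, triggered when the user only uploads a file and does not type a question. We had not tested this case before (we lack professional testing). I'm looking into the cause.

Judging from the log screenshot, the GBK encoding problem does seem to be resolved.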
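Until a fix is available, one possible client-side workaround is to make sure the last user message always carries some text. The sketch below is an assumption-laden illustration, not the fix that went into Qwen-Agent. It relies only on the message layout shown in the original post (`role`, `content`, `text`, `file`); the helper `ensure_text_query` and the default prompt are hypothetical.

```python
import copy

# Hypothetical placeholder prompt used when the user typed nothing.
DEFAULT_PROMPT = 'Please summarize the uploaded file.'


def ensure_text_query(messages, default_prompt=DEFAULT_PROMPT):
    """Return a copy of messages whose last user message has a non-empty text part."""
    patched = copy.deepcopy(messages)  # leave the caller's list untouched
    for msg in reversed(patched):
        if msg.get('role') != 'user':
            continue
        content = msg.get('content')
        if isinstance(content, str):
            if not content.strip():
                msg['content'] = default_prompt
        elif isinstance(content, list):
            has_text = any(
                isinstance(item, dict) and (item.get('text') or '').strip()
                for item in content
            )
            if not has_text:
                content.insert(0, {'text': default_prompt})
        break  # only the most recent user message is patched
    return patched


# A file-only message, the case described above.
file_only = [{'role': 'user', 'content': [
    {'file': 'https://jkwwt.acftu.org/jkwwtzcfg/202203/P020220325353963962863.pdf'},
]}]
patched = ensure_text_query(file_only)
print(patched[0]['content'][0])
```

The patched list could then be passed to `bot.run(...)` in place of the original messages.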

wuybo commented 3 months ago

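One more question: every answer draws on all of the files I have uploaded. For example, if I upload two files, say the Criminal Law and the Civil Code, and I only want the answer to be based on the Civil Code, where can I configure that?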

JianxinMa commented 3 months ago

That requires a different Agent implementation. The idea is to have the LLM decide up front which file to read (this adds one more LLM call, which is why it is not implemented in Assistant). See this example: https://github.com/QwenLM/Qwen-Agent/blob/main/examples/virtual_memory_qa.py (though it is not the most efficient implementation).
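The idea described above can be sketched as a two-step flow: first ask the model which files matter, then send only those files with the question. This is a rough illustration under stated assumptions, not the code in virtual_memory_qa.py. All function names here are hypothetical, and `ask_llm` stands in for whatever function sends a prompt to your model and returns its text reply.

```python
from typing import Callable, List


def build_selection_prompt(question: str, files: List[str]) -> str:
    """Ask the model to pick files by index so the reply is easy to parse."""
    listing = '\n'.join(f'{i}: {name}' for i, name in enumerate(files))
    return (
        'Which of these files are needed to answer the question? '
        'Reply with the file numbers separated by commas.\n'
        f'Files:\n{listing}\nQuestion: {question}'
    )


def select_files(question: str, files: List[str], ask_llm: Callable[[str], str]) -> List[str]:
    """Extra LLM call: return the subset of files the model chose."""
    reply = ask_llm(build_selection_prompt(question, files))
    picked: List[str] = []
    for token in reply.replace(',', ' ').split():
        if token.isdecimal() and int(token) < len(files):
            name = files[int(token)]
            if name not in picked:
                picked.append(name)
    return picked or list(files)  # fall back to every file if the reply is unusable


def build_messages(question: str, files: List[str], ask_llm: Callable[[str], str]):
    """Build a user message that carries only the selected files."""
    selected = select_files(question, files, ask_llm)
    content = [{'text': question}] + [{'file': name} for name in selected]
    return [{'role': 'user', 'content': content}]


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call; it always picks file 1."""
    return '1'


uploaded = ['criminal_law.pdf', 'civil_code.pdf']
messages = build_messages('What does the Civil Code say about contracts?', uploaded, fake_llm)
print(messages)
```

Asking for indices rather than file names keeps parsing simple, and the fallback to all files means a malformed reply degrades to the current behaviour instead of dropping context.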

JianxinMa commented 3 months ago

The main branch now fixes the bug where no answer was returned when a file was uploaded without any typed text.