如何支持过长文本的上下文语义关联的？

BomanNg commented 1 year ago

GPT-3.5Turpo的官方API文档的解释是，单轮的对话，包括发送的message和返回的message不超过4k个tokens。

而对于”读论文“的功能，我理解的是，需要一次性地把相当长的论文作为message发送到API才能返回效果较好的摘要。但是很显然，这样单轮的会话远远超过了4k个tokens的限制。

想请教一下是如何实现单轮的长对话的？

BomanNg commented 1 year ago

我注意到bridge_chatgpt.py中的一段代码：

                except Exception as e:
                    traceback.print_exc()
                    yield from update_ui(chatbot=chatbot, history=history, msg="Json解析不合常规") # 刷新界面
                    chunk = get_full_error(chunk, stream_response)
                    error_msg = chunk.decode()
                    if "reduce the length" in error_msg:
                        chatbot[-1] = (chatbot[-1][0], "[Local Message] Reduce the length. 本次输入过长，或历史数据过长. 历史缓存数据现已释放，您可以请再次尝试.")
                        history = []    # 清除历史
                    elif "Incorrect API key" in error_msg:
                        chatbot[-1] = (chatbot[-1][0], "[Local Message] Incorrect API key. OpenAI以提供了不正确的API_KEY为由，拒绝服务.")
                    elif "exceeded your current quota" in error_msg:
                        chatbot[-1] = (chatbot[-1][0], "[Local Message] You exceeded your current quota. OpenAI以账户额度不足为由，拒绝服务.")
                    else:
                        from toolbox import regular_txt_to_markdown
                        tb_str = '```\n' + traceback.format_exc() + '```'
                        chatbot[-1] = (chatbot[-1][0], f"[Local Message] 异常 \n\n{tb_str} \n\n{regular_txt_to_markdown(chunk.decode()[4:])}")
                    yield from update_ui(chatbot=chatbot, history=history, msg="Json异常" + error_msg) # 刷新界面
                    return

这是否也说明，对论文的处理还是没法绕过最长4k个tokens的限制呢？当超过了tokens limit则自动抛弃一部分上文。

binary-husky commented 1 year ago

https://github.com/binary-husky/chatgpt_academic/pull/366

Always-Naive commented 1 year ago

看了这个pr 依旧不是很理解这个过程大佬解释一下呗。如果请求openai api的时候， history的长度超过4k token，api是如何获得之前的 content 的信息的呢？我的理解是，假设我有 content1 content2 content3 在传到 content2 的时候我超过了token限制那么我带history请求的时候就会截断并舍弃掉一部分（e.g. content1的一部分被舍弃了）那么这时候我再问content1的相关内容的话我岂不是得到的都是chatgpt的猜测？

应该需要加embedding和索引来压缩历史信息并匹配才能做到理解全文，但是依旧会有最大长度限制。

zzcgithub commented 1 year ago

同问

DeriDer commented 1 year ago

@binary-husky 同问

BomanNg commented 1 year ago

看了这个pr 依旧不是很理解这个过程大佬解释一下呗。如果请求openai api的时候， history的长度超过4k token，api是如何获得之前的 content 的信息的呢？我的理解是，假设我有 content1 content2 content3 在传到 content2 的时候我超过了token限制那么我带history请求的时候就会截断并舍弃掉一部分（e.g. content1的一部分被舍弃了）那么这时候我再问content1的相关内容的话我岂不是得到的都是chatgpt的猜测？

应该需要加embedding和索引来压缩历史信息并匹配才能做到理解全文，但是依旧会有最大长度限制。

if conversation_cnt:
    for index in range(0, 2*conversation_cnt, 2):
        what_i_have_asked = {}
        what_i_have_asked["role"] = "user"
        what_i_have_asked["content"] = history[index]
        what_gpt_answer = {}
        what_gpt_answer["role"] = "assistant"
        what_gpt_answer["content"] = history[index+1]
        if what_i_have_asked["content"] != "":
            if what_gpt_answer["content"] == "": continue
            if what_gpt_answer["content"] == timeout_bot_msg: continue
            messages.append(what_i_have_asked)
            messages.append(what_gpt_answer)
        else:
            messages[-1]['content'] = what_gpt_answer['content']

看了这段代码，生成请求的message是从历史数据里读取的？。

再看这一段，就是将历史记录清楚掉。

                if "reduce the length" in error_msg:
                    chatbot[-1] = (chatbot[-1][0], "[Local Message] Reduce the length. 本次输入过长，或历史数据过长. 历史缓存数据现已释放，您可以请再次尝试.")
                    history = []    # 清除历史

在#366中意思应该就是用最简单的方法分割文段，分批请求，再将返回的结果拼接。

所以还是没能实现过长的上下文的衔接，有可能导致上文的概念在下文没有出现，或翻译成其它的意思。

BomanNg commented 1 year ago

@Always-Naive

应该需要加embedding和索引来压缩历史信息并匹配才能做到理解全文，但是依旧会有最大长度限制。

但是使用GPT3.5的话如何使用embedding呢？

Always-Naive commented 1 year ago

@BomanNg openai有embedding的接口每发送一段让GPT总结并调用embedding接口然后把所有的总结好的向量塞到一个数据库里并对应相关内容然后把用户的问题同样embedding 检索相似度最高的一个或几个embedding匹配把对应的内容添加到prompt里发送给gpt 不过这依旧是一个伪全文信息的实现所以我好奇这个pr的方法但试了一下发一段长点的pdf问abstract是什么返回的有些编造的内容基本可以判断只保留了比较靠后的 4000个token的信息但也有可能是我 naive 没看明白

binary-husky commented 1 year ago

@zzcgithub @DeriDer @Always-Naive @BomanNg @Hanzoe 我检查了一下之前的代码，然后重新改善了这个chatpdf的功能

def 解析PDF(file_name, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
    import tiktoken
    print('begin analysis on:', file_name)
    file_content, page_one = read_and_clean_pdf_text(file_name)  # 按章节切割PDF

    ############################## <第零步，切割PDF> ##################################
    # 切割PDF文件，每一块（尽量是完整的一个section，比如introduction，experiment等，必要时再进行切割）
    # 的长度必须小于 2500 个 Token
    TOKEN_LIMIT_PER_FRAGMENT = 2500

    from .crazy_utils import breakdown_txt_to_satisfy_token_limit_for_pdf
    from toolbox import get_conf
    enc = tiktoken.encoding_for_model(*get_conf('LLM_MODEL'))
    def get_token_num(txt): return len(enc.encode(txt))
    paper_fragments = breakdown_txt_to_satisfy_token_limit_for_pdf(
        txt=file_content,  get_token_fn=get_token_num, limit=TOKEN_LIMIT_PER_FRAGMENT)
    page_one_fragments = breakdown_txt_to_satisfy_token_limit_for_pdf(
        txt=str(page_one), get_token_fn=get_token_num, limit=TOKEN_LIMIT_PER_FRAGMENT//4)
    # 为了更好的效果，我们剥离Introduction之后的部分（如果有）
    paper_meta = page_one_fragments[0].split('introduction')[0].split('Introduction')[0].split('INTRODUCTION')[0]

    ############################## <第一步，从摘要中提取高价值信息，放到history中> ##################################
    final_results = []
    final_results.append(paper_meta)

    ############################## <第二步，迭代地历遍整个文章，提取精炼信息> ##################################
    i_say_show_user = f'首先你在英文语境下通读整篇论文。'; gpt_say = "[Local Message] 收到。"           # 用户提示
    chatbot.append([i_say_show_user, gpt_say]); yield from update_ui(chatbot=chatbot, history=[])    # 更新UI

    iteration_results = []
    last_iteration_result = paper_meta  # 初始值是摘要
    MAX_WORD_TOTAL = 4096
    n_fragment = len(paper_fragments)
    if n_fragment >= 20: print('文章极长，不能达到预期效果')
    for i in range(n_fragment):
        NUM_OF_WORD = MAX_WORD_TOTAL // n_fragment
        i_say = f"Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {paper_fragments[i]}"
        i_say_show_user = f"[{i+1}/{n_fragment}] Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {paper_fragments[i][:200]}"
        gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(i_say, i_say_show_user,  # i_say=真正给chatgpt的提问， i_say_show_user=给用户看的提问
                                                                           llm_kwargs, chatbot, 
                                                                           history=["The main idea of the previous section is?", last_iteration_result], # 迭代上一次的结果
                                                                           sys_prompt="Extract the main idea of this section."  # 提示
                                                                        ) 
        iteration_results.append(gpt_say)
        last_iteration_result = gpt_say

    ############################## <第三步，整理history> ##################################
    final_results.extend(iteration_results)
    final_results.append(f'接下来，你是一名专业的学术教授，利用以上信息，使用中文回答我的问题。')
    # 接下来两句话只显示在界面上，不起实际作用
    i_say_show_user = f'接下来，你是一名专业的学术教授，利用以上信息，使用中文回答我的问题。'; gpt_say = "[Local Message] 收到。"
    chatbot.append([i_say_show_user, gpt_say])

    ############################## <第四步，设置一个token上限，防止回答时Token溢出> ##################################
    from .crazy_utils import input_clipping
    _, final_results = input_clipping("", final_results, max_token_limit=3200)
    yield from update_ui(chatbot=chatbot, history=final_results) # 注意这里的历史记录被替代了

binary-husky commented 1 year ago

@zzcgithub @DeriDer @Always-Naive @BomanNg @Hanzoe 我检查了一下之前的代码，然后重新改善了这个chatpdf的功能

def 解析PDF(file_name, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
    import tiktoken
    print('begin analysis on:', file_name)
    file_content, page_one = read_and_clean_pdf_text(file_name)

    ############################## <第零步，从摘要中提取高价值信息，放到history中> ##################################
    # 递归地切割PDF文件，每一块（尽量是完整的一个section，比如introduction，experiment等，必要时再进行切割）
    # 的长度必须小于 2500 个 Token
    TOKEN_LIMIT_PER_FRAGMENT = 2500

    from .crazy_utils import breakdown_txt_to_satisfy_token_limit_for_pdf
    from toolbox import get_conf
    enc = tiktoken.encoding_for_model(*get_conf('LLM_MODEL'))
    def get_token_num(txt): return len(enc.encode(txt))
    paper_fragments = breakdown_txt_to_satisfy_token_limit_for_pdf(
        txt=file_content,  get_token_fn=get_token_num, limit=TOKEN_LIMIT_PER_FRAGMENT)
    page_one_fragments = breakdown_txt_to_satisfy_token_limit_for_pdf(
        txt=str(page_one), get_token_fn=get_token_num, limit=TOKEN_LIMIT_PER_FRAGMENT//4)
    # 为了更好的效果，我们剥离Introduction之后的部分（如果有）
    paper_meta = page_one_fragments[0].split('introduction')[0].split('Introduction')[0].split('INTRODUCTION')[0]

    ############################## <第一步，从摘要中提取高价值信息，放到history中> ##################################
    final_results = []
    final_results.append(paper_meta)

    ############################## <第二步，迭代地历遍整个文章，提取精炼信息> ##################################
    i_say_show_user = f'首先你在英文语境下通读整篇论文。'; gpt_say = "[Local Message] 收到。"           # 用户提示
    chatbot.append([i_say_show_user, gpt_say]); yield from update_ui(chatbot=chatbot, history=[])    # 更新UI

    iteration_results = []
    last_iteration_result = paper_meta  # 初始值是摘要
    MAX_WORD_TOTAL = 4096
    n_fragment = len(paper_fragments)
    if n_fragment >= 20: print('文章极长，不能达到预期效果')
    for i in range(n_fragment):
        NUM_OF_WORD = MAX_WORD_TOTAL // n_fragment
        i_say = f"Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {paper_fragments[i]}"
        i_say_show_user = f"[{i+1}/{n_fragment}] Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {paper_fragments[i][:200]}"
        gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(i_say, i_say_show_user,  # i_say=真正给chatgpt的提问， i_say_show_user=给用户看的提问
                                                                           llm_kwargs, chatbot, 
                                                                           history=["The main idea of the previous section is?", last_iteration_result], # 迭代上一次的结果
                                                                           sys_prompt="Extract the main idea of this section."  # 提示
                                                                        ) 
        iteration_results.append(gpt_say)
        last_iteration_result = gpt_say

    ############################## <第三步，整理history> ##################################
    final_results.extend(iteration_results)
    final_results.append(f'接下来，你是一名专业的学术教授，利用以上信息，使用中文回答我的问题。')
    # 接下来两句话只显示在界面上，不起实际作用
    i_say_show_user = f'接下来，你是一名专业的学术教授，利用以上信息，使用中文回答我的问题。'; gpt_say = "[Local Message] 收到。"
    chatbot.append([i_say_show_user, gpt_say])

    ############################## <第四步，设置一个token上限，防止回答时Token溢出> ##################################
    from .crazy_utils import input_clipping
    _, final_results = input_clipping("", final_results, max_token_limit=3200)
    yield from update_ui(chatbot=chatbot, history=final_results) # 注意这里的历史记录被替代了

总体而言，借助了之前翻译pdf中使用的文本切割算法，实现了比较精准的章节切割。
然后在对每个章节分别进行压缩，
每个章节压缩之后，会被用作下一个章节压缩时的上下文
（第一个章节的压缩以摘要为上下文）
整个文章都遍历一遍之后，再把搜集的数据进行拼接整合，超出token上限时需要对太长的部分进行选择性截断
最后作为上下文返回到对话界面 yield from update_ui(chatbot=chatbot, history=final_results) # 注意这里的历史记录被替代了

Always-Naive commented 1 year ago

感谢大佬合理多了强无敌

BomanNg commented 1 year ago

@Always-Naive

把对应的内容添加到prompt里发送给gpt

OpenAI的embedding我知道，但是embedding的responese body如：那么应该提取哪些字段，以怎样的形式发送到GPT3.5-Turpo的API呢？

BomanNg commented 1 year ago

@binary-husky 感谢更新。这样是更合理地切割文段了，并且依赖着上文对每个文段生成子摘要，再向子摘要的集合提供提示词进行问答。

    for i in range(n_fragment):
        NUM_OF_WORD = MAX_WORD_TOTAL // n_fragment
        i_say = f"Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {paper_fragments[i]}"
        i_say_show_user = f"[{i+1}/{n_fragment}] Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {paper_fragments[i][:200]}"
        gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(i_say, i_say_show_user,  # i_say=真正给chatgpt的提问， i_say_show_user=给用户看的提问
                                                                           llm_kwargs, chatbot, 
                                                                           history=["The main idea of the previous section is?", last_iteration_result], # 迭代上一次的结果
                                                                           sys_prompt="Extract the main idea of this section."  # 提示
                                                                        ) 
        iteration_results.append(gpt_say)
        last_iteration_result = gpt_say

    ############################## <第三步，整理history> ##################################
    final_results.extend(iteration_results)

Always-Naive commented 1 year ago

@BomanNg embedding 是用来进行查找的作者大佬现在的实现已经够用了加入embedding是为了建立索引假设你有这么一段话 xxxxxxxx 他的 embedding [1 ,6 , 63, 2] , 那么xxxxxxx这段话就对应到这个向量上了你问了个问题只因是什么? 这个问题也可以被 embedding 假设他的embedding[1 ,7, 31, 1]

这样我们可以计算问题和很多个段落向量的相似度假设问题和xxxxxx最相似,那么我prompt前面就加上 xxxxxxxx这段话然后提交给chatgpt让他依据这段话回复只是说这里的相关信息一定程度上节省了token 一定程度上规避了maxtoken的限制但实际上并不是全文理解

BomanNg commented 1 year ago

@Always-Naive 了解，谢谢。我之前以为有方法能够直接将embedding压缩后vector作为上文提交给GPT的接口，再配合prompt对vector进行解析。原来只是通过embedding找到相近文本作为上文递交给GPT。这样的话还是没法让GPT运用到训练语料之外的信息。另外就是，openAI家的embedding也不便宜啊~~

binary-husky / gpt_academic

如何支持过长文本的上下文语义关联的？ #430