ChatGLM3-6B-32k hangs the machine for more than 30 minutes

estherche113 commented 6 months ago

System Info

transformers 4.37.2, CUDA 11.8, two NVIDIA GeForce RTX 3090 24GB GPUs

Who can help?

No response

Information

[ ] The official example scripts
[ ] My own modified scripts and tasks

Reproduction

The LLM takes more than 30 minutes to generate a long text stream that repeats the same content hundreds of times. The model appears unstable. Does anyone know how to fix this?

Steps to reproduce: we loaded the model on two GPUs and called it twice, each time with a specific system prompt and query. The issue occurs on the second model.chat() call. The code that triggers it follows:

=================================================================

#ChatGLM3-6B-32k, loaded on 2 * RTX3090 24GB GPU cards
import random
import numpy as np
import torch

# from torch import cuda
from torch.nn import Module
from transformers import AutoTokenizer, AutoModel, AutoConfig
from langchain.llms.base import LLM
import os
from typing import Dict, Union, Optional, List

GPUS_AVAILABLE=2
# Raw string so the backslashes in the Windows path are not read as escape sequences
GLM3_MODEL_PATH = r"D:\projectGLM3\ChatGLM3\chatglm3-6b-32k\chatglm3-6b-32k"

"""From https://github.com/THUDM/ChatGLM3/blob/main/langchain_demo/ChatGLM3.py"""
class ChatGLM3(LLM):
    max_token: int = 8192
    do_sample: bool = True
    temperature: float = 0.1
    top_p: float = 0.8
    tokenizer: object = None
    model: object = None
    history: List = []
    has_search: bool = False

    def __init__(self):
        super().__init__()

    @property
    def _llm_type(self) -> str:
        return "ChatGLM3"

    def load_model(self, llm_device="gpu", model_name_or_path=GLM3_MODEL_PATH):
        model_config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,trust_remote_code=True)
        # Use the path passed to this method instead of always using the global default
        self.model = load_model_on_gpus(model_name_or_path, num_gpus=GPUS_AVAILABLE, config=model_config)

    def _tool_history(self, prompt: str):
        raise Exception("Dummy")

    def _extract_observation(self, prompt: str):
        raise Exception("Dummy")

    def _extract_tool(self):
        raise Exception("Dummy")

    def _call(self, prompt: str, history: List = [], stop: Optional[List[str]] = ["<|user|>"]):
        raise Exception("Dummy")

def load_tokenizer():
    tokenizer = AutoTokenizer.from_pretrained(GLM3_MODEL_PATH, trust_remote_code=True)
    return tokenizer

"""From  https://github.com/THUDM/ChatGLM2-6B/blob/main/utils.py"""
def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # transformer.word_embeddings counts as 1 layer
    # transformer.final_layernorm and lm_head together count as 1 layer
    # transformer.layers counts as 28 layers
    # 30 layers in total are distributed across num_gpus cards
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus

    # bugfix: on Linux, the weight and input passed to torch.embedding can end up on different devices, causing a RuntimeError
    # on Windows, model.device is set to transformer.word_embeddings.device
    # on Linux, model.device is set to lm_head.device
    # when chat or stream_chat is called, input_ids is placed on model.device
    # if transformer.word_embeddings.device differs from model.device, a RuntimeError is raised
    # so transformer.word_embeddings, transformer.final_layernorm and lm_head are all placed on the first card
    # this file comes from https://github.com/THUDM/ChatGLM-6B/blob/main/utils.py
    # with only minor changes here to support ChatGLM2
    device_map = {
        'transformer.embedding.word_embeddings': 0,
        'transformer.encoder.final_layernorm': 0,
        'transformer.output_layer': 0,
        'transformer.rotary_pos_emb': 0,
        'lm_head': 0
    }

    used = 2
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.encoder.layers.{i}'] = gpu_target
        used += 1

    return device_map

"""From  https://github.com/THUDM/ChatGLM2-6B/blob/main/utils.py"""
def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
                       device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
    if num_gpus < 2 and device_map is None:
        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
    else:
        from accelerate import dispatch_model

        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()

        if device_map is None:
            device_map = auto_configure_device_map(num_gpus)

        model = dispatch_model(model, device_map=device_map)

    return model

TOKENIZER = load_tokenizer()

CLIENT = ChatGLM3()
CLIENT.load_model()

TEMPERATURE = 0.1

def set_random_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
set_random_seed(3234)

query = "總結2023施政報告"

"""Step 1"""
print("\n\n=============Entering the 1-th call...=============")
system_info_1 = {
    'role': 'system', 
    'content': '你是一位专家研究助理。 用户将询问您一个查询问题。\n针对查询问题建议最多五个其他相关问题，以帮助他们找到所需的信息。\n仅建议简短的问题，不建议复合句。 提出涵盖主题不同方面的各种问题。\n确保它们是完整的问题，并且与原始问题相关。\n每行输出一个问题。 不要给问题编号。\n=====例子=====\n原始查询：\n     总结一下文档\n回答：\n     文件索引\n     表中的内容\n     文件结论'
}
print(system_info_1)

print("\n=============Calling ChatGLM3...=============")
reply_1, history_1 = CLIENT.model.chat(
    tokenizer = TOKENIZER, 
    query = query, 
    history = [system_info_1],
    temperature = TEMPERATURE
)
print(reply_1+"\n\n")

"""Step 2"""
print("\n\n=============Entering the 2-th call...=============")
formatted_prompt="你是一位知识助手，使用以下知识库和聊天记录来尽力回答问题。\n\n- 知识库提供了以下相关知识：\n========\n资料来源：policy-full_en.pdf，3%\nI. Foreword\n\n1. This is my second Policy Address as your Chief Executive. In putting pen to paper for the 2023 Policy Address, I pondered what measures implemented over the past year have been most well‑received by our community. I contemplated, too, which areas we should deepen and what new areas and development directions we should pursue. Throughout the process, I felt the weight of my responsibilities on my shoulders.\n\n2. Since I took office, I have led the Government to embrace a result‑oriented culture, building an administration with the focus on actions and delivery of results, as well as strengthening our co‑operation and team spirit.\n\n========\n资料来源：policy-full_en.pdf，75%\n73\n\n2.\n\nIntroduce a bill into the LegCo within 2024 to enhance the protection of cybersecurity of critical infrastructure. (SB)\n\n3.\n\nSet up a national security exhibition gallery in 2024 to enhance the promotion of national security, with the target attendance of no less than 100 000 in 2025, and provide community courses on national security for at least 2 600 trainees in 2025. Based on the sharing of the message on safeguarding national security with 30 persons by every trainee, the number of beneficiaries will be no fewer than 78 000. (Committee for Safeguarding National Security of the Hong Kong Special Administrative Region, SB)\n\nPatriotic Education 4.\n\nPromote patriotic education in the community through the set‑up of the Hong Kong Museum of the War of Resistance and Coastal Defence within 2024 and the Chinese Culture Promotion Office under the LCSD in Q2 2024. Starting from 2024, LCSD will organise:\n\n\n\nover\t50 activities to promote Chinese culture and history annually; and\n\n========\n资料来源：policy-full_en.pdf，75%\n2. 
Monitoring progress aside, the setting of these indicators has enabled timely intervention or adjustments when needed to help the departments to address problems encountered. This will help foster the result‑oriented culture in the Government. Overall, the performance of the departments in the past year has generally met my expectations.\n\n3. For the 2023 Policy Address, I have set a total of 150 indicators, of which 73 are new indicators. The remaining 77 were introduced in the 2022 Policy Address, which are still on‑going and valid. The details are set out as follows:\n\nIndicators for Specified Tasks in 2023 Policy Address\n\n(I) New Indicators\n\nUphold the Principle of “One Country, Two Systems” and Safeguard National Security 1.\n\nPress ahead with the legislation on Article 23 of the Basic Law by putting forward effective legislative proposals, with a view to completing the legislative work within 2024. (SB)\n\n73\n\n2.\nNone\n\n==========\n- 你需要回答以下新问题。\n新问题：\n總結2023施政報告\n\n==========\n- 如果找到答案，请以简洁的方式写出答案，使用与问题相同的语言作答\n- 如果不知道答案或知识库中不包含答案，回答‘抱歉，我不知道’。不要编造任何答案。\n- 在答案中始终添加直接用于得出答案的来源和页数列表，排除与最终答案无关的来源，合并答案中的重复来源。\n- 提示，知识库在上文中提供了以下来源：policy-full_en.pdf\n有帮助的答案：\n资料来源："
system_info_2 = [{
    "role": "system",
    "content": formatted_prompt
}]
print(system_info_2)

print("\n=============Calling ChatGLM3...=============")
reply_2, history_2 = CLIENT.model.chat(     #<=Run into a long loop, generating repeated content
    TOKENIZER, 
    query, 
    history = system_info_2,
    temperature = TEMPERATURE,
)
print(reply_2)

========================================================================
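
To see how the auto_configure_device_map helper in the script splits the model across cards, its counting loop can be replayed without loading any weights. This is a minimal sketch that copies only the layer-counting logic; `layers_per_gpu` is a hypothetical helper name and is not part of the original script.

```python
from collections import Counter
from typing import Dict


def layers_per_gpu(num_gpus: int, num_trans_layers: int = 28) -> Dict[int, int]:
    """Replay the counting loop of auto_configure_device_map (no model needed)."""
    per_gpu_layers = 30 / num_gpus  # 28 layers + 2 slots for embeddings and head
    used = 2                        # GPU 0 already holds the embeddings and head
    gpu_target = 0
    counts: Counter = Counter()
    for _ in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        counts[gpu_target] += 1
        used += 1
    return dict(counts)


print(layers_per_gpu(2))  # {0: 13, 1: 15}
```

With two cards, the first one receives 13 transformer layers plus the embedding and output modules, and the second receives the remaining 15 layers.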

Expected behavior

We expect an answer within a minute, but the model appears unstable. Does anyone know how to fix this bug?

Sometimes changing the seed, the temperature, or a few characters of the input avoids the problem, but we cannot tell when it will occur again.
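
A possible explanation, not confirmed in this thread, is that temperature = 0.1 makes sampling almost greedy: once a repeated sentence becomes the most likely continuation, the sampler rarely picks anything else, and only a small change to the seed or input nudges it onto another path. The sketch below uses made-up logits (an assumption for illustration, not values from ChatGLM3) to show how strongly a low temperature concentrates probability on the top token.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.5, 1.0]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 1.0 the top token gets roughly half of the probability mass; at 0.1 it gets more than 99%.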

hslam007 commented 6 months ago

Does anyone know whether the random seed must be changed on every call, or whether it can be fixed at one value?

zRzRzRzRzRzRzR commented 6 months ago

It can be fixed. What do you mean by the "repeats 100 times" you described?
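
To illustrate what fixing the seed means in practice, here is a minimal sketch that uses Python's `random` module as a stand-in for the model's sampler. The vocabulary, weights, and the `sample_tokens` helper are hypothetical and only demonstrate the principle.

```python
import random
from typing import List


def sample_tokens(seed: int, n: int = 5) -> List[str]:
    """Draw n tokens from a toy vocabulary using a private, seeded RNG."""
    rng = random.Random(seed)
    vocab = ["A", "B", "C", "D"]
    weights = [0.4, 0.3, 0.2, 0.1]
    return rng.choices(vocab, weights=weights, k=n)


# The same seed reproduces the same draws.
print(sample_tokens(3234) == sample_tokens(3234))
```

The flip side is that a fixed seed also reproduces a bad run: if a given prompt loops with one seed, it should be expected to loop again with that seed, so a fixed seed aids debugging rather than preventing repetition.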

estherche113 commented 6 months ago

> It can be fixed. What do you mean by the "repeats 100 times" you described?

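The model starts repeating a single sentence more than 800 times until it suddenly stops, which takes over 40 minutes. No error message appeared. Below is the model's output (the repeated middle part is omitted):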

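The 2023 Policy Address mainly covers the following areas:

1. Upholding the principle of "One Country, Two Systems" and safeguarding national security: the address proposes pressing ahead with legislation on Article 23 of the Basic Law to safeguard national sovereignty, security and development interests.

2. Setting up a national security exhibition gallery: the address proposes setting up a national security exhibition gallery in 2024 to strengthen national security awareness and raise the public's national security awareness.

3. Strengthening patriotic education: the address proposes setting up a Hong Kong war resistance and coastal defence museum and a Chinese cultural exchange office in 2024 to promote patriotic education.

4. Improving government effectiveness: the address stresses fostering a result-oriented culture, building a government focused on actions and results, and strengthening co-operation and team spirit among departments.

5. Enhancing cybersecurity protection: the address proposes introducing cybersecurity legislation into the LegCo in 2024 to protect the cybersecurity of critical infrastructure.

6. Improving government effectiveness: the address stresses fostering a result-oriented culture, building a government focused on actions and results, and strengthening co-operation and team spirit among departments.

7. Improving government effectiveness: the address stresses fostering a result-oriented culture, building a government focused on actions and results, and strengthening co-operation and team spirit among departments.

……

862. Improving government effectiveness: the address stresses fostering a result-oriented culture, building a government focused on actions and results, and strengthening co-operation and team spirit among departments.

863. Improving government effectiveness: the address stresses fostering a result-oriented culture, building a government focused on actions and results, and strengthening co-operation and team spirit among departments.

864. Improving government effectiveness: the address stresses fostering a result-oriented culture, building a government focused on actions and results, and strengthening co-operation and team spirit among departments.

865. Improving government effectiveness: the address stresses fostering a result-oriented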

zRzRzRzRzRzRzR commented 6 months ago

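I see. Then try setting repetition_penalty to 1.2 (this is available in the composite demo). If you want to use it in another demo, you need to modify the code the same way that demo does.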
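
For readers unfamiliar with the setting, here is a minimal sketch of the usual repetition-penalty rule: scores of tokens that were already generated are pushed down before the next token is sampled. The logits and token ids are made up for illustration, and `apply_repetition_penalty` is a hypothetical helper, not the code path ChatGLM3 actually runs.

```python
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], generated_ids: Sequence[int], penalty: float = 1.2
) -> List[float]:
    """Lower the score of every token id that has already been generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores are divided, negative scores multiplied, so the
        # token always becomes less likely when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


toy_logits = [3.0, -1.0, 0.5]   # hypothetical scores for token ids 0, 1, 2
already_generated = [0, 1, 0]   # ids 0 and 1 were produced earlier
print(apply_repetition_penalty(toy_logits, already_generated))
```

In the reproduction script this would presumably mean adding repetition_penalty=1.2 to the CLIENT.model.chat(...) call, assuming chat() forwards extra keyword arguments to generation; the model's own modeling code is the place to confirm that.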

estherche113 commented 6 months ago

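> I see. Then try setting repetition_penalty to 1.2 (this is available in the composite demo). If you want to use it in another demo, you need to modify the code the same way that demo does.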

After setting repetition_penalty=1.2 the output is normal. Thank you very much! The result is also more natural than cutting the output off with max_new_tokens.

hk92292831 commented 6 months ago

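> I see. Then try setting repetition_penalty to 1.2 (this is available in the composite demo). If you want to use it in another demo, you need to modify the code the same way that demo does.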

Will any error or status code be returned when repetitive replies are stopped by "repetition_penalty"? If so, we could detect it and retry generation with another random seed.
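
A repetition penalty adjusts token scores while text is being generated rather than stopping anything, and model.chat() in the script above returns only the reply and the history, so a caller that wants to retry would likely need to inspect the reply itself. A minimal sketch of such a check follows; `looks_repetitive` and its threshold are hypothetical and would need tuning for real output.

```python
import re
from collections import Counter


def looks_repetitive(reply: str, max_repeats: int = 3) -> bool:
    """Return True if any non-empty line (ignoring list numbers) repeats too often."""
    counts: Counter = Counter()
    for line in reply.splitlines():
        # Drop a leading list number such as "12. " so numbered repeats match.
        body = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        if body:
            counts[body] += 1
    return any(n > max_repeats for n in counts.values())


looping = "\n".join(f"{i}. same sentence" for i in range(1, 11))
normal = "1. first point\n2. second point\n3. third point"
print(looks_repetitive(looping), looks_repetitive(normal))
```

A caller could run this on each reply and, when it returns True, call the model again with a different seed.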
