codefuse-ai / MFTCoder

A high-accuracy and efficient multi-task fine-tuning framework for Code LLMs. This work has been accepted by KDD 2024.

How to construct the prompt and stop tokens for codefuse-llamacode #23

Closed wengyuan722 closed 10 months ago

wengyuan722 commented 11 months ago

I used vLLM to build an API for codefuse-llamacode32b-int4. When constructing the ChatOpenAI client, how should I set the stop tokens and the prompt?

```python
llm = ChatOpenAI(
    streaming=True,
    verbose=True,
    callbacks=[callback],
    openai_api_key="none",
    openai_api_base="https://u120320-a3ae-697ce3fb.neimeng.seetacloud.com:6443/v1",
    model_name="qwen",
    stop=["<|im_end|>", "<|im_start|>"],
)
```

Another question: why does the GPTQ version of llamacode32b require as much GPU memory under vLLM as qwen72b int4? A plain llamacode32b int4 normally runs within 24 GB, but deploying it with vLLM consumes 4×24 GB.

guigui1123 commented 11 months ago

Hi,

1) The prompt concatenation logic is as shown in the code:

```python
HUMAN_ROLE_START_TAG = "<|role_start|>human<|role_end|>"
BOT_ROLE_START_TAG = "<|role_start|>bot<|role_end|>"
text = f"{HUMAN_ROLE_START_TAG}write a python function of quick sort.{BOT_ROLE_START_TAG}"
```

2) For the stop-token setting in vLLM, see the parameters in the official vLLM documentation: https://github.com/vllm-project/vllm/. Set `stop_token_ids=[tokenizer.eos_token_id]`, for example:

```python
from vllm.sampling_params import SamplingParams

sampling_kwargs = {
    "stop_token_ids": stop_words_ids,
    "early_stopping": False,
    "top_p": generation_config.top_p,
}
```

3) Please confirm whether vLLM supports GPTQ-4bit quantized models. At the moment the official vLLM should not support GPTQ-quantized models; it only supports AWQ and SqueezeLLM.
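
Putting 1) and 2) together, here is a minimal offline-inference sketch; the model path and sampling values are placeholders rather than a verified configuration:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_PATH = "/path/to/CodeFuse-CodeLlama-34B"  # placeholder path

HUMAN_ROLE_START_TAG = "<|role_start|>human<|role_end|>"
BOT_ROLE_START_TAG = "<|role_start|>bot<|role_end|>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
llm = LLM(model=MODEL_PATH, trust_remote_code=True)

# Build the prompt with the role tags and stop generation at the eos token.
prompt = f"{HUMAN_ROLE_START_TAG}write a python function of quick sort.{BOT_ROLE_START_TAG}"
sampling_params = SamplingParams(
    temperature=0.2,
    top_p=0.95,
    max_tokens=512,
    stop_token_ids=[tokenizer.eos_token_id],
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```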

wengyuan722 commented 10 months ago

GPTQ can now run on vLLM via a source build; I already have 72b int4 working. The official release has not merged it yet.

wengyuan722 commented 10 months ago

One question: what is the stop token for codefuse-llamacode? I set stop to `<|role_end|>` and it has no effect.

wengyuan722 commented 10 months ago

I got codefuse-llamacode 32b-int4 running, but when I call it through the API the generation does not stop and the replies come back as garbled text. Could we work together to get codefuse-llamacode 32b-int4 running properly?

wengyuan722 commented 10 months ago

No, the prompt was not that template; it was a prompt I had originally written for other models, so I will fix that first.

The other question is why the 32b GPTQ model consumes the same resources as qwen 72b. Could you test this as well and see whether it can be optimized?

guigui1123 commented 10 months ago

1) For the garbled output, please confirm that the prompt is concatenated with the template provided above.
2) Change the stop id to `</s>`.
3) Set the tokenizer's eos_token to `</s>`.
4) Print the tokenizer info to confirm it is correct (see the sketch below).
5) codefuse-llamacode32b-int4 runs fine with the demo provided in the repo; please check whether it is an environment/image issue.
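
A quick way to do checks 2)–4), assuming a standard Hugging Face tokenizer (the path below is a placeholder):

```python
from transformers import AutoTokenizer

MODEL_PATH = "/path/to/codefuse-codellama-34b-int4"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Inspect the special tokens the tokenizer actually uses.
print("eos_token:", tokenizer.eos_token, "eos_token_id:", tokenizer.eos_token_id)

# If eos_token is not "</s>", set it explicitly and use its id as the stop id.
tokenizer.eos_token = "</s>"
stop_token_id = tokenizer.convert_tokens_to_ids("</s>")
print("stop_token_id:", stop_token_id)
```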

wengyuan722 commented 10 months ago

https://github.com/chu-tianxiang/vllm-gptq — could you test this as well? This fork can run GPTQ.

wengyuan722 commented 10 months ago

It has to be built from source, and the environment has to match your CUDA version. I recommend torch-2.1.0 and xformers==0.0.22.post7.

1. install torch xformers

pip install torch==2.1.0 xformers==0.0.22.post7 --index-url https://download.pytorch.org/whl/cu118

2. install vllm requirements

pip install -r requirements.txt

3. install packaging, which is required by pyproject.toml but not listed in requirements.txt

pip install packaging

4. install vllm from source with --no-build-isolation

pip install -e . --no-build-isolation

guigui1123 commented 10 months ago

https://github.com/codefuse-ai/MFTCoder/issues/23#issuecomment-1851204923 We will follow up on vllm-gptq and reply once we have it running.

wengyuan722 commented 10 months ago

I hope you can get it running and optimize it at the same time. An int4 model of this size normally runs on a single 3090, yet this one consumes 4×24 GB of resources; 48 GB should normally be enough.

wengyuan722 commented 10 months ago

Another question: to extend the context from 4k to 16k, I saw a reply on Hugging Face suggesting modifying config.json with `"rope_scaling": {"factor": 4.0, "type": "linear"}`. Could you confirm whether this achieves the extension?

twelveand0 commented 10 months ago

> Another question: to extend the context from 4k to 16k, I saw a reply on Hugging Face suggesting modifying config.json with `"rope_scaling": {"factor": 4.0, "type": "linear"}`. Could you confirm whether this achieves the extension?

Since we set the rope base to 10000 when fine-tuning this model, it cannot support a 16K context length, so directly extending to 16K will not really work.

wengyuan722 commented 10 months ago

@twelveand0 Thanks. Roughly how far can the context be extended, and is the configuration above the way to do it? Also, I hope you can add training for function calling. I tested the following prompt format and the accuracy was not high:

```
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [read_data, plot_grouped_bar_chart]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
```

wengyuan722 commented 10 months ago

I also tried vLLM; it builds a new prompt on top of mine and the output is garbled:

```
prompt: "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n<|role_start|>human<|role_end|>write a python function of quick sort<|role_start|>bot<|role_end|> [/INST]"
```

I am not sure whether the sys section should be removed.

guigui1123 commented 10 months ago

There may be a latent bug in gptq-4bit with vLLM at the moment; it has not been merged into the vLLM main branch yet. I suggest using vLLM to load the non-quantized codefuse-codellama model; I have tested that offline and it works. The prompt concatenation logic is still the one mentioned above.

wengyuan722 commented 10 months ago

@guigui1123 Thanks. The main problem is that I don't have enough resources for the non-quantized version, so I can only use the regular deployment. Another question: when I ask the model to continue writing code, it keeps re-reading the file. I ask it to do data analysis, but every time I ask a question it calls read_csv again. How can I solve this? My prompt is: "You are a data analyst doing data analysis with Python code in a notebook. Please continue writing on top of the current code. You can work with files the user has uploaded to the computer; always remember that the default file storage path is ."

Is there a way to make the model continue the existing code instead of starting over by re-reading the file?

twelveand0 commented 10 months ago

> @guigui1123 Thanks. The main problem is that I don't have enough resources for the non-quantized version, so I can only use the regular deployment. Another question: when I ask the model to continue writing code, it keeps re-reading the file. I ask it to do data analysis, but every time I ask a question it calls read_csv again. How can I solve this? My prompt is:
>
> "You are a data analyst doing data analysis with Python code in a notebook. Please continue writing on top of the current code. You can work with files the user has uploaded to the computer; always remember that the default file storage path is ." Is there a way to make the model continue the existing code instead of starting over by re-reading the file?

What inference format are you using? Is it the one below?

```
<|role_start|>human<|role_end|># language: Python
from typing import List

def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
    separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string. 
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """
<|role_start|>bot<|role_end|>
```

wengyuan722 commented 10 months ago

@twelveand0 Thanks. I built it from your example and then adapted it for streaming output myself:

@app.post("/v1/chat/completions", response_model=ChatCompletionResponse) async def create_chat_completion(request: ChatCompletionRequest): global model, tokenizer print(111,request.messages)

#prompt='你是一个数据分析师,请使用python进行数据分析。我已提供文件titanic.csv,文件地址是/mnt/bst/,请你分析一下这个文件,先读取这个文件,请写出python代码'
prompt=request.messages[0].content
prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
inputs =  f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

input_ids = tokenizer.encode(inputs, 
                              return_tensors="pt", 
                              padding=True, 
                              add_special_tokens=False).to("cuda")

if request.stream:
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    #streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_params = {
        "eos_token_id": tokenizer.eos_token_id,
        "pad_token_id": tokenizer.eos_token_id,
        "do_sample": True,
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,
        "max_new_tokens": 512,
        "repetition_penalty": 1.1
    }

    generation_config = GenerationConfig(**generation_params)
    generate = predict2(request.model,input_ids, streamer,generation_config)
    return EventSourceResponse(generate, media_type="text/event-stream")
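
The `predict2` helper is not shown here; a minimal sketch of what such a streaming generator could look like, assuming `model.generate` runs in a background thread and the yielded chunks are consumed by `EventSourceResponse`, is:

```python
from threading import Thread

def predict2(model_name, input_ids, streamer, generation_config):
    # Run generation in a background thread so the streamer can be consumed here.
    thread = Thread(
        target=model.generate,
        kwargs={
            "input_ids": input_ids,
            "streamer": streamer,
            "generation_config": generation_config,
        },
    )
    thread.start()

    # Yield each decoded chunk as a server-sent event payload.
    for new_text in streamer:
        yield {"data": new_text}
    yield {"data": "[DONE]"}
```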

wengyuan722 commented 10 months ago

For multi-turn dialogue I just concatenate the history into the prompt. The problem now is that across turns it re-reads the file instead of building on the code from earlier turns.

guigui1123 commented 10 months ago

Hi, the official vLLM now supports the GPTQ quantization algorithm. Just upgrade to the latest vLLM (0.2.6) and it works; it tests OK on our side.
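
As a rough illustration only (the model path is a placeholder and the settings are not a verified configuration), loading a GPTQ checkpoint with vLLM >= 0.2.6 looks roughly like this:

```python
from vllm import LLM, SamplingParams

# Placeholder path to the GPTQ-quantized checkpoint.
MODEL_PATH = "/path/to/CodeFuse-CodeLlama-34B-4bits"

llm = LLM(
    model=MODEL_PATH,
    quantization="gptq",   # enable the GPTQ support added in vLLM 0.2.6
    dtype="float16",
    trust_remote_code=True,
)

prompt = "<|role_start|>human<|role_end|>write a python function of quick sort.<|role_start|>bot<|role_end|>"
outputs = llm.generate([prompt], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```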

wengyuan722 commented 10 months ago

@guigui1123 Thanks. Could you share the calling script and how you design the prompt and stop tokens?

wengyuan722 commented 10 months ago

@guigui1123 I tested it this morning; it runs and returns normal output. One remaining issue: why does 34b-int4 consume 4×24 GB of GPU memory? Normally 2×24 GB should be enough. Running qwen72b-int4 also takes only 4×24 GB. Could you help look into the cause?

wengyuan722 commented 10 months ago

@guigui1123 I just tested CodeLlama-34B-Instruct-GPTQ with 48 GB of GPU memory and it runs, which confirms that a regular 34b-int4 only needs 48 GB. Could you help check why codefuse-llama34b-int4 runs out of memory with 48 GB?

```
INFO 12-21 15:19:00 api_server.py:727] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], block_size=16, chat_template=None, disable_log_requests=False, disable_log_stats=False, download_dir=None, dtype='auto', enforce_eager=False, engine_use_ray=False, gpu_memory_utilization=0.9, host='127.0.0.1', load_format='auto', max_context_len_to_capture=8192, max_log_len=None, max_model_len=8192, max_num_batched_tokens=8192, max_num_seqs=256, max_paddings=256, max_parallel_loading_workers=None, model='/root/autodl-tmp/CodeLlama-34B-Instruct-GPTQ', pipeline_parallel_size=1, port=6006, quantization='gptq', response_role='assistant', revision=None, seed=0, served_model_name='llama', ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_revision=None, trust_remote_code=True, worker_use_ray=False)
WARNING 12-21 15:19:00 config.py:175] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 12-21 15:19:00 config.py:187] gptq does not support CUDA graph yet. Disabling CUDA graph.
INFO 12-21 15:19:00 llm_engine.py:73] Initializing an LLM engine with config: model='/root/autodl-tmp/CodeLlama-34B-Instruct-GPTQ', tokenizer='/root/autodl-tmp/CodeLlama-34B-Instruct-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=gptq, enforce_eager=True, seed=0)
INFO 12-21 15:19:20 llm_engine.py:223] # GPU blocks: 7820, # CPU blocks: 1365
WARNING 12-21 15:19:22 api_server.py:123] No chat template provided. Chat API will not work.
INFO:     Started server process [887]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:6006 (Press CTRL+C to quit)
INFO:     61.241.216.31:0 - "GET / HTTP/1.1" 404 Not Found
```

twelveand0 commented 10 months ago

> @guigui1123 I just tested CodeLlama-34B-Instruct-GPTQ with 48 GB of GPU memory and it runs, which confirms that a regular 34b-int4 only needs 48 GB. Could you help check why codefuse-llama34b-int4 runs out of memory with 48 GB?

Why would 34B INT4 need 48 GB of GPU memory? We have deployed codefuse-codellama-34b-int4 on a single A10 (24 GB); if the memory requirement were what you describe, it could not be deployed on a single card.

wengyuan722 commented 10 months ago

With vLLM it needs more GPU memory; I don't know why, but 24 GB is not enough to run it under vLLM.

wengyuan722 commented 10 months ago

Also, 24 GB cannot handle multi-turn dialogue; it runs out of memory after a few turns.

wengyuan722 commented 10 months ago

Also, will you keep updating the open-source models? From my tests of codefuse llamacode, only Python accuracy is relatively high; accuracy for SQL, ECharts, and other languages is low. Its context understanding is also weaker than general-purpose models, so the RAG results are not very good.

guigui1123 commented 10 months ago

@wengyuan722 Please check the vLLM hyperparameters gpu_memory_utilization and max_model_len. Because vLLM pre-allocates GPU memory, we do not recommend setting gpu_memory_utilization too high. On our side, codefuse-codellama-gptq-4bit works with vLLM on a single A10.
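
For example, a sketch of more conservative settings for a single 24 GB card (the exact values are illustrative, not a tested recommendation):

```python
from vllm import LLM

llm = LLM(
    model="/path/to/CodeFuse-CodeLlama-34B-4bits",  # placeholder path
    quantization="gptq",
    dtype="float16",
    gpu_memory_utilization=0.85,  # leave headroom rather than pushing toward 0.95
    max_model_len=4096,           # cap the context so the pre-allocated KV cache is smaller
    trust_remote_code=True,
)
```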

wengyuan722 commented 10 months ago

@guigui1123 Thanks. I set gpu_memory_utilization=0.9 and max_model_len=8192. What values would you recommend?