Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model. A low-resource Chinese llama+lora solution, with a structure modeled on alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0
4.14k stars 425 forks

Garbled Chinese output #241

Closed: NewEricWang closed this issue 1 year ago

NewEricWang commented 1 year ago

I ran ./scripts/generate.sh directly, with the following configuration:

#!/bin/bash
TOT_CUDA="0" # Upgrade bitsandbytes to the latest version to enable balanced loading of multiple GPUs, for example: pip install bitsandbytes==0.39.0
BASE_MODEL="../LLM_pretrained_model/decapoda-research/llama-7b-hf" #"decapoda-research/llama-13b-hf"
# The next assignment overrides the one above.
BASE_MODEL="../pyllama/conv_models/7B"
LORA_PATH="../LLM_pretrained_model/Facico/Chinese-Vicuna-lora-7b-chatv1" #"./lora-Vicuna/checkpoint-final"
USE_LOCAL=1 # 1: use local model, 0: use huggingface model
TYPE_WRITER=0 # 1 # whether to stream the output
if [[ "$USE_LOCAL" -eq 1 ]]
then
    cp sample/instruct/adapter_config.json "$LORA_PATH"
fi
server_ip="192.168.0.22"
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64
# Upgrade bitsandbytes to the latest version to enable balanced loading of multiple GPUs
CUDA_VISIBLE_DEVICES=${TOT_CUDA} python generate.py \
    --model_path $BASE_MODEL \
    --lora_path $LORA_PATH \
    --use_local $USE_LOCAL \
    --use_typewriter $TYPE_WRITER \
    --server_ip $server_ip

In Firefox on Ubuntu 18:
input: 告诉我肚子疼吃什么药? ("Tell me what medicine to take for a stomachache?")
output: 你可以尝泡��药������������������荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍荍��

What is causing this garbled output, and how can I fix it?

NewEricWang commented 1 year ago

I also tested with the test script provided by Facico. The script is as follows:

import sys
import torch
from peft import PeftModel
import transformers
from transformers import LlamaTokenizer, LlamaForCausalLM

# tokenizer = LlamaTokenizer.from_pretrained("../LLM_pretrained_model/decapoda-research/llama-7b-hf")
# BASE_MODEL = "../LLM_pretrained_model/decapoda-research/llama-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained("../FastChat_lm-sys_20230630/conv_models/7B")
BASE_MODEL = "../FastChat_lm-sys_20230630/conv_models/7B"

# Load the base model in 8-bit and test it on an English prompt first.
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()
inputs = "Hello, Where is the capital of the United States?" #"你好,美国的首都在哪里?"
input_ids = tokenizer(inputs, return_tensors="pt")['input_ids']
print(input_ids)
input_ids = input_ids.to('cuda')
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
)
print(generation_output)
print(tokenizer.decode(generation_output[0]))

# Attach the LoRA weights and test on a Chinese prompt.
model = PeftModel.from_pretrained(
    model,
    "../LLM_pretrained_model/Facico/Chinese-Vicuna-lora-7b-chatv1",
    torch_dtype=torch.float16,
    device_map={'': 0}
)

inputs = "你好,中国的首都在哪里?" #"你好,美国的首都在哪里?"
inputs = "告诉我肚子疼吃什么药?"  # overrides the prompt above
input_ids = tokenizer(inputs, return_tensors="pt")['input_ids']
print(input_ids)
input_ids = input_ids.to('cuda')
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
)
print(generation_output)
print(tokenizer.decode(generation_output[0]))

The output is as follows:

tensor([[ 1, 15043, 29892, 6804, 338, 278, 7483, 310, 278, 3303, 3900, 29973]])
tensor([[ 1, 15043, 29892, 6804, 338, 278, 7483, 310, 278, 3303, 3900, 29973, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310, 278, 3303, 3900, 338, 7660, 29892, 360, 29889, 29907, 29889, 13, 1576, 7483, 310]], device='cuda:0')
Hello, Where is the capital of the United States?
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of the United States is Washington, D.C.
The capital of
tensor([[ 1, 29871, 31785, 235, 178, 140, 30672, 235, 133, 157, 30319, 234, 153, 191, 232, 147, 134, 231, 190, 131, 31882, 235, 144, 178, 30882]])
tensor([[ 1, 29871, 31785, 235, 178, 140, 30672, 235, 133, 157, 30319, 234, 153, 191, 232, 147, 134, 231, 190, 131, 31882, 235, 144, 178, 30882, 13, 235, 133, 157, 30319, 234, 153, 191, 30392, 30287, 31893, 31190, 235, 170, 132, 30210, 234, 153, 191, 234, 154, 158, 30214, 30682, 30815, 30392, 31272, 30909, 232, 147, 187, 232, 179, 232, 146, 154, 231, 191, 167, 30330, 232, 145, 142, 31074, 31391, 31149, 31221, 30667, 31570, 31674, 31558, 30210, 30267, 30651, 30557, 30392, 30287, 31959, 30682, 30815, 30417, 31931, 30909, 234, 191, 150, 31201, 235, 133, 157, 30319, 234, 153, 191, 30210, 235, 144, 178, 30383, 13, 29896, 29889, 29871, 233, 141, 154, 233, 179, 170, 30705, 235, 144, 178, 13, 233, 141, 154, 233, 179, 170, 30705, 235, 144, 178, 30682, 30651, 232, 187, 177, 31931, 234, 191, 150, 31201, 235, 133, 157, 30319, 234, 153, 191, 30214, 31570, 30573, 232, 177, 134, 30682, 30651, 232, 138, 233, 184, 233, 141, 233, 179, 170, 30705, 30834, 30210, 233, 178, 158, 232, 158, 30214, 232, 138, 233, 184, 233, 141, 233, 179, 170, 30705, 30834, 30210, 233, 178, 158, 232, 158, 30214, 232, 138, 233, 184, 233, 141, 233, 179, 170, 30705, 30834, 30210, 233, 178, 158, 232, 158, 30214, 232, 138, 233, 184, 233, 141, 233, 179, 170, 30705, 30834, 30210, 233, 178, 158, 232, 158, 30214, 232, 138, 233, 184, 233, 141, 233, 179, 170, 30705, 30834, 30210, 233, 178, 158, 232, 158, 30214, 232, 138, 233, 184, 233, 141, 233, 179, 170, 30705, 30834, 30210, 233, 178, 158, 232, 158, 30214, 232, 138, 233, 184, 233, 141, 233, 179, 170, 30705, 30834, 30210, 233, 178, 158, 232, 158, 30214, 232]], device='cuda:0')
告诉我肚子疼吃什么药?
肚子疼是一种常见的疼痛,可能是由于吸��受伤、压力或其他原因引起的。以下是一些可能有助于缓解肚子疼的药:
1. 抗氧化药
抗氧化药可以帮助缓解肚子疼,因为它可以������氧化物的毛��,������氧化物的毛��,������氧化物的毛��,������氧化物的毛��,������氧化物的毛��,������氧化物的毛��,������氧化物的毛��,�
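One way to read the token dump above is to look at the small IDs between the ordinary vocabulary entries. The sketch below assumes that the LLaMA SentencePiece vocabulary places its 256 byte-fallback tokens at IDs 3 to 258, so that a token ID minus 3 is a raw UTF-8 byte; the helper name `byte_tokens_to_text` is made up for illustration. Under that assumption, `233, 179, 170` from the dump is a complete three-byte character, while `232, 138, 233, 184, 233, 141` is three lead bytes that each lack their final continuation byte. That would suggest the replacement characters come from the generated IDs themselves rather than from the browser or terminal, though it does not say why the model produced them.

```python
# Assumption: byte-fallback tokens <0x00>..<0xFF> sit at IDs 3..258 in the
# LLaMA SentencePiece vocabulary, so byte value = token ID - 3.
BYTE_TOKEN_OFFSET = 3


def byte_tokens_to_text(token_ids):
    """Decode a run of byte-fallback token IDs as UTF-8 (hypothetical helper).

    Invalid or truncated byte sequences become U+FFFD, the replacement
    character that shows up as the garbled marks in the decoded output.
    """
    raw = bytes(token_id - BYTE_TOKEN_OFFSET for token_id in token_ids)
    return raw.decode("utf-8", errors="replace")


# Both ID runs are copied from the second Chinese tensor in the output above.
complete_run = [233, 179, 170]
truncated_run = [232, 138, 233, 184, 233, 141]

print(byte_tokens_to_text(complete_run))          # 氧
print(repr(byte_tokens_to_text(truncated_run)))   # only replacement characters
```

The same check can be applied to any other run of IDs below 259 in the dump to see whether it forms whole characters.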

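As you can see, English output is fine, but Chinese output contains a lot of garbled characters. I am running on a 1080 Ti, and I get the following warning:
lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)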

Could this be caused by insufficient numerical precision?
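To confirm which GPUs trigger that warning, the compute capability can be compared against the 7.5 threshold quoted in the message. This is only a sketch: `supports_fast_int8` and `current_capability` are illustrative names, the fallback value `(6, 1)` is the compute capability of a GTX 1080 Ti, and the check says nothing about whether the slow 8-bit path is responsible for the garbled text.

```python
def supports_fast_int8(capability, minimum=(7, 5)):
    """Return True if (major, minor) meets the threshold named in the warning."""
    return tuple(capability) >= tuple(minimum)


def current_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


# Fall back to the GTX 1080 Ti value (6.1) when no GPU can be queried.
capability = current_capability() or (6, 1)
print(f"compute capability {capability[0]}.{capability[1]}: "
      f"fast 8-bit matmul supported = {supports_fast_int8(capability)}")
```

A direct way to test the precision hypothesis would be to rerun the same prompt without 8-bit loading, memory permitting, and compare the generated token IDs.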

NewEricWang commented 1 year ago

English runs fine for me with no garbled output, but Chinese frequently produces garbled output. The library versions I am using: transformers==4.30.2, tokenizers==0.13.3, sentencepiece==0.1.99

Does anyone know what the cause is? Any pointers would be appreciated. Thanks.

Gzj369 commented 1 year ago

This is very likely a version problem with third-party Python packages. See https://github.com/Facico/Chinese-Vicuna/blob/master/docs/problems.md @NewEricWang
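When comparing an environment against the versions discussed in that document, it helps to print what is actually installed. The following is a minimal sketch using only the standard library; the package list is an assumption based on the imports and versions mentioned in this thread, and `installed_versions` is an illustrative name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list: packages imported by the test script or named in this thread.
PACKAGES = ["transformers", "tokenizers", "sentencepiece",
            "peft", "bitsandbytes", "torch"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Pasting this output into an issue report makes version mismatches easier for others to spot.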

NewEricWang commented 1 year ago

Thanks a lot! @Gzj369
