Closed: fine1123 closed this issue 1 year ago
Could you provide your environment and script? It seems that you are using the latest packages, which are not compatible with this repo. We use bitsandbytes==0.37.2, peft==0.3.0, transformers==4.28.0. I have updated the requirements.
Thank you for your prompt reply!
Environment and code: System/hardware: Windows 10, RTX 3090. Details: here
I still have the following questions: 1. After switching to bitsandbytes==0.37.2, peft==0.3.0, transformers==4.28.0, the error I mentioned above still occurs. 2. I cannot download the adapter_model.bin file you provided, so I cannot tell whether the problem lies in my fine-tuned adapter (16429k) or somewhere else.
It appears that there are some mismatches between the Windows and Linux environments, which necessitate code modifications. Unfortunately, I do not have access to a Windows environment with a GPU. Additionally, may I ask whether you are able to run inference with the "evaluate.py" script we provide? I have noticed that you have made some modifications to the code.
Thank you again for your reply! I have also made considerable progress. The reason I had not previously run the code via the script files was that the LoRA weight sizes did not match and an error was raised. From earlier issues, the LoRA weights I downloaded probably have rank 16, so I set --resume_from_checkpoint None.
First, the problem I ran into when running the code via the script files:
After running bash shell/instruct_7B.sh 0 3
and then bash ./shell/evaluate.sh 0 lora-alpaca_movie_3_64
evaluation stops after only 32 iterations, with AUC: 0.3801862641242938.
An interesting observation: before using 'evaluate.sh' I had already noticed the same behaviour, namely an error after completing 32 iterations. I then changed the number of samples in 'test.json': with 30 samples the error appears after 1 iteration, and with 60 samples after 2 iterations. In other words, the error shows up after roughly 1/30 of the samples have been evaluated.
Questions: 1. Is there some connection between these two observations? 2. From earlier issues I saw that this batch_size is not the default value for an RTX 3090 24GB. Could that be related? If so, could you tell me how to modify the batch_size?
Question 1: The error after 32 iterations occurs because there are 1000 samples in total and the batch size is 32; 32 * 32 = 1024 > 1000, so inference finishes exactly at that point. The same reasoning applies to the 30- and 60-sample cases. Question 2: It is not related to the batch size itself. The error should be caused by your modifications to evaluate.py, which make the number of returned inference results not match the number of inputs. You can check the return value of the evaluate function, or use the original code.
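For reference, the iteration counts reported above match simple ceiling division of the sample count by the batch size; a toy check, not code from the repo:

import math

def num_batches(n_samples, batch_size=32):
    # the last, partially filled batch still counts as one iteration
    return math.ceil(n_samples / batch_size)

print(num_batches(1000))  # 32 -> matches the 32it in the log
print(num_batches(30))    # 1
print(num_batches(60))    # 2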
Thank you for your patient explanation! With the code you provided, that error no longer occurs.
The error shown:
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565, and set the legacy attribute accordingly.
Loading checkpoint shards: 100%|██████████| 33/33 [00:08<00:00, 3.68it/s]
0it [00:00, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
32it [1:50:15, 206.72s/it]
32it [00:00, ?it/s]
Traceback (most recent call last):
  File "D:\mhf\TALLRec\evaluate.py", line 237, in <module>
    fire.Fire(main)
  File "D:\Anaconda\envs\alpaca\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\Anaconda\envs\alpaca\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "D:\Anaconda\envs\alpaca\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\mhf\TALLRec\evaluate.py", line 204, in main
    test_data[i]['logits'] = logits[i]
IndexError: list index out of range
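The IndexError is consistent with the maintainer's explanation above: evaluate.py writes the generated results back per test sample, so if a modified evaluate function returns fewer entries than there are samples, the assignment at evaluate.py line 204 runs off the end of the list. A hedged sketch of the failing pattern with a sanity check (the variable names follow the traceback; the wrapping function is illustrative, not the repo's code):

def write_back(test_data, logits):
    # test_data: samples loaded from test.json; logits: per-sample outputs of the evaluate step
    if len(logits) != len(test_data):
        raise ValueError(f"evaluate returned {len(logits)} results for {len(test_data)} samples")
    for i in range(len(test_data)):
        test_data[i]['logits'] = logits[i]  # the line that raises IndexError when logits is shorter than test_data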
When running finetune_rec.py:
base_model: str = "decapoda-research/llama-7b-hf",  # the only required argument
train_data_path: str = "data/movie/train.json",
val_data_path: str = "data/movie/valid.json",
output_dir: str = "./lora-alpaca_movie_64",
sample: int = 64,
seed: int = 3,
I also added the line below; I found that without it, the run with 64 samples lasts only about three minutes:
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
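For context, a config like this is normally handed to the model loader; a minimal sketch assuming the base model is loaded in 8-bit as in alpaca-lora style scripts (whether from_pretrained accepts quantization_config this way depends on the transformers version):

from transformers import LlamaForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # offload fp32 modules that do not fit on the GPU to the CPU
)
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",        # base_model from the arguments above
    quantization_config=quantization_config,
    device_map="auto",                      # let accelerate place layers across GPU/CPU
)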
And I commented out:
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))
The training result:
{'train_runtime': 1319.9201, 'train_samples_per_second': 0.145, 'train_steps_per_second': 0.002, 'train_loss': 0.9232605298360189, 'epoch': 3.0}
When running evaluate.py:
base_model: str = "decapoda-research/llama-7b-hf",
lora_weights: str = "lora-alpaca_movie_64_2",
test_data_path: str = "data/movie/test.json",
result_json_data: str = "temp.json",
train_sce = 'movie'
test_sce = 'movie'
model_name = 'lora-alpaca_movie_64_2'
seed = 3
sample = 64
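Since evaluate.py dispatches main() through fire (as the traceback shows), these defaults can also be overridden from the command line, in the same spirit as the shell scripts above; an illustrative direct invocation with the values listed here:

python evaluate.py \
    --base_model "decapoda-research/llama-7b-hf" \
    --lora_weights "lora-alpaca_movie_64_2" \
    --test_data_path "data/movie/test.json" \
    --result_json_data "temp.json"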
I hope the author can answer: 1. What is the cause of this error? 2. With a train_loss of 0.923, will the AUC be very poor? Thank you!