Closed: fine1123 closed this issue 1 year ago
Could you provide your environment and script? It seems that you are using the latest packages, which are not compatible with this repo. We use bitsandbytes==0.37.2, peft==0.3.0, transformers==4.28.0. I have updated the requirements.
Thank you for your prompt reply!
Environment and code: System/hardware: Windows 10, RTX 3090. Details: here
I still have the following questions: 1. After switching to bitsandbytes==0.37.2, peft==0.3.0, transformers==4.28.0, the error I mentioned above still occurs. 2. I cannot download the adapter_model.bin file you provided, so I cannot tell whether the problem lies in my fine-tuned adapter (16429k) or somewhere else.
It appears that there are some mismatches between the Windows and Linux environments, which necessitate code modifications. Unfortunately, I do not have access to a Windows environment with a GPU. Additionally, may I ask whether you are able to run inference with the "evaluate.py" script we provide? I have noticed that you have made some modifications to the code.
Thank you again for your reply! I have also made considerable progress. The reason I had not previously run the code via the script files was that the LoRA weight sizes did not match and an error was raised. From earlier issues, the LoRA weights I downloaded probably have rank 16, so I set --resume_from_checkpoint None.
First, the problem I ran into when running the code via the script files:
After running bash shell/instruct_7B.sh 0 3
and then bash ./shell/evaluate.sh 0 lora-alpaca_movie_3_64
evaluation stops after only 32 iterations, with AUC: 0.3801862641242938.
An interesting observation: before using 'evaluate.sh' I had already noticed the same behaviour, namely an error after completing 32 iterations. I then changed the number of samples in 'test.json': with 30 samples the error appears after 1 iteration, and with 60 samples after 2 iterations. In other words, the error shows up after roughly 1/30 of the samples have been evaluated.
Questions: 1. Is there some connection between these two observations? 2. From earlier issues I saw that this batch_size is not the default value for an RTX 3090 24GB. Could that be related? If so, could you tell me how to modify the batch_size?
Question 1: The error after 32 iterations occurs because there are 1000 samples in total and the batch size is 32; 32 * 32 = 1024 > 1000, so inference finishes exactly at that point. The same reasoning applies to the 30- and 60-sample cases. Question 2: It is not related to the batch size itself. The error should be caused by your modifications to evaluate.py, which make the number of returned inference results not match the number of inputs. You can check the return value of the evaluate function, or use the original code.
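For reference, the iteration counts reported above match simple ceiling division of the sample count by the batch size; a toy check, not code from the repo:

import math

def num_batches(n_samples, batch_size=32):
    # the last, partially filled batch still counts as one iteration
    return math.ceil(n_samples / batch_size)

print(num_batches(1000))  # 32 -> matches the 32it in the log
print(num_batches(30))    # 1
print(num_batches(60))    # 2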
Thank you for your patient explanation! With the code you provided, that error no longer occurs.
The error shown:
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565, and set the legacy attribute accordingly.
Loading checkpoint shards: 100%|██████████| 33/33 [00:08<00:00, 3.68it/s]
0it [00:00, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
32it [1:50:15, 206.72s/it]
32it [00:00, ?it/s]
Traceback (most recent call last):
  File "D:\mhf\TALLRec\evaluate.py", line 237, in <module>
    fire.Fire(main)
  File "D:\Anaconda\envs\alpaca\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\Anaconda\envs\alpaca\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "D:\Anaconda\envs\alpaca\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\mhf\TALLRec\evaluate.py", line 204, in main
    test_data[i]['logits'] = logits[i]
IndexError: list index out of range
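The IndexError is consistent with the maintainer's explanation above: evaluate.py writes the generated results back per test sample, so if a modified evaluate function returns fewer entries than there are samples, the assignment at evaluate.py line 204 runs off the end of the list. A hedged sketch of the failing pattern with a sanity check (the variable names follow the traceback; the wrapping function is illustrative, not the repo's code):

def write_back(test_data, logits):
    # test_data: samples loaded from test.json; logits: per-sample outputs of the evaluate step
    if len(logits) != len(test_data):
        raise ValueError(f"evaluate returned {len(logits)} results for {len(test_data)} samples")
    for i in range(len(test_data)):
        test_data[i]['logits'] = logits[i]  # the line that raises IndexError when logits is shorter than test_data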
When running finetune_rec.py:
base_model: str = "decapoda-research/llama-7b-hf",  # the only required argument
train_data_path: str = "data/movie/train.json",
val_data_path: str = "data/movie/valid.json",
output_dir: str = "./lora-alpaca_movie_64",
sample: int = 64,
seed: int = 3,
I also added the line below; I found that without it, the run with 64 samples lasts only about three minutes:
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
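For context, a config like this is normally handed to the model loader; a minimal sketch assuming the base model is loaded in 8-bit as in alpaca-lora style scripts (whether from_pretrained accepts quantization_config this way depends on the transformers version):

from transformers import LlamaForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # offload fp32 modules that do not fit on the GPU to the CPU
)
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",        # base_model from the arguments above
    quantization_config=quantization_config,
    device_map="auto",                      # let accelerate place layers across GPU/CPU
)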
And I commented out:
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))
The training result:
{'train_runtime': 1319.9201, 'train_samples_per_second': 0.145, 'train_steps_per_second': 0.002, 'train_loss': 0.9232605298360189, 'epoch': 3.0}
When running evaluate.py:
base_model: str = "decapoda-research/llama-7b-hf",
lora_weights: str = "lora-alpaca_movie_64_2",
test_data_path: str = "data/movie/test.json",
result_json_data: str = "temp.json",
train_sce = 'movie'
test_sce = 'movie'
model_name = 'lora-alpaca_movie_64_2'
seed = 3
sample = 64
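Since evaluate.py dispatches main() through fire (as the traceback shows), these defaults can also be overridden from the command line, in the same spirit as the shell scripts above; an illustrative direct invocation with the values listed here:

python evaluate.py \
    --base_model "decapoda-research/llama-7b-hf" \
    --lora_weights "lora-alpaca_movie_64_2" \
    --test_data_path "data/movie/test.json" \
    --result_json_data "temp.json"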
I hope the author can answer: 1. What is the cause of this error? 2. With a train_loss of 0.923, will the AUC be very poor? Thank you!