Environment issues - Githubissues

Excy-an commented 10 months ago

I keep getting errors while setting up the environment. Which versions of CUDA and PyTorch did you use? And can the code be tested on a 2080 GPU?

SAI990323 commented 10 months ago

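We use CUDA 11.7 and PyTorch 2.0.1. As I understand it, there should be no environment gap of this kind between a 2080 and a 3090. If you use a different CUDA version, you only need to install the packages that match that version.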
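When comparing a local setup against the versions mentioned above, it can help to print what is actually installed. The following is a minimal sketch, not part of the repository: the helper name installed_versions is hypothetical, and it reads only package metadata, so it reports the installed torch and peft versions without importing them and says nothing about the CUDA build itself.

```python
from importlib import metadata


def installed_versions(packages=("torch", "peft")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

Posting this output alongside an error report makes version mismatches easier to spot.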

barton-wa commented 10 months ago

In finetune_rec.py, in the compute_metrics function: shouldn't the ROC be computed between the predictions and the labels? Why is it computed between pre[0] and pre[1]?

SAI990323 commented 10 months ago

pre[0] and pre[1] correspond to the return values of the preprocess_logits_for_metrics function. The labels are already handled there, so they do not need to be processed again here.
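The pattern described in this answer can be illustrated with a small, dependency-free sketch. It is not the repository's code: the function bodies, the choice of column 1 as the positive-class score, and the ordering (pre[0] as scores, pre[1] as labels) are assumptions made for illustration, and the AUC is computed with a simple pairwise count rather than a library routine.

```python
def preprocess_logits_for_metrics(logits, labels):
    """Hypothetical stand-in: pair each score with its gold label up front."""
    scores = [row[1] for row in logits]      # assumed: column 1 is the positive class
    gold = [int(label) for label in labels]
    return scores, gold


def roc_auc(gold, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, g in zip(scores, gold) if g == 1]
    neg = [s for s, g in zip(scores, gold) if g == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def compute_metrics(eval_preds):
    # pre is the tuple produced by the preprocessing step, so it already
    # carries the labels; the second element of eval_preds is not needed here.
    pre, _labels = eval_preds
    return {"auc": roc_auc(pre[1], pre[0])}


logits = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 1, 0, 1]
pre = preprocess_logits_for_metrics(logits, labels)
result = compute_metrics((pre, labels))
print(result)
```

Because the preprocessing step returns both the scores and the matching labels, the metric function only has to unpack that tuple.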

Excy-an commented 9 months ago

Hello, running your finetune_rec.py fails with the following error:

  File "/home1/ajx/TALLRec/finetune_rec.py", line 325, in <module>
    fire.Fire(train)
  File "/home1/ajx/anaconda3/envs/Tall/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home1/ajx/anaconda3/envs/Tall/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home1/ajx/anaconda3/envs/Tall/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home1/ajx/TALLRec/finetune_rec.py", line 210, in train
    model.print_trainable_parameters()  # Be more transparent about the % of trainable params.
AttributeError: 'NoneType' object has no attribute 'print_trainable_parameters'

My configuration is:

  base_model: str = "./LlamaN",  # the only required argument
  train_data_path: str = "./data/movie/train.json",
  val_data_path: str = "./data/movie/valid.json",
  output_dir: str = "./out",
  sample: int = -1,

Excy-an commented 9 months ago

output_dir="./out"
base_model="/home1/ajx/TALLRec/LlamaN"
train_data="./data/movie/train.json"
val_data="./data/movie/valid.json"
instruction_model="/home1/ajx/TALLRec/alpaca-lora-7B"

Could you please take a look at what is wrong? LlamaN is the LLaMA model in HF format, and instruction_model is your released weights for the instruction tuning model.

SAI990323 commented 9 months ago

Hello, this looks like an environment problem caused by the peft version.

zzerrrro commented 9 months ago

Hello, I ran into the same problem. Was it resolved in the end, and if so, how?

SAI990323 commented 8 months ago

Hello, has the problem been solved?

97z commented 6 months ago

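This bug occurs because model becomes None before print_trainable_parameters is called; there is a subtle bug in the code. Comment out the assignment and call the function without rebinding model:

# model=set_peft_model_state_dict(model, adapters_weights)
set_peft_model_state_dict(model, adapters_weights)

With this change the error no longer occurs.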
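The failure mode described here is a general Python pitfall: assigning the result of a function that modifies its argument in place and returns nothing replaces the variable with None. The sketch below reproduces it with stand-ins only. TinyModel and load_adapter_in_place are hypothetical names, not peft APIs, and whether set_peft_model_state_dict returns the model or None depends on the installed peft version.

```python
class TinyModel:
    """Hypothetical stand-in for a model wrapped with an adapter."""

    def __init__(self):
        self.weights = {}

    def print_trainable_parameters(self):
        print(f"trainable tensors: {len(self.weights)}")


def load_adapter_in_place(model, adapter_weights):
    """Hypothetical loader: mutates the model and implicitly returns None."""
    model.weights.update(adapter_weights)


model = TinyModel()

# Assigning the return value captures None. Writing `model = ...` here would
# therefore discard the model and trigger the AttributeError from the thread.
returned = load_adapter_in_place(model, {"lora_A": [0.1]})
print(returned is None)

# Calling without rebinding keeps the model usable.
load_adapter_in_place(model, {"lora_B": [0.2]})
model.print_trainable_parameters()
```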

JiangshuoZhao commented 6 months ago

@97z After making this change, is the test accuracy affected? What results did you get?

JiangshuoZhao commented 6 months ago

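I ran into the problem described above with peft 0.3.0:

AttributeError: 'NoneType' object has no attribute 'print_trainable_parameters'

I then changed this line as you suggested, from

model=set_peft_model_state_dict(model, adapters_weights)

to

set_peft_model_state_dict(model, adapters_weights)

After that the accuracy was very poor: training on movie, the test result was only a little over 0.3 on movie and a little over 0.4 on book.

I then switched to a different peft build: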

pip uninstall peft -y
pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08

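This changes the peft version to 0.3.0.dev0, and the code above no longer needs to be modified. Training on movie and testing on movie then gives 0.64.

My conclusion is that the peft version matters a great deal for reproducing the results.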

97z commented 6 months ago

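I had seen other code call set_peft_model_state_dict this way, and while debugging I found that the original form leaves model as None, so I changed it as described above. I had run into a bug earlier, searched the past issues, and had already switched peft to 0.3.0.dev0.

I trained for 200 epochs and found that the model had already overfit, but I tested with that result anyway: movie-movie was about 0.64 and movie-book was 0.6. The best epoch appears to be around 100. Evaluating the epoch-100 model should give better results than my current ones and might reach the numbers reported in the paper, but I have not tested that yet.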

JiangshuoZhao commented 6 months ago

Anything above 0.6 is fine; at least it is not far off. But with the wrong peft version you cannot reach this accuracy.

SAI990323 commented 6 months ago

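Thanks for the discussion and the interest. First, the method itself has little to do with the peft version; the failures to reproduce are indeed caused by a peft version mismatch. The specific reason is that peft adjusted its architecture and functions after an update, so even a function with the same name is not fully interchangeable between the old and new versions. If you use the latest peft, you may need to consult the official documentation and rework the peft-related code and the saving code; repositories that use the newer peft are a useful reference for how to do this.

As I recall, two changes are needed for the newer peft. The first is to delete lines 282-287 of finetune_rec.py. If they are kept, the LoRA file is not saved, and the LoRA used during evaluation will be a randomly initialized one, so it will have no effect. The second is the change to set_peft_model_state_dict, which @97z's answer already covers.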
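Since an unsaved LoRA file silently leads to evaluating an untrained adapter, a cheap sanity check after training is to confirm that the adapter file exists and is not nearly empty. This is a rough sketch under stated assumptions: the helper name adapter_looks_empty, the file name adapter_model.bin, and the 1024-byte threshold are illustrative choices, not values taken from the repository, and newer peft versions may write a different file name.

```python
import os
import tempfile


def adapter_looks_empty(output_dir, filename="adapter_model.bin", min_bytes=1024):
    """Return True if the adapter file is missing or smaller than min_bytes.

    filename and min_bytes are assumptions; adjust them to what your peft
    version writes and to the size you expect for your LoRA configuration.
    """
    path = os.path.join(output_dir, filename)
    if not os.path.isfile(path):
        return True
    return os.path.getsize(path) < min_bytes


if __name__ == "__main__":
    # Demonstration on a throwaway directory rather than a real training run.
    with tempfile.TemporaryDirectory() as demo_dir:
        print(adapter_looks_empty(demo_dir))  # no adapter file has been written yet
```

A file that passes this check is not guaranteed to hold trained weights, but one that fails it is worth investigating before running evaluation.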

SAI990323 / TALLRec

Environment issues #41
