ShomyLiu / Neu-Review-Rec

A Toolkit for Neural Review-based Recommendation models with Pytorch.
http://shomy.top/2019/12/31/neu-review-rec/

Question about the DAML model #22

Open yinzhiqiangluvlzx opened 3 years ago

yinzhiqiangluvlzx commented 3 years ago

The DAML model trains without problems, but loading the checkpoint at test time raises an error:

Traceback (most recent call last):
  File "", line 1, in
  File "G:\yzq\pycharm\PyCharm 2019.1.2\helpers\pydev_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
  File "G:\yzq\pycharm\PyCharm 2019.1.2\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "G:/yzq/Rec/Neu-Review-Rec/main.py", line 210, in
    fire.Fire()
  File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 468, in _Fire
    target=component.name)
  File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "G:/yzq/Rec/Neu-Review-Rec/main.py", line 155, in test
    model.load(opt.pth_path)
  File "G:\yzq\Rec\Neu-Review-Rec\framework\models.py", line 49, in load
    self.load_state_dict(torch.load(path),False)
  File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model: size mismatch for predict_net.model.fm_V: copying a param with shape torch.Size([16, 128]) from checkpoint, the shape in current model is torch.Size([128, 10]).

When running the model I only changed fea=2. Training took two days to finish, and I changed nothing for testing. I searched around for this error and found nothing. Has the author run into this before? Thank you!

FKCHAN commented 3 years ago

--train : python3 main.py train --dataset=Patio_Lawn_and_Garden_data --model=DAML --num_fea=1 --output=fm

--error
euclidean = (user_local_fea - item_local_fea.permute(0, 1, 3, 2)).pow(2).sum(1).sqrt()
RuntimeError: CUDA out of memory. Tried to allocate 5.94 GiB (GPU 0; 10.76 GiB total capacity; 6.17 GiB already allocated; 3.82 GiB free; 6.18 GiB reserved in total by PyTorch)

Please tell me where the error is.
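
For context, the failing line subtracts two tensors that broadcast against each other, so the intermediate result can be much larger than either input, and its size grows linearly with the batch size (later in this thread, changing batch_size made the run fit). A rough sketch of that arithmetic follows; the shapes used here are hypothetical and are not taken from the repository, and float32 (4 bytes per element) is assumed.

```python
from itertools import zip_longest
from math import prod


def broadcast_shape(a, b):
    """Shape of `x - y` under NumPy/PyTorch-style broadcasting rules."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} do not broadcast")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


def tensor_bytes(shape, itemsize=4):
    """Bytes needed for a dense tensor; itemsize=4 assumes float32."""
    return prod(shape) * itemsize


# Hypothetical shapes: (batch, channels, length, 1) minus (batch, channels, 1, length).
for batch in (128, 16):
    diff = broadcast_shape((batch, 100, 200, 1), (batch, 100, 1, 200))
    print(batch, diff, f"{tensor_bytes(diff) / 2**30:.2f} GiB")
```

Under these assumed shapes, cutting the batch from 128 to 16 shrinks the intermediate tensor by a factor of eight, which is why a smaller `--batch_size` is the usual first thing to try for this kind of error.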

ShomyLiu commented 3 years ago

@yinzhiqiangluvlzx Hi, I just tested this and it works fine. My training command:

python3 main.py train --model=DAML --num_fea=2 --batch_size=16

The test command:

python3 main.py test --model=DAML --num_fea=2 --batch_size=16 --pth_path='./checkpoints/DAML_Digital_Music_data_default.pth'

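Judging from the error message, some arguments on your side were probably not updated, so the training and test configurations do not match.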
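
One way to find which setting differs is to compare parameter shapes in the checkpoint with those of the freshly built model before calling `load_state_dict`. Note that passing `False` for `strict`, as `framework/models.py` does, only tolerates missing or unexpected keys; a size mismatch still raises. The helper below is a minimal sketch with a hypothetical name, working on plain `{name: shape}` dictionaries so it runs without a GPU; the two shapes in the demo are the ones from the error message above.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters
    that exist in both mappings but have different shapes."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        if name in model_shapes and tuple(ckpt_shape) != tuple(model_shapes[name]):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shapes[name]))
    return mismatches


# With PyTorch available, the two mappings could be built like this:
#   ckpt = {k: tuple(v.shape) for k, v in torch.load(path).items()}
#   cur  = {k: tuple(v.shape) for k, v in model.state_dict().items()}
ckpt = {"predict_net.model.fm_V": (16, 128)}
cur = {"predict_net.model.fm_V": (128, 10)}

for name, (saved, current) in find_shape_mismatches(ckpt, cur).items():
    print(f"{name}: checkpoint {saved} vs current model {current}")
```

Every parameter this reports points at a command-line option or config value that had a different value when the checkpoint was written.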

FKCHAN commented 3 years ago

Thanks for the reply. After changing batch_size it runs now. A single 2080 Ti (11 GB) is too slow, so I would like to train on three cards together, but saving the model then fails. How should I solve that part? I have also raised this question in the other DAML issue.

ShomyLiu commented 3 years ago

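@FKCHAN As mentioned in that issue, saving a multi-GPU model differs slightly from saving a single-GPU one: https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-torch-nn-dataparallel-models

The plan is to wrap the models with pytorch-lightning later, for better and simpler support of parallel training. I expect to do this before the Spring Festival.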
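
As background to the linked tutorial: it recommends saving `model.module.state_dict()` when the model is wrapped in `torch.nn.DataParallel`, because the wrapper's own state dict prefixes every key with `module.`. For a checkpoint that was already saved from the wrapper, a common workaround is to strip that prefix before loading it into a single-GPU model. The helper below is a minimal sketch of that idea; its name is hypothetical and it is not part of this repository.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from
    each key, leaving keys without the prefix unchanged."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Plain values stand in for tensors so the sketch runs without PyTorch.
wrapped = {"module.predict_net.model.fm_V": "tensor-0", "module.bias": "tensor-1"}
print(strip_module_prefix(wrapped))
```

With PyTorch, the result could then be passed to `load_state_dict` on the unwrapped model.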

FKCHAN commented 3 years ago

OK, then for now I will test as I train. Looking forward to it. Keep it up, fighting!