DAML模型疑问 - Githubissues

yinzhiqiangluvlzx commented 3 years ago

DAML模型训练没问题，测试加载时候报错： raceback (most recent call last): File "", line 1, in File "G:\yzq\pycharm\PyCharm 2019.1.2\helpers\pydev_pydev_bundle\pydev_umd.py", line 197, in runfile pydev_imports.execfile(filename, global_vars, local_vars) # execute the script File "G:\yzq\pycharm\PyCharm 2019.1.2\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile exec(compile(contents+"\n", file, 'exec'), glob, loc) File "G:/yzq/Rec/Neu-Review-Rec/main.py", line 210, in fire.Fire() File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 138, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 468, in _Fire target=component.name) File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "G:/yzq/Rec/Neu-Review-Rec/main.py", line 155, in test model.load(opt.pth_path) File "G:\yzq\Rec\Neu-Review-Rec\framework\models.py", line 49, in load self.load_state_dict(torch.load(path),False) File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1052, in load_state_dict self.class.name, "\n\t".join(error_msgs))) RuntimeError: Error(s) in loading state_dict for Model: size mismatch for predict_net.model.fm_V: copying a param with shape torch.Size([16, 128]) from checkpoint, the shape in current model is torch.Size([128, 10]). 跑模型时候仅仅是修改了fea=2，跑了2天才训练好，测试时候也没做修改，报这个错搜了一圈也没找到，想问下作者之前有遇到过嘛，谢谢您啦！

FKCHAN commented 3 years ago

--train ： python3 main.py train --dataset=Patio_Lawn_and_Garden_data --model=DAML --num_fea=1 --output=fm

--error
euclidean = (user_local_fea - item_local_fea.permute(0, 1, 3, 2)).pow(2).sum(1).sqrt() RuntimeError: CUDA out of memory. Tried to allocate 5.94 GiB (GPU 0; 10.76 GiB total capacity; 6.17 GiB already allocated; 3.82 GiB free; 6.18 GiB reserved in total by PyTorch)

Please tell me where the error is.

ShomyLiu commented 3 years ago

@yinzhiqiangluvlzx 你好，我刚刚测试下，没有问题；我的训练代码：

python3 main.py train --model=DAML --num_fea=2 --batch_size=16

测试脚本为：

python3 main.py test --model=DAML --num_fea=2 --batch_size=16 --pth_path='./checkpoints/DAML_Digital_Music_data_defau
lt.pth'

看报错信息应该是你那边一些参数没有修改，导致训练和测试不一致。

FKCHAN commented 3 years ago

恩，谢谢大佬回信，我改好batch_size可以运行了 2080ti 11g 运行太慢了，想用三块一起跑，但是模型保存出错，这部分应该怎么解决呢？我把这个问题提到另一个DAML的issues里了在2020年12月23日 14:33，HT Liunotifications@github.com 写道：

@yinzhiqiangluvlzx 你好，我刚刚测试下，没有问题；我的训练代码：

python3 main.py train --model=DAML --num_fea=2 --batch_size=16

测试脚本为：

python3 main.py test --model=DAML --num_fea=2 --batch_size=16 --pth_path='./checkpoints/DAML_Digital_Music_data_defau

lt.pth'

看报错信息应该是你那边一些参数没有修改，导致训练和测试不一致。

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

ShomyLiu commented 3 years ago

@FKCHAN 在那个issue里面已经提到，多卡模型的save与单卡有点不同， https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-torch-nn-dataparallel-models

后期的计划，用pytorch-lightning 包装下模型，更好更简单的支持并行训练。预计春节前做。

FKCHAN commented 3 years ago

好的，那我就先一边训练一遍测试了，期待中，大佬加油，fighting！

在2020年12月23日 14:41，HT Liunotifications@github.com 写道：

@FKCHAN 在那个issue里面已经提到，多卡模型的save与单卡有点不同， https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-torch-nn-dataparallel-models

后期的计划，用pytorch-lightning 包装下模型，更好更简单的支持并行训练。预计春节前做。

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

ShomyLiu / Neu-Review-Rec

DAML模型疑问 #22