yinzhiqiangluvlzx opened this issue 3 years ago
Training command:
python3 main.py train --dataset=Patio_Lawn_and_Garden_data --model=DAML --num_fea=1 --output=fm

Error:
euclidean = (user_local_fea - item_local_fea.permute(0, 1, 3, 2)).pow(2).sum(1).sqrt()
RuntimeError: CUDA out of memory. Tried to allocate 5.94 GiB (GPU 0; 10.76 GiB total capacity; 6.17 GiB already allocated; 3.82 GiB free; 6.18 GiB reserved in total by PyTorch)

Please tell me where the error is.
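The failing line materializes a full pairwise-distance tensor, so its peak memory grows with the batch size and review lengths; lowering --batch_size is the simplest fix. As an alternative, here is a hedged sketch of computing the same distance in batch chunks (the (B, C, L, 1) shapes and the function name are my assumptions, not taken from the DAML code):

```python
import torch

def pairwise_euclidean(user_local_fea, item_local_fea, batch_chunk=4):
    """DAML-style euclidean distance, computed in batch chunks to cap peak memory.

    Assumed shapes (hypothetical): user_local_fea is (B, C, L_u, 1) and
    item_local_fea is (B, C, L_i, 1); the result is (B, L_u, L_i).
    """
    out = []
    for u, i in zip(user_local_fea.split(batch_chunk),
                    item_local_fea.split(batch_chunk)):
        # Same computation as the failing line, but on a slice of the batch,
        # so only a batch_chunk-sized intermediate tensor is alive at a time.
        d = (u - i.permute(0, 1, 3, 2)).pow(2).sum(1).sqrt()
        out.append(d)
    return torch.cat(out, dim=0)
```

This trades a little speed for a bounded intermediate tensor; the concatenated result is identical to the one-shot computation.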
@yinzhiqiangluvlzx Hi, I just tested it and there is no problem. My training command:
python3 main.py train --model=DAML --num_fea=2 --batch_size=16
Test command:
python3 main.py test --model=DAML --num_fea=2 --batch_size=16 --pth_path='./checkpoints/DAML_Digital_Music_data_default.pth'
Judging from the error message, some parameters on your side were not changed, so training and testing are inconsistent.
Thanks for the reply. After fixing batch_size it runs. A single 2080 Ti (11 GB) is too slow, so I want to train on three GPUs together, but then saving the model fails. How should I solve that? I have also raised this question in another DAML issue.
@FKCHAN As mentioned in that issue, saving a multi-GPU model differs slightly from the single-GPU case; see https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-torch-nn-dataparallel-models
The later plan is to wrap the model with pytorch-lightning for better and simpler parallel-training support, expected before Spring Festival.
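Following the linked tutorial, nn.DataParallel wraps the model and prefixes every state_dict key with "module.", which is why a multi-GPU checkpoint fails to load into a single-GPU model. A minimal sketch of the two standard workarounds (the helper name strip_module_prefix is mine, not from the repo):

```python
# Option 1, at save time: save the inner module's weights so the checkpoint
# carries no "module." prefix (as the PyTorch tutorial recommends):
#   torch.save(model.module.state_dict(), path)

# Option 2, at load time: strip the prefix from an already-saved checkpoint.
def strip_module_prefix(state_dict):
    """Remove the 'module.' prefix that nn.DataParallel adds to every key,
    so the checkpoint loads into a plain (single-GPU) model."""
    return {k[len('module.'):] if k.startswith('module.') else k: v
            for k, v in state_dict.items()}
```

Usage would look like model.load_state_dict(strip_module_prefix(torch.load(path))).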
OK, I will keep training and testing as-is for now. Looking forward to it. Keep up the good work, fighting!
The DAML model trains fine, but loading it for testing raises an error:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "G:\yzq\pycharm\PyCharm 2019.1.2\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "G:\yzq\pycharm\PyCharm 2019.1.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "G:/yzq/Rec/Neu-Review-Rec/main.py", line 210, in <module>
    fire.Fire()
  File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 468, in _Fire
    target=component.__name__)
  File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "G:/yzq/Rec/Neu-Review-Rec/main.py", line 155, in test
    model.load(opt.pth_path)
  File "G:\yzq\Rec\Neu-Review-Rec\framework\models.py", line 49, in load
    self.load_state_dict(torch.load(path), False)
  File "G:\yzq\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model:
    size mismatch for predict_net.model.fm_V: copying a param with shape torch.Size([16, 128]) from checkpoint, the shape in current model is torch.Size([128, 10]).
When running the model I only changed num_fea=2, and training took two days. I made no changes for testing, yet I get this error; I searched around and could not find anything. Has the author run into this before? Thank you!