Out of memory even with batch_size=1?

WongKinYiu / yolor

implementation of paper - You Only Learn One Representation: Unified Network for Multiple Tasks (https://arxiv.org/abs/2105.04206)

GNU General Public License v3.0

1.98k stars 524 forks

Out of memory even with batch_size=1? #88

Closed crazybill-first closed 2 years ago

crazybill-first commented 2 years ago

Hello, I am getting a CUDA out of memory error when training yolor_p6 on a custom dataset, and it still occurs with batch_size=1, which I find strange.

Train command: python train.py --batch-size 1 --img 416 416 --data person.yaml --cfg cfg/yolor_p6.cfg --weights '' --device 2 --name yolor_p6 --hyp hyp.scratch.416.yaml --epochs 300

[Images: log, result]

Did I get one of the steps wrong?

WongKinYiu commented 2 years ago

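That is indeed strange: 8 GB free, 2 GB requested, and it is still not enough. Check whether any other programs are running on the GPU, and whether the GPU is configured so that only one process can use it exclusively.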
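Both checks can be scripted. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_csv`, `query_gpus`, and `report` are made up for illustration and are not part of yolor. It reads used and free memory plus the compute mode for each GPU, and flags any GPU whose compute mode is not `Default`.

```python
import subprocess

def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        index, used, free, mode = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "used_mib": int(used),
            "free_mib": int(free),
            "compute_mode": mode,
        })
    return gpus

def query_gpus():
    """Run nvidia-smi and return one dict per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.free,compute_mode",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

def report(gpus):
    """Return one summary line per GPU, flagging non-Default compute modes."""
    lines = []
    for gpu in gpus:
        line = (f"GPU {gpu['index']}: {gpu['used_mib']} MiB used, "
                f"{gpu['free_mib']} MiB free, compute mode {gpu['compute_mode']}")
        if gpu["compute_mode"] != "Default":
            line += " (not Default: may restrict which processes can use it)"
        lines.append(line)
    return lines

# Usage on a machine with an NVIDIA driver:
#   print("\n".join(report(query_gpus())))
```

To see which processes hold memory on a GPU, `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv` lists them.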

crazybill-first commented 2 years ago

Uh... I could not find any related process, but after rebooting the machine I was able to train -_- !!!. Also, does the Multiple GPU training command in the README work on Windows?

crazybill-first commented 2 years ago

I ask because I tried multi-GPU training and it failed because of NCCL. Does this problem only occur on Windows?

WongKinYiu commented 2 years ago

https://github.com/WongKinYiu/yolor/blob/main/train.py#L518

You can try changing nccl to gloo.
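As a rough illustration of that change, assuming the linked line is a call to `torch.distributed.init_process_group` (not verified here), the backend can be picked at run time instead of being hard-coded. `choose_backend` and `init_distributed` are hypothetical helper names, not functions from train.py.

```python
import platform

def choose_backend(system_name, nccl_available):
    """Return 'nccl' when it is usable, otherwise fall back to 'gloo'."""
    if system_name == "Windows" or not nccl_available:
        return "gloo"
    return "nccl"

def init_distributed():
    """Initialise the default process group with the chosen backend.

    Assumes the script was started by a launcher that sets the
    environment variables required by init_method='env://'.
    """
    # Imported lazily so choose_backend() can be used without PyTorch.
    import torch.distributed as dist

    backend = choose_backend(platform.system(), dist.is_nccl_available())
    dist.init_process_group(backend=backend, init_method="env://")
    return backend
```

With this sketch a Windows machine always gets gloo, while a Linux build of PyTorch with NCCL support keeps nccl.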

crazybill-first commented 2 years ago

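Sorry, I forgot to reply. Thank you for the suggestion; after switching to gloo, multi-GPU training works. However, another problem has come up: I used 4 GPUs, but the GPU status shows that only two are in use. I am reading the code to find the cause. I have started training a pedestrian detection model and will post the results once training finishes.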
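The cause was not reported in this thread. One thing that may be worth ruling out, assuming the job is started with `torch.distributed.launch` using one process per GPU, is a mismatch between `--nproc_per_node` and the `--device` list passed to train.py. The sketch below only compares the two values; `parse_device_arg` and `check_launch` are hypothetical helpers, not part of yolor.

```python
def parse_device_arg(device_arg):
    """Split a --device value such as '0,1,2,3' into a list of GPU indices."""
    return [int(part) for part in device_arg.split(",") if part.strip()]

def check_launch(nproc_per_node, device_arg):
    """Report whether the process count matches the number of listed GPUs."""
    devices = parse_device_arg(device_arg)
    if nproc_per_node != len(devices):
        return (f"mismatch: {nproc_per_node} processes "
                f"for {len(devices)} listed GPUs {devices}")
    return f"ok: {nproc_per_node} processes, one per GPU {devices}"

# Example: four GPUs listed but only two processes launched.
print(check_launch(2, "0,1,2,3"))
```

If the two values already agree, the cause lies elsewhere and this check does not help.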

crazybill-first commented 2 years ago

Sorry, for certain reasons I am unable to share my results, so I will close this issue.