Closed jay25208 closed 6 months ago
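That doesn't look right. Did you load the backbone's ImageNet-pretrained weights? Training should essentially converge by epoch 21. Before open-sourcing I trained many times, and every run reached about 0.314, quite consistently.
I used 4 A5000 GPUs with a batch size of 8. Training is mostly done after 20-odd epochs, which takes roughly 20 to 30 hours.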
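Looking at the log, with 4-GPU training, GPU 0 seems to hit an anomaly when loading the model, which is strange. While running, memory usage is 19779/23040 MB (used/total). Have you run into anything similar?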
This will evaluate the model every eval_freq 1000 steps and save best models for individual eval metrics.
== Use GPU: 2 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 2
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 1 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 1
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 3 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 3
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 0 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
The model and loaded state dict do not match exactly
unexpected key in source state_dict: norm.weight, norm.bias, head.weight, head.bias, layers.0.blocks.1.attn_mask, layers.1.blocks.1.attn_mask, layers.2.blocks.1.attn_mask, layers.2.blocks.3.attn_mask, layers.2.blocks.5.attn_mask, layers.2.blocks.7.attn_mask, layers.2.blocks.9.attn_mask, layers.2.blocks.11.attn_mask, layers.2.blocks.13.attn_mask, layers.2.blocks.15.attn_mask, layers.2.blocks.17.attn_mask
missing keys in source state_dict: norm0.weight, norm0.bias, norm1.weight, norm1.bias, norm2.weight, norm2.bias, norm3.weight, norm3.bias
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 0
== Initial variables' sum: -58328.283, avg: -123.577
[epoch][s/s_per_e/gs]: [0][0/9064/0], lr: 0.000020000000, loss: 10.936847686768
The model and loaded state dict do not match exactly
unexpected key in source state_dict: norm.weight, norm.bias, head.weight, head.bias, layers.0.blocks.1.attn_mask, layers.1.blocks.1.attn_mask, layers.2.blocks.1.attn_mask, layers.2.blocks.3.attn_mask, layers.2.blocks.5.attn_mask, layers.2.blocks.7.attn_mask, layers.2.blocks.9.attn_mask, layers.2.blocks.11.attn_mask, layers.2.blocks.13.attn_mask, layers.2.blocks.15.attn_mask, layers.2.blocks.17.attn_mask
missing keys in source state_dict: norm0.weight, norm0.bias, norm1.weight, norm1.bias, norm2.weight, norm2.bias, norm3.weight, norm3.bias
This warning can be ignored.
Does the P40 have 24 GB of memory? If so, running 2 images per GPU should be no problem.
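On the state-dict warning quoted above: a mismatch like this is typical when a classification checkpoint is loaded into a backbone used for dense prediction. The checkpoint carries classification-only entries (`norm`, `head`, `attn_mask`), while the backbone appears to add per-stage `norm0` to `norm3` layers that the checkpoint does not contain. The following is a minimal sketch of how such "missing" and "unexpected" lists are derived. It is not the repository's loader; the function name and the shortened key sets are hypothetical.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, sorted for stable output."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    # Keys the model expects but the checkpoint does not provide.
    missing = sorted(model_keys - checkpoint_keys)
    # Keys the checkpoint provides but the model has no slot for.
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical, heavily shortened key sets modeled on the log above.
model_keys = ["patch_embed.proj.weight", "norm0.weight", "norm0.bias"]
checkpoint_keys = ["patch_embed.proj.weight", "norm.weight", "head.weight"]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

In PyTorch itself, `load_state_dict(..., strict=False)` reports the same two lists as `missing_keys` and `unexpected_keys`. Whether the mismatch is harmless depends on the listed keys belonging only to layers that are meant to be trained from scratch or discarded, as appears to be the case here.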
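Yes, 23040 MB. I tried 2 images per GPU and it ran out of memory. With 1 image, usage is currently 19779 MB. If that warning can be ignored, then every GPU loaded the model successfully and the performance is simply not good enough. I'll try again later on different GPUs. Thanks!
In the config txt, --batch_size 4 is the total number of images across all GPUs; from the code, each GPU gets batch_size / GPU_num. So for 4 GPUs with 2 images each, I should set --batch_size 8, right?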
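You're welcome. Did you change any settings? Getting to around 0.314 should be easy. When I ran 2 images per GPU, memory usage was probably under 20 GB, if I remember correctly.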
Right.
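The batch-size convention confirmed above (`--batch_size` is the total across GPUs, and each GPU receives `batch_size / GPU_num`) can be sketched as follows. This is an illustration only: the helper names are hypothetical, and raising an error on a non-divisible total is a choice made here, not necessarily what the repository does.

```python
def per_gpu_batch_size(total_batch_size, num_gpus):
    """Images each GPU processes per step, given the global --batch_size."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        # Assumption: reject uneven splits rather than silently truncating.
        raise ValueError("total batch size must be divisible by num_gpus")
    return total_batch_size // num_gpus


def total_batch_size(images_per_gpu, num_gpus):
    """Value to pass as --batch_size for a desired per-GPU image count."""
    return images_per_gpu * num_gpus


print(per_gpu_batch_size(4, 4))  # default config on 4 GPUs
print(per_gpu_batch_size(8, 4))  # the setting discussed above
print(total_batch_size(2, 4))    # 2 images per GPU on 4 GPUs
```

With the default `--batch_size 4` on 4 GPUs, each GPU sees a single image per step, which matches the 1-image-per-GPU run described in this thread.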
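I did not change the settings. The only difference is that I downloaded the sync data from NeWCRFs rather than from BTS (I cannot download from Google Drive), but since the data lists can be read, that shouldn't be a problem. I'll retrain on different GPUs.
Let me know whenever there is progress. Thanks.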
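After switching GPUs I essentially reproduced the performance. I moved to 32 GB cards, which fit at most 4 images per GPU and use about 28 GB; training on 4 GPUs is estimated to take about 100 h. The best model appears at around epoch 20, with an RMSE of about 0.317. I am now at epoch 35 with 0.3221, and there is some fluctuation. The paper says to train for 20 epochs, so why does the config file in the repository specify 50? If they are not needed, I will train for 20 epochs to save time.
With the previous cards the performance was not reproduced. Each GPU had a batch size of only 1 image, and I am not sure whether that was the cause.
At first glance, some of the code seems related to URCDC-Depth; for example, there is focal-length handling that does not appear to be used. I will study your code more closely later and hope to learn from you.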
Training essentially converges at 20 epochs; you can set it to 25 or 30. Feel free to get in touch.
My WeChat: shaoshuweifighting
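Since the best RMSE reported in this thread appears around epoch 20 and later epochs only fluctuate, one way to avoid training all 50 configured epochs is to track the best evaluation result and stop after a fixed number of evaluations without improvement. The sketch below is a hypothetical helper, not part of the repository. In the sample history, only the values at epochs 20 and 35 come from the thread; the rest are made up for illustration.

```python
class BestRMSETracker:
    """Track the lowest RMSE seen and count evaluations since it occurred."""

    def __init__(self, patience):
        self.patience = patience
        self.best_rmse = float("inf")
        self.best_epoch = None
        self.evals_since_best = 0

    def update(self, epoch, rmse):
        """Record one evaluation; return True if it is a new best."""
        if rmse < self.best_rmse:
            self.best_rmse, self.best_epoch = rmse, epoch
            self.evals_since_best = 0
            return True
        self.evals_since_best += 1
        return False

    def should_stop(self):
        """True once `patience` evaluations have passed without a new best."""
        return self.evals_since_best >= self.patience


# (epoch, RMSE) pairs; epochs 18, 25, and 30 are invented placeholders.
history = [(18, 0.325), (20, 0.317), (25, 0.320), (30, 0.319), (35, 0.3221)]

tracker = BestRMSETracker(patience=3)
stopped_at = None
for epoch, rmse in history:
    tracker.update(epoch, rmse)
    if tracker.should_stop():
        stopped_at = epoch
        break

print(tracker.best_epoch, tracker.best_rmse, stopped_at)
```

A patience value that is too small could stop on ordinary fluctuation, so it would need tuning against the evaluation frequency actually used.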
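Thanks to the author for this work! I am reproducing the NYUv2 training with Swin-L on 4 P40 GPUs. Each GPU can only fit a batch of 1 image, and the estimated training time is 145 h. Is that training speed normal? I have trained 21 epochs so far, and the eval RMSE on NYUv2 is 0.338, with considerable fluctuation along the way, up to about 0.38. That is still well short of the 0.314 target. Does the number of training epochs matter much, and how stable is training? Thanks!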
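As a rough check on the 145 h estimate, training time scales linearly with the number of steps. The sketch below rests on two assumptions that the thread does not confirm: that `s_per_e` in the log (9064) is the number of steps per epoch, and that the 145 h figure refers to the 50 epochs in the configuration file. The function names are hypothetical.

```python
def implied_seconds_per_step(total_hours, steps_per_epoch, epochs):
    """Average step time implied by a total training-time estimate."""
    return total_hours * 3600.0 / (steps_per_epoch * epochs)


def estimated_hours(steps_per_epoch, epochs, seconds_per_step):
    """Total training time in hours for a given step time."""
    return steps_per_epoch * epochs * seconds_per_step / 3600.0


STEPS_PER_EPOCH = 9064  # assumed meaning of s_per_e in the log

# Assumption: the 145 h estimate covers 50 epochs.
step_time = implied_seconds_per_step(145, STEPS_PER_EPOCH, 50)
hours_for_20 = estimated_hours(STEPS_PER_EPOCH, 20, step_time)

print(f"{step_time:.2f} s/step, {hours_for_20:.1f} h for 20 epochs")
```

Under these assumptions, stopping at 20 epochs instead of 50 would cut the estimate by the same 20/50 ratio.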