ShuweiShao / IEBins

[NeurIPS2023] IEBins: Iterative Elastic Bins for Monocular Depth Estimation
MIT License

Training takes 145 h; NYUv2 RMSE is 0.3622 at epoch 21 #10

Closed jay25208 closed 6 months ago

jay25208 commented 6 months ago

Thanks for your work! I'm reproducing the NYUv2 training with Swin-L on 4 P40 GPUs. Each card only fits a batch of 1 image, and the estimated training time is 145 h; is that training efficiency normal? At epoch 21, the eval RMSE on NYUv2 is 0.338, and it fluctuates quite a bit, up to around 0.38. That is still well short of the target 0.314. Does the number of training epochs matter much, and how stable is training? Thanks!
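For reference, the RMSE being discussed is the standard monocular-depth eval metric: root-mean-square error in meters over the valid pixels of each depth map, averaged over the test set. A minimal sketch, assuming the usual NYUv2 depth cap of 10 m (the function name is illustrative, not the exact IEBins eval code):

```python
import numpy as np

def compute_rmse(pred: np.ndarray, gt: np.ndarray,
                 min_depth: float = 1e-3, max_depth: float = 10.0) -> float:
    """RMSE in meters over valid pixels of one predicted/ground-truth pair."""
    valid = (gt > min_depth) & (gt < max_depth)  # NYUv2 eval caps depth at 10 m
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))
```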

ShuweiShao commented 6 months ago

That doesn't look normal. Did you load the backbone's ImageNet-pretrained weights? Training has basically converged by epoch 21. Before open-sourcing I trained multiple times and landed near 0.314 every time, so it is quite stable.

I used 4 A5000 cards with a batch size of 8; 20-odd epochs were enough, which took roughly 20 to 30 hours.

jay25208 commented 6 months ago

> That doesn't look normal. Did you load the backbone's ImageNet-pretrained weights? Training has basically converged by epoch 21. Before open-sourcing I trained multiple times and landed near 0.314 every time, so it is quite stable. I used 4 A5000 cards with a batch size of 8; 20-odd epochs were enough, which took roughly 20 to 30 hours.

Looking at the log, it seems that with 4-GPU training, loading the model on card 0 is abnormal, which is strange. Memory usage while running is 19779/23040 MB. Have you run into anything like this?

This will evaluate the model every eval_freq 1000 steps and save best models for individual eval metrics.
== Use GPU: 2 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 2
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 1 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 1
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 3 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 3
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 0 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
The model and loaded state dict do not match exactly

unexpected key in source state_dict: norm.weight, norm.bias, head.weight, head.bias, layers.0.blocks.1.attn_mask, layers.1.blocks.1.attn_mask, layers.2.blocks.1.attn_mask, layers.2.blocks.3.attn_mask, layers.2.blocks.5.attn_mask, layers.2.blocks.7.attn_mask, layers.2.blocks.9.attn_mask, layers.2.blocks.11.attn_mask, layers.2.blocks.13.attn_mask, layers.2.blocks.15.attn_mask, layers.2.blocks.17.attn_mask

missing keys in source state_dict: norm0.weight, norm0.bias, norm1.weight, norm1.bias, norm2.weight, norm2.bias, norm3.weight, norm3.bias

== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 0
== Initial variables' sum: -58328.283, avg: -123.577
[epoch][s/s_per_e/gs]: [0][0/9064/0], lr: 0.000020000000, loss: 10.936847686768
ShuweiShao commented 6 months ago

> The model and loaded state dict do not match exactly
>
> unexpected key in source state_dict: norm.weight, norm.bias, head.weight, head.bias, layers.0.blocks.1.attn_mask, layers.1.blocks.1.attn_mask, layers.2.blocks.1.attn_mask, layers.2.blocks.3.attn_mask, layers.2.blocks.5.attn_mask, layers.2.blocks.7.attn_mask, layers.2.blocks.9.attn_mask, layers.2.blocks.11.attn_mask, layers.2.blocks.13.attn_mask, layers.2.blocks.15.attn_mask, layers.2.blocks.17.attn_mask
>
> missing keys in source state_dict: norm0.weight, norm0.bias, norm1.weight, norm1.bias, norm2.weight, norm2.bias, norm3.weight, norm3.bias

You can ignore this warning.

Does the P40 have 24 GB of memory? If so, running 2 images per card should be fine.
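The warning comes from a non-strict checkpoint load: the ImageNet-22k checkpoint carries a classification head (head.*), a final norm, and cached attn_mask buffers that the depth encoder never uses, while the encoder adds per-stage norms (norm0..norm3) that are trained from scratch. The message format suggests an mmcv-style loader; below is a minimal plain-PyTorch sketch of the same behavior (the function name is illustrative, not the repo's actual loading code):

```python
import torch

def load_pretrained_backbone(encoder: torch.nn.Module, path: str) -> None:
    """Load ImageNet-pretrained Swin weights, tolerating key mismatches."""
    ckpt = torch.load(path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)  # official Swin ckpts nest under "model"
    # strict=False copies every matching tensor and reports the rest:
    missing, unexpected = encoder.load_state_dict(state_dict, strict=False)
    print("missing keys (trained from scratch):", missing)   # e.g. norm0..norm3
    print("unexpected keys (dropped):", unexpected)          # head.*, norm.*, attn_mask
```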

jay25208 commented 6 months ago

> The model and loaded state dict do not match exactly
>
> unexpected key in source state_dict: norm.weight, norm.bias, head.weight, head.bias, layers.0.blocks.1.attn_mask, layers.1.blocks.1.attn_mask, layers.2.blocks.1.attn_mask, layers.2.blocks.3.attn_mask, layers.2.blocks.5.attn_mask, layers.2.blocks.7.attn_mask, layers.2.blocks.9.attn_mask, layers.2.blocks.11.attn_mask, layers.2.blocks.13.attn_mask, layers.2.blocks.15.attn_mask, layers.2.blocks.17.attn_mask
>
> missing keys in source state_dict: norm0.weight, norm0.bias, norm1.weight, norm1.bias, norm2.weight, norm2.bias, norm3.weight, norm3.bias
>
> You can ignore this warning.
>
> Does the P40 have 24 GB of memory? If so, running 2 images per card should be fine.

Right, 23040 MB. I tried 2 images per card, but it runs out of memory; with 1 image it uses 19779 MB. If that warning can be ignored, then the model loaded successfully on every card, and the performance just isn't good enough yet. I'll try again on different GPUs later. Thanks!

jay25208 commented 6 months ago

In the config txt, --batch_size 4 is the total number of images across all cards; from the code, each card gets batch_size / GPU_num. So for 2 images per card on 4 GPUs, I should set --batch_size 8, right?
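That matches the common PyTorch DDP launcher pattern, where --batch_size is the total and each worker takes its share. A minimal sketch of that split, assuming the repo follows this convention (as the code reading above suggests):

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=8,
                    help="total batch size summed over all GPUs")
args = parser.parse_args()

ngpus_per_node = torch.cuda.device_count()
# Each DDP worker sees batch_size / GPU_num samples per step, so on
# 4 GPUs, --batch_size 8 means 2 images per card.
per_gpu_batch_size = args.batch_size // max(ngpus_per_node, 1)
print(f"{per_gpu_batch_size} images per GPU across {ngpus_per_node} GPUs")
```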

ShuweiShao commented 6 months ago

You're welcome. Did you modify any settings? Reaching around 0.314 should be easy. When I ran 2 images per card, memory usage was under 20 GB if I remember correctly.

jay25208 commented 6 months ago

> You're welcome. Did you modify any settings? Reaching around 0.314 should be easy. When I ran 2 images per card, memory usage was under 20 GB if I remember correctly.

I didn't change the settings. The only difference is that I downloaded the sync data from the NeWCRFs link rather than the BTS one (the Google Drive link wouldn't download), but the data lists read fine, so that shouldn't be the issue. I'll retrain on different GPUs and see.

ShuweiShao commented 6 months ago

Feel free to share any new progress. Thanks.

jay25208 commented 6 months ago

After switching GPUs, I have basically reproduced the reported performance. On 32 GB cards, each card fits at most 4 images per batch at about 28 GB of memory, and 4-GPU training is estimated to take about 100 h. Around epoch 20 the best model reaches roughly RMSE 0.317; at epoch 35 it is 0.3221, with some fluctuation. The paper says 20 training epochs, so why does the project config file set 50 epochs? If the extra epochs aren't needed, I'll train for just 20 to save time.

With the previous cards, performance did not reproduce; each card only had a batch of 1 image, and I'm not sure whether that was the cause (one generic workaround is sketched after this comment).

At first glance, some of the code is related to URCDC-Depth, e.g. there is focal-length handling that doesn't seem to be used. I'll study your code in depth and would be glad to learn from you.
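On the batch-of-1 hypothesis: one generic way to raise the effective batch size without using more memory is gradient accumulation. This is a plain-PyTorch sketch, not something the IEBins code does, and it only changes the optimizer's effective batch, not batch-norm statistics:

```python
def train_epoch(model, optimizer, train_loader, accum_steps: int = 4):
    """One epoch with gradient accumulation: effective batch size becomes
    per_gpu_batch * num_gpus * accum_steps at no extra memory cost."""
    optimizer.zero_grad()
    for step, (image, depth_gt) in enumerate(train_loader):
        loss = model(image, depth_gt)     # assumes the model returns the loss
        (loss / accum_steps).backward()   # scale so gradients average correctly
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```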

ShuweiShao commented 6 months ago

It has basically converged by epoch 20; you can set it to 25 or 30. Happy to keep in touch.

ShuweiShao commented 6 months ago

My WeChat: shaoshuweifighting