Training takes 145 h; NYUv2 RMSE is 0.3622 after 21 epochs

jay25208 commented 6 months ago

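Thanks for your work! I am reproducing the NYUv2 training with Swin-L on 4 P40 GPUs. Each GPU can only fit a batch of 1 image, and the estimated training time is 145 h. Is this training speed normal? Training has reached epoch 21, and the eval RMSE on NYUv2 is 0.338, with sizable fluctuations up to around 0.38 along the way. That is still well short of the 0.314 target. Does the number of training epochs matter much, and how stable is training in general? Thanks!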

ShuweiShao commented 6 months ago

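That doesn't look normal. Did you load the backbone's ImageNet-pretrained weights? Training has basically converged by epoch 21. I trained many times before open-sourcing, and every run reached about 0.314, so it is fairly stable.

I used 4 A5000 GPUs with a batch size of 8. Training is essentially done after 20-odd epochs, which takes roughly 20 to 30 hours.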

jay25208 commented 6 months ago

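According to the log, with 4-GPU training, GPU 0 seems to hit an anomaly when loading the model, which is odd. Once running, memory usage is 19779 / 23040 MB. Have you run into anything like this?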

This will evaluate the model every eval_freq 1000 steps and save best models for individual eval metrics.
== Use GPU: 2 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 2
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 1 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 1
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 3 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 3
== Initial variables' sum: -58328.283, avg: -123.577
== Use GPU: 0 for training
== Load encoder backbone from: model_zoo/swin_transformer/swin_large_patch4_window7_224_22k.pth
The model and loaded state dict do not match exactly

unexpected key in source state_dict: norm.weight, norm.bias, head.weight, head.bias, layers.0.blocks.1.attn_mask, layers.1.blocks.1.attn_mask, layers.2.blocks.1.attn_mask, layers.2.blocks.3.attn_mask, layers.2.blocks.5.attn_mask, layers.2.blocks.7.attn_mask, layers.2.blocks.9.attn_mask, layers.2.blocks.11.attn_mask, layers.2.blocks.13.attn_mask, layers.2.blocks.15.attn_mask, layers.2.blocks.17.attn_mask

missing keys in source state_dict: norm0.weight, norm0.bias, norm1.weight, norm1.bias, norm2.weight, norm2.bias, norm3.weight, norm3.bias

== Total number of parameters: 272927893
== Total number of learning parameters: 272927893
== Model Initialized on GPU: 0
== Initial variables' sum: -58328.283, avg: -123.577
[epoch][s/s_per_e/gs]: [0][0/9064/0], lr: 0.000020000000, loss: 10.936847686768

ShuweiShao commented 6 months ago

The "The model and loaded state dict do not match exactly" warning, with its unexpected keys (norm.*, head.*, and the layers.*.blocks.*.attn_mask entries) and missing keys (norm0.* through norm3.*), can be ignored.
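For readers puzzled by the warning, here is a minimal sketch of how such "unexpected" and "missing" key lists arise when a checkpoint is compared against a model. It uses plain Python sets rather than the repository's loader, and the key lists are tiny illustrative subsets: `patch_embed.proj.weight` is a placeholder for a key present on both sides and does not appear in the log. A likely reading, stated here as an assumption, is that the ImageNet classifier checkpoint carries a final `norm` and `head` that the depth backbone does not use, while the backbone adds per-stage norms `norm0` to `norm3` that start from fresh initialization.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted.

    missing:    keys the model has but the checkpoint lacks
    unexpected: keys the checkpoint has but the model does not use
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Illustrative subsets only; "patch_embed.proj.weight" is a hypothetical shared key.
checkpoint_keys = [
    "patch_embed.proj.weight",
    "norm.weight", "norm.bias",
    "head.weight", "head.bias",
]
model_keys = [
    "patch_embed.proj.weight",
    "norm0.weight", "norm0.bias",
]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing keys in source state_dict:", ", ".join(missing))
print("unexpected key in source state_dict:", ", ".join(unexpected))
```

Keys that appear in neither list are the ones actually copied from the checkpoint, which is why a short mismatch list confined to the classifier head and norm layers is usually harmless.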

Does the P40 have 24 GB of memory? If so, running 2 images per GPU should be fine.

jay25208 commented 6 months ago

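Yes, 23040 MB. I tried 2 images per GPU and it ran out of memory. With 1 image, usage is currently 19779 MB. If that warning can be ignored, then every GPU loaded the model successfully and the performance is simply not good enough yet. I will try again on different GPUs later. Thanks!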

jay25208 commented 6 months ago

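In the config txt, --batch_size 4 is the total number of images across all GPUs; from the code, each GPU gets batch_size / GPU_num. So for 4 GPUs with 2 images each, I should set --batch_size 8, right?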
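The arithmetic can be sketched as follows. This is a minimal illustration, not the repository's code: the function names are hypothetical, and integer division is an assumption about how the split is rounded.

```python
def per_gpu_batch(total_batch_size, num_gpus):
    """Images each GPU processes per step, given the total --batch_size."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    # Assumption: the total is split evenly, with integer division.
    return total_batch_size // num_gpus


def total_batch_for(images_per_gpu, num_gpus):
    """Value to pass as --batch_size for a desired per-GPU image count."""
    return images_per_gpu * num_gpus


print(per_gpu_batch(4, 4))    # --batch_size 4 on 4 GPUs
print(total_batch_for(2, 4))  # 2 images per GPU on 4 GPUs
```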

ShuweiShao commented 6 months ago

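You're welcome. Did you change any settings? Reaching about 0.314 should be easy. When I ran 2 images per GPU, memory usage was probably under 20 GB, if I remember correctly.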

ShuweiShao commented 6 months ago

Yes, that's right.

jay25208 commented 6 months ago

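I did not change any settings. The only difference is that I downloaded the sync data from NeWCRFs rather than BTS (the Google Drive link would not download), but since the data list can be read, that should not be a problem. I will retrain on different GPUs and see.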

ShuweiShao commented 6 months ago

Keep me posted on any progress. Thanks.

jay25208 commented 6 months ago

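After switching GPUs I basically reproduced the performance. On 32 GB cards, each GPU fits at most 4 images per batch, using about 28 GB, and 4-GPU training is estimated at about 100 h. The best model comes at around epoch 20, with an RMSE of about 0.317; training is now at epoch 35 with 0.3221, and there are fluctuations. The paper says training runs for 20 epochs, so why does the config file in the repo use 50 epochs? If the extra epochs are not needed, I will train for 20 to cut training time.

The earlier GPUs did not reproduce the performance. Each of them had only 1 image per batch, and I am not sure whether that was the cause.

At first glance, some of the code seems related to URCDC-Depth; for example, there is focal-length handling that does not appear to be used. I will dig deeper into your codebase later and may come back with questions.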

ShuweiShao commented 6 months ago

Training has basically converged by epoch 20, so you can set it to 25 or 30. Feel free to reach out.
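To get a feel for what shortening the schedule saves, here is a rough sketch. It assumes the estimate of about 100 h reported above refers to the 50 epochs in the config file and that wall-clock time scales linearly with the epoch count; neither is confirmed in this thread, and the function name is hypothetical.

```python
def scaled_hours(estimated_hours, configured_epochs, target_epochs):
    """Linearly rescale a training-time estimate to a different epoch count."""
    if configured_epochs <= 0:
        raise ValueError("configured_epochs must be positive")
    return estimated_hours * target_epochs / configured_epochs


# Assumption: ~100 h corresponds to the 50-epoch config.
for epochs in (20, 25, 30):
    print(f"{epochs} epochs: about {scaled_hours(100, 50, epochs):.0f} h")
```

Evaluation overhead and data-loading warm-up are ignored, so treat the numbers as ballpark figures only.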

ShuweiShao commented 6 months ago

My WeChat: shaoshuweifighting

ShuweiShao / IEBins

Training takes 145 h; NYUv2 RMSE is 0.3622 after 21 epochs #10