BN/Affine layer related issues

taosean commented 4 years ago

Hi @chaoyuaw, sorry to bother you. I'm a bit confused about the SpatialBN layer in this repo.

In the config files, I see these params are set as:

MODEL:
  USE_AFFINE: True
CHECKPOINT:
  CONVERT_MODEL: True
NONLOCAL:
  USE_BN: False
  USE_AFFINE: True

I wonder, are these params related?

According to my understanding, if MODEL.USE_AFFINE=False (which means using SpatialBN), then CHECKPOINT.CONVERT_MODEL should be set as False. Is my understanding right?

I ported a network from PyTorch to Caffe2 and converted the PyTorch weight file to the Caffe2 format. However, I cannot get the same result from the converted weights as from the PyTorch version. (The PyTorch model was trained with 3D BN.)

I suppose this has something to do with the BatchNorm operations. I see SpatialBN is 2d BN, can it be used in the model with 3d convolution?

If I want to finetune this converted model, should I finetune it with MODEL.USE_AFFINE True or False?

Thanks!

chaoyuaw commented 4 years ago

"I wonder, are these params related?" Yes. MODEL.USE_AFFINE=True converts the BN layers to "affine" layers, which effectively freezes them. That's why we need to set "CHECKPOINT.CONVERT_MODEL=True": it converts the weights of the BN layers into a format that the affine layers can use. (See my reply further below for why and when we want to use it.)

NONLOCAL.USE_BN and NONLOCAL.USE_AFFINE mean slightly different things. Please see https://github.com/facebookresearch/video-long-term-feature-banks/blob/master/lib/models/nonlocal_helper.py#L146 for the exact implementation.
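
To make the BN-to-affine conversion mentioned above concrete, here is a minimal sketch of the underlying arithmetic. It is not the repo's conversion code: the function names are hypothetical, and eps=1e-5 is an assumed value that has to match the epsilon of the original BN layer. A BN layer with fixed statistics computes gamma * (x - mean) / sqrt(var + eps) + beta, which equals scale * x + bias for a suitable per-channel scale and bias.

```python
import math


def fold_bn_into_affine(gamma, beta, running_mean, running_var, eps=1e-5):
    """Return per-channel (scale, bias) equivalent to BN with frozen statistics.

    eps is an assumption; it must match the epsilon used by the BN layer.
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, running_var)]
    bias = [b - m * s for b, m, s in zip(beta, running_mean, scale)]
    return scale, bias


def bn_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-mode BN applied to one value per channel."""
    return [
        g * (xi - m) / math.sqrt(v + eps) + b
        for xi, g, b, m, v in zip(x, gamma, beta, running_mean, running_var)
    ]


# Made-up parameters for two channels, purely for illustration.
gamma = [1.5, 0.5]
beta = [0.1, -0.2]
mean = [0.3, -1.0]
var = [4.0, 0.25]
x = [2.0, 3.0]

scale, bias = fold_bn_into_affine(gamma, beta, mean, var)
affine_out = [s * xi + b for s, xi, b in zip(scale, x, bias)]
bn_out = bn_inference(x, gamma, beta, mean, var)

# The two outputs agree up to floating-point rounding.
print(affine_out)
print(bn_out)
```

Because scale and bias are computed once from fixed statistics, the resulting layer no longer depends on the batch, which is what "freezing" BN amounts to.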

"According to my understanding, if MODEL.USE_AFFINE=False (which means using SpatialBN), then CHECKPOINT.CONVERT_MODEL should be set as False. Is my understanding right?" Yes

"I see SpatialBN is 2d BN, can it be used in the model with 3d convolution?" Yes, for example, the 3D Conv at https://github.com/facebookresearch/video-long-term-feature-banks/blob/master/lib/models/model_builder_video.py#L176 uses the SpatialBN operator.
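
As a rough illustration of why the number of spatial dimensions matters little here: batch norm keeps one mean and one variance per channel and averages over every other axis. The sketch below is plain Python, not Caffe2 code, and the helper names are made up for this example. It computes per-channel statistics in the same way for a 4D (N, C, H, W) input and a 5D (N, C, T, H, W) input.

```python
def flatten(x):
    """Yield every scalar inside an arbitrarily nested list."""
    if isinstance(x, list):
        for item in x:
            yield from flatten(item)
    else:
        yield x


def channel_stats(batch, channel):
    """Mean and biased variance of one channel over N and all trailing axes."""
    values = [v for sample in batch for v in flatten(sample[channel])]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


# 4D input with shape (N=2, C=1, H=1, W=2)
batch_2d = [[[[1.0, 2.0]]], [[[3.0, 4.0]]]]
# 5D input with shape (N=2, C=1, T=2, H=1, W=2)
batch_3d = [[[[[1.0, 2.0]], [[3.0, 4.0]]]], [[[[5.0, 6.0]], [[7.0, 8.0]]]]]

stats_2d = channel_stats(batch_2d, 0)
stats_3d = channel_stats(batch_3d, 0)
print(stats_2d)  # (2.5, 1.25)
print(stats_3d)  # (4.5, 5.25)
```

The extra time axis only adds more values to each channel's average; the number of learned parameters per channel stays the same.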

"If I want to finetune this converted model, should I finetune it with MODEL.USE_AFFINE True or False?" The reason for freezing BN (by setting USE_AFFINE=True) is that our batch size per GPU is small, so BN doesn't work well. If your batch size with the new model is large enough (e.g. 8 per GPU), I think it'll work better with BN turned on (USE_AFFINE=False). If your batch size is small (< 4 per GPU), I'd guess it would work better to set "CHECKPOINT.CONVERT_MODEL True", which converts the BN layers into frozen "affine" layers, and to train with those frozen BNs by setting "USE_AFFINE=True".
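
The rule of thumb above could be written down as a small helper. This is only a sketch: suggest_bn_settings is a hypothetical function, not part of the repo, and the thresholds (8 and 4 per GPU) are the rough numbers from this reply rather than tuned values. Batch sizes in between are not covered by the advice, so the helper returns None for them.

```python
def suggest_bn_settings(batch_per_gpu):
    """Map a per-GPU batch size to the config values suggested in this thread.

    Hypothetical helper; the thresholds are rough guidance, not hard limits.
    """
    if batch_per_gpu >= 8:
        # Batch is large enough: keep real BN layers, no weight conversion.
        return {"MODEL.USE_AFFINE": False, "CHECKPOINT.CONVERT_MODEL": False}
    if batch_per_gpu < 4:
        # Batch is small: freeze BN as affine layers and convert the weights.
        return {"MODEL.USE_AFFINE": True, "CHECKPOINT.CONVERT_MODEL": True}
    # Between 4 and 8 the thread gives no clear advice.
    return None


for batch in (2, 6, 8):
    print(batch, suggest_bn_settings(batch))
```

Note that the two flags always move together here, matching the earlier answer that MODEL.USE_AFFINE=False goes with CHECKPOINT.CONVERT_MODEL=False.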

"I ported a network from Pytorch to Caffe2 and converted the Pytorch version weight file to Caffe2 version weight, however, I cannot get the same result" I recommend double-checking that the architecture defined in your PyTorch model is exactly the same as the architecture defined in this repo (including details like striding, pooling size, etc.). Our architecture is slightly different from the original non-local network (see also https://arxiv.org/pdf/1812.05038.pdf, Appendix A).

taosean commented 4 years ago

Hi @chaoyuaw, it's very nice of you to respond to my questions. Thank you very much.

I have another question, though. If I finetune the model with BN enabled, which means I set

MODEL:
  USE_BN: True
  USE_AFFINE: False
CHECKPOINT:
  CONVERT_MODEL: False

how should I set these two NONLOCAL-related parameters?

NONLOCAL:
  USE_BN: False or True?
  USE_AFFINE: True or False?

Are the NONLOCAL USE_BN and USE_AFFINE parameters related to their counterparts in the cfg.MODEL section?

Thanks!

chaoyuaw commented 4 years ago

If your original model uses a BN layer in the non-local (NL) block and you don't want to freeze it, set NONLOCAL.USE_BN: True and NONLOCAL.USE_AFFINE: False.

I recommend taking a look at https://github.com/facebookresearch/video-long-term-feature-banks/blob/master/lib/models/nonlocal_helper.py#L146 to see exactly what these options imply.

taosean commented 4 years ago

Thanks @chaoyuaw, I understand now. Thank you.

facebookresearch / video-long-term-feature-banks

BN/Affine layer related issues #45