Tushar-N / pytorch-resnet3d

I3D Nonlocal ResNets in Pytorch

About caffe to pytorch image input? #8

Closed FloydEdwin closed 4 years ago

FloydEdwin commented 5 years ago

Hello! I have a question about the image input. It seems that caffe uses BGR channel order, while pytorch uses RGB. I see that your code does not transpose the Kinetics video input from RGB to BGR. Is this important? What do you think about that?

Tushar-N commented 5 years ago

The log file for the run in the caffe2 repo has use_bgr=False. It would only matter if the channel values were actually swapped.
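
(For reference, if a checkpoint did expect BGR input, reordering the channels of a clip tensor is a one-liner. This is just a minimal sketch; the [C, T, H, W] clip layout and sizes below are assumptions for illustration.)

import torch

# hypothetical RGB clip in [C, T, H, W] layout
clip = torch.rand(3, 32, 224, 224)

# reverse the channel axis: RGB -> BGR (only needed if a checkpoint expects BGR)
clip_bgr = clip[[2, 1, 0]]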

FloydEdwin commented 4 years ago

The log file for the run in the caffe2 repo has use_bgr=False. It would only matter if the channel values were actually swapped.

Thanks for your reply. Recently I have been using your converted pytorch model for finetuning. I ran experiments on Kinetics, UCF and Charades, but the finetuning results were not what I expected (UCF: 56% top-1, Charades: 18% mAP; for reference, on Kinetics I get 72.5% / 74% with NL, which is fine). There may be some mistakes in my code, but my part only reads the data and feeds it to the model, so I suspect I overlooked some detail of your model. That is why I asked about the channel order (BGR/RGB) a few days ago. Is there anything I should pay attention to when finetuning, or anything I need to be aware of about your model? Could you give me some advice? Thanks.

Tushar-N commented 4 years ago

I've been able to finetune without a problem on EPIC-Kitchens though I haven't tried UCF/Charades. Did you freeze batchnorm by uncommenting this line when you finetune?
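
For anyone landing here later: freezing batchnorm for finetuning usually means keeping the BN layers in eval mode (so the running mean/var stay fixed) and optionally freezing their affine parameters. A minimal sketch of what that looks like; the helper name freeze_bn is hypothetical, not this repo's exact code:

import torch.nn as nn

def freeze_bn(model):
    # keep BN statistics fixed and stop gradients to the scale/shift parameters
    for m in model.modules():
        if isinstance(m, nn.BatchNorm3d):
            m.eval()
            if m.affine:
                m.weight.requires_grad = False
                m.bias.requires_grad = False

Note that calling model.train() puts BN modules back into training mode, so something like this typically has to be re-applied at the start of every epoch.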

ShiroKL commented 4 years ago

I tried with UCF and it seems to work too. Are you sure about your dataset loader (the equivalent of kinetics.py, but adapted for your dataset)?

FloydEdwin commented 4 years ago

I tried with UCF and it seems to work too. Are you sure about your dataset loader (the equivalent of kinetics.py, but adapted for your dataset)?

Thanks for the reply. It seems there is something wrong in my code. Would you mind sharing some information?
1. What accuracy do you get in your experiment (on UCF split 1)?
2. How many frames are selected per video? (Sampled uniformly, or by randomly cropping 64 consecutive frames from the full-length video and then dropping every other frame?)
3. How do you run inference on a video: sample 10 clips, or use the full video?

FloydEdwin commented 4 years ago

I've been able to finetune without a problem on EPIC-Kitchens though I haven't tried UCF/Charades. Did you freeze batchnorm by uncommenting this line when you finetune?

Thanks for your reply. I did freeze BN, following the experimental setting in the NL paper, so there must be something else wrong in my code; I will try to find it. One thing I should mention: I set the normalization values to input_mean = [0.485, 0.456, 0.406] and input_std = [0.229, 0.224, 0.225], which I took from TSN. Is that a bad choice for video?

ShiroKL commented 4 years ago

I tried with UCF and it seems to work too. Are you sure about your dataset loader (the equivalent of kinetics.py, but adapted for your dataset)?

Thanks for the reply. It seems there is something wrong in my code. Would you mind sharing some information? 1. What accuracy do you get in your experiment (on UCF split 1)? 2. How many frames are selected per video? (Sampled uniformly, or by randomly cropping 64 consecutive frames from the full-length video and then dropping every other frame?) 3. How do you run inference on a video: sample 10 clips, or use the full video?

1) Around 90%. 2) I did not change the value, so it should be 32 frames, selected uniformly (unless I misunderstood the code). 3) I followed eval.py for the evaluation, using the clip parameters. If I try to run it on the entire video it takes far too long (several hours instead of 10 or 20 minutes).
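
For item 2, uniform sampling of that kind can be sketched as follows (a minimal illustration, not the repo's kinetics.py code; the clip length of 32 is just the default mentioned above):

import numpy as np

def sample_uniform_indices(num_frames, clip_len=32):
    # spread clip_len frame indices evenly over the whole video
    indices = np.linspace(0, num_frames - 1, num=clip_len)
    return indices.round().astype(int)

# e.g. a 300-frame video -> 32 indices between 0 and 299
print(sample_uniform_indices(300))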

One more thing: I removed the normalisation of the image in util.py ("gtransforms.GroupNormalize(mean, std)") and instead used a BatchNorm3d at the input of the network. I do not think that changes anything, but I prefer to let you know just in case.
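
For completeness, an input-side BatchNorm3d replacement of that kind might look like the wrapper below (my own illustration, not ShiroKL's actual code):

import torch.nn as nn

class InputNormWrapper(nn.Module):
    # learn per-channel input statistics instead of using fixed
    # GroupNormalize(mean, std) values
    def __init__(self, backbone, in_channels=3):
        super().__init__()
        self.input_bn = nn.BatchNorm3d(in_channels)
        self.backbone = backbone

    def forward(self, clip):  # clip: [N, C, T, H, W]
        return self.backbone(self.input_bn(clip))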

Tushar-N commented 4 years ago

I would use the same normalization scheme as video-nonlocal does. util.clip_transform() already does this for you, and there's no need to change this.

I set the normalization values to input_mean = [0.485, 0.456, 0.406] and input_std = [0.229, 0.224, 0.225], which I took from TSN.

The model was trained with different input and normalization values. The inputs are in the range [0-255] (not [0-1]) and the mean, std values are

mean = [114.75, 114.75, 114.75]
std = [57.375, 57.375, 57.375]

See util.py for more info.
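
To make the difference concrete, here is a minimal sketch of the expected preprocessing, assuming a clip tensor whose values are still in [0, 255] (the shapes and variable names are illustrative; util.clip_transform() already does the real thing):

import torch

mean = torch.tensor([114.75, 114.75, 114.75]).view(3, 1, 1, 1)
std = torch.tensor([57.375, 57.375, 57.375]).view(3, 1, 1, 1)

# hypothetical clip in [C, T, H, W] layout with values in [0, 255]
clip = torch.randint(0, 256, (3, 32, 224, 224)).float()

# normalize per channel; do NOT rescale to [0, 1] first
clip = (clip - mean) / std

The TSN-style values (0.485, ... / 0.229, ...) assume inputs already scaled to [0, 1], so pairing them with this checkpoint would feed the network inputs on a very different scale.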

FloydEdwin commented 4 years ago

I would use the same normalization scheme as video-nonlocal does. util.clip_transform() already does this for you, and there's no need to change this.

I set the normalization values to input_mean = [0.485, 0.456, 0.406] and input_std = [0.229, 0.224, 0.225], which I took from TSN.

The model was trained with different input and normalization values. The inputs are in the range [0-255] (not [0-1]) and the mean, std values are

mean = [114.75, 114.75, 114.75]
std = [57.375, 57.375, 57.375]

See util.py for more info.

Hello! I'm sorry to bother you again. I fixed all the errors in my finetuning code about 10 days ago, but today I have a question about the network itself. In your resnet.py I find:

self.conv1 = nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3), bias=False)

However, when I checked the nonlocal (caffe2) code:

conv_blob = model.ConvNd(
    data, 'conv1', 3, 64,
    [1 + use_temp_convs_set[0][0] * 2, 7, 7],
    strides=[temp_strides_set[0][0], 2, 2],
    pads=[use_temp_convs_set[0][0], 3, 3] * 2,
    weight_init=('MSRAFill', {}),
    bias_init=('ConstantFill', {'value': 0.}),
    no_bias=1)

Here use_temp_convs_set[0][0] = 2, so the kernel is (5, 7, 7), which matches. But temp_strides_set[0][0] = 1, so the stride should be (1, 2, 2), not (2, 2, 2). The same problem appears in self.maxpool1 = nn.MaxPool3d(kernel_size=(2, 3, 3), stride=(2, 2, 2), padding=(0, 0, 0)): the stride there is (1, 3, 3), not (2, 3, 3). I am confused about this, and I hope you can clear it up.

Also, I find the output size at every layer is:

torch.Size([4, 64, 32, 112, 112]) → conv1
torch.Size([4, 256, 32, 55, 55]) → pool1
torch.Size([4, 256, 16, 55, 55]) → res2
torch.Size([4, 256, 16, 55, 55]) → pool2
torch.Size([4, 512, 16, 28, 28]) → res3
torch.Size([4, 1024, 16, 14, 14]) → res4
torch.Size([4, 2048, 16, 7, 7]) → res5

The [32, 55, 55] here is not the [32, 56, 56] I expected. I think this may be due to a pooling difference between pytorch and caffe2. What is your opinion? Thank you!
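
For what it's worth, a floor- vs. ceil-mode pooling difference would produce exactly a 55 vs. 56 discrepancy; a minimal sketch to check (the (1, 3, 3) kernel and (1, 2, 2) stride are just an illustrative pooling config over a 112x112 input, not necessarily the exact layer in either codebase):

import torch
import torch.nn as nn

x = torch.rand(1, 64, 32, 112, 112)

floor_pool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), ceil_mode=False)
ceil_pool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), ceil_mode=True)

print(floor_pool(x).shape)  # spatial size: floor((112 - 3) / 2) + 1 = 55
print(ceil_pool(x).shape)   # spatial size: ceil((112 - 3) / 2) + 1 = 56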

Tushar-N commented 4 years ago

I'm actually not too sure about the differences in caffe2/pytorch ops, but you might be onto something. As far as I know, the output activation size in the caffe2 model is also [32,55,55] for pool1. You can test this by running python -m utils.layer_by_layer --model r50 and printing out the sizes of the activations for each network.
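
One way to print those sizes without editing the script is a forward hook; a minimal sketch, assuming a model built from this repo's resnet.py and a clip tensor of the right shape (both hypothetical here):

import torch

def print_activation_sizes(model, clip):
    # register a hook on every leaf module that prints its output shape
    hooks = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, name=name:
                    print(name, tuple(out.shape)) if torch.is_tensor(out) else None))
    with torch.no_grad():
        model(clip)
    for h in hooks:
        h.remove()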

Also, FAIR has recently released their SlowFast codebase which has I3D + NL model definitions as well. It's probably a good idea to use their code, or look at their video_model_builder.py for reference if you want to continue using this simplified codebase.