YuwenXiong / py-R-FCN

R-FCN with joint training and python support
MIT License
training error with layer issues #10

Closed: brisker closed this issue 7 years ago

brisker commented 8 years ago

@Orpine F1013 12:05:22.696523 14673 net.cpp:784] Cannot copy param 0 weights from layer 'rpn_conv/3x3'; shape mismatch. Source param shape is 512 1024 3 3 (4718592); target param shape is 512 2048 3 3 (9437184). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer. * Check failure stack trace: *

YuwenXiong commented 8 years ago

Sorry, I made a silly mistake. It seems that you are not using OHEM, so please modify this line: https://github.com/Orpine/py-R-FCN/blob/master/models/pascal_voc/ResNet-101/rfcn_end2end/train_agonistic.prototxt#L6929 to bottom: "res4b22"
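To read the failure above: Caffe stores convolution weights as (num_output, input_channels, kernel_h, kernel_w), so the second number in each shape is the channel count of the blob feeding the layer. The sketch below parses the message and compares the two shapes. The helper name `parse_shape_mismatch` is hypothetical and not part of Caffe or this repository.

```python
import re
from math import prod

# The shape-mismatch message from the log above (timestamp prefix omitted).
MESSAGE = (
    "Cannot copy param 0 weights from layer 'rpn_conv/3x3'; shape mismatch. "
    "Source param shape is 512 1024 3 3 (4718592); "
    "target param shape is 512 2048 3 3 (9437184)."
)

def parse_shape_mismatch(message):
    """Return (layer, source_shape, target_shape) from a 'Cannot copy param' message.

    Hypothetical helper for illustration only.
    """
    pattern = (
        r"from layer '([^']+)'; shape mismatch\. "
        r"Source param shape is ([\d ]+?) \(\d+\); "
        r"target param shape is ([\d ]+?) \(\d+\)"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("not a recognised shape-mismatch message")
    layer = match.group(1)
    source = tuple(int(n) for n in match.group(2).split())
    target = tuple(int(n) for n in match.group(3).split())
    return layer, source, target

layer, source, target = parse_shape_mismatch(MESSAGE)
# Index 1 is the input channel count: the saved weights expect 1024 channels,
# while the network definition feeds the layer 2048 channels.
print(layer, "saved:", source[1], "defined:", target[1])
print(prod(source), prod(target))  # element counts, as reported in parentheses in the log
```

In the standard ResNet layout the conv4 stage outputs 1024 channels and the conv5 stage outputs 2048, which is consistent with the fix of pointing the layer's bottom at a res4 blob.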

brisker commented 8 years ago

@Orpine I am using ResNet-50. How do I fix this?

YuwenXiong commented 8 years ago

@brisker You need to change https://github.com/Orpine/py-R-FCN/blob/master/models/pascal_voc/ResNet-50/rfcn_end2end/test_agonistic.prototxt#L3532 to bottom: res5c. The real cause, though, is that I forgot to change https://github.com/Orpine/py-R-FCN/blob/master/models/pascal_voc/ResNet-50/rfcn_end2end/train_agonistic.prototxt#L3532 to bottom: res4f

brisker commented 8 years ago

@Orpine I made the changes you advised, but a new error occurs: F1014 13:34:19.352131 5063 net.cpp:784] Cannot copy param 0 weights from layer 'rfcn_cls'; shape mismatch. Source param shape is 1029 1024 1 1 (1053696); target param shape is 98 1024 1 1 (100352). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer. * Check failure stack trace: *

YuwenXiong commented 8 years ago

@brisker Please upload your prototxts (both train and test) and the log to https://gist.github.com/ so I can check them.

brisker commented 8 years ago

@Orpine here https://gist.github.com/brisker/66bb0775defb82e9b4255727b6eba887

YuwenXiong commented 8 years ago

Those prototxts seem fine. Are you finetuning from my demo model? That will raise an error, since my demo model contains an rfcn_cls layer with 1029 output channels. Check step 7 of Preparation for Training & Testing and download the ResNet-50 and ResNet-101 ImageNet-pretrained models manually (from https://github.com/KaimingHe/deep-residual-networks).
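The two channel counts in the rfcn_cls error can be checked with a little arithmetic. In the R-FCN design the classification layer outputs k*k*(C+1) position-sensitive score maps, for C foreground classes plus background on a k-by-k grid. The sketch below assumes k = 7 (the default in the R-FCN paper; the thread itself does not state it), and the function name is hypothetical.

```python
def rfcn_cls_channels(num_foreground_classes, k=7):
    """Output channels of the position-sensitive classification layer.

    Assumes a k x k grid of score maps per class, with one extra
    class for background. k=7 is an assumption, not taken from the thread.
    """
    return k * k * (num_foreground_classes + 1)

# 20 PASCAL VOC classes + background: matches the 1029 channels in the saved model.
print(rfcn_cls_channels(20))
# 1 foreground class + background: matches the 98 channels in the network definition.
print(rfcn_cls_channels(1))
```

Under that assumption, the saved weights were trained for 20 classes while the network being trained defines a single foreground class, so the layer cannot be copied as-is.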

Another solution is to change the layer name from rfcn_cls to any other name, like rfcn_cls_binary; Caffe will then reinitialize this layer rather than try to copy the weights. You also need to rename rfcn_bbox: I modified this layer's weights when taking the snapshot, so you cannot continue training from them.
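As a rough sketch of the renaming approach: Caffe matches saved weights to layers by the layer's name field, so only that field needs to change, while top and bottom blob names can stay as they are. The helper below is hypothetical, works on prototxt text with double-quoted names, and uses rfcn_bbox_new as a made-up replacement name; the sample prototxt is a simplified stand-in, not copied from the repository.

```python
import re

# Simplified stand-in for the two layers discussed above (not the real prototxt).
SAMPLE = '''layer {
  bottom: "conv_new_1"
  top: "rfcn_cls"
  name: "rfcn_cls"
  type: "Convolution"
}
layer {
  bottom: "conv_new_1"
  top: "rfcn_bbox"
  name: "rfcn_bbox"
  type: "Convolution"
}
'''

def rename_layers(prototxt_text, renames):
    """Rewrite only `name: "<old>"` fields; top/bottom blob names are left untouched."""
    for old, new in renames.items():
        pattern = r'(\bname\s*:\s*)"' + re.escape(old) + r'"'
        prototxt_text = re.sub(
            pattern,
            lambda m, new=new: m.group(1) + '"' + new + '"',
            prototxt_text,
        )
    return prototxt_text

renamed = rename_layers(
    SAMPLE,
    {"rfcn_cls": "rfcn_cls_binary", "rfcn_bbox": "rfcn_bbox_new"},  # new names are arbitrary
)
print(renamed)
```

The same renaming would need to be applied to both the train and test prototxts, since the test network also looks up the trained weights by layer name.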

Simonhong111 commented 8 years ago

Hello, my graphics card is a GTX 1060 6GB. When I run the code on Windows, I get an error like "error == cudaSuccess (2 vs. 0) out of memory". What should I do, and will the modification result in lower accuracy? Thanks.

YuwenXiong commented 8 years ago

Hi @Simonhong111, there may be several reasons for this. First, you must use cuDNN to reduce GPU memory usage. You also need to close any application that may be occupying a lot of GPU memory. On my machine, R-FCN with ResNet-101 uses 5.5GB of GPU memory, so I think a GTX 1060 can run it. Otherwise, you could try ResNet-50.

jhung0 commented 7 years ago

Should I change https://github.com/Orpine/py-R-FCN/blob/master/models/try1/ResNet-50/rfcn_end2end/class-aware/train_ohem.prototxt#L3532 as well? Getting the same error as OP.

dantp-ai commented 7 years ago

@Orpine I have tried the second approach, where rfcn_cls and rfcn_bbox are renamed. However, I am also interested in trying the first approach, but I cannot find Preparation for Training & Testing step 7 at the mentioned URL.

YuwenXiong commented 7 years ago

@plopd Preparation for Training & Testing step 7 is in https://github.com/Orpine/py-R-FCN/blob/master/README.md; the mentioned URL is there to help you find where to download the models.

foralliance commented 6 years ago

@YuwenXiong "Another solution is that you could change the layer name from rfcn_cls to any other name, like rfcn_cls_binary, then Caffe will reinitialize this layer rather than try to copy weights. You also need to rename rfcn_bbox since I modified this layer's weights when I snapshot, you cannot continue training on the weights." What does this passage mean, and how should I understand it? Many thanks.