YuwenXiong / py-R-FCN

R-FCN with joint training and python support
MIT License

psroi_pooling_layer.cu:108 invalid configuration argument #30

Open harrycrossincode opened 7 years ago

harrycrossincode commented 7 years ago

Hi,

I hit the following error during ResNet-50 training: psroi_pooling_layer.cu:108] Check failed: error == cudaSuccess (9 vs. 0) invalid configuration argument

Ubuntu 14.04 + one 1080 GPU card.

Any idea on the issue? Thanks.

ravikantb commented 7 years ago

Hi @harrycrossincode, did you solve your issue? I have the same issue with the same configuration, but during testing. I have two 1080 cards, but I guess we can't use both of them because of the Python layers. Please let me know here if you have solved it. If I solve it first, I will post an update here. Thanks.

haihaoshen commented 7 years ago

Check your annotation information and make sure every ROI is valid. Once the invalid ones were fixed, training ran fine for me.
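A minimal sketch of what "checking that every ROI is valid" could look like. It is not code from py-R-FCN: the function name `find_invalid_boxes` is hypothetical, and the sketch assumes boxes are stored as `[x1, y1, x2, y2]` in 0-based pixel coordinates. It flags boxes that fall outside the image or have zero or negative width or height, for example coordinates that become negative after a 1-based to 0-based conversion.

```python
import numpy as np


def find_invalid_boxes(boxes, width, height):
    """Return indices of boxes that are out of bounds or degenerate.

    Assumption: boxes is an (N, 4) array-like of [x1, y1, x2, y2]
    in 0-based pixel coordinates for an image of size width x height.
    """
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    bad = (
        (x1 < 0) | (y1 < 0)                # starts outside the image
        | (x2 >= width) | (y2 >= height)   # ends outside the image
        | (x2 <= x1) | (y2 <= y1)          # zero or negative extent
    )
    return np.where(bad)[0]


if __name__ == "__main__":
    sample = [[0, 0, 10, 10], [-1, 0, 5, 5], [3, 3, 3, 8]]
    print(find_invalid_boxes(sample, width=100, height=100))
```

Running this over each image's annotations before training should point to the entries that need fixing.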

ravikantb commented 7 years ago

@haihaoshen Are you implying that the ROIs for the test images (which, as I understand it, are stored in 'stage3_rpn_final_proposals.pkl') are invalid? If so, I wonder what could cause that. Is my training flawed, or is it something else? Any idea? Please correct me if I misunderstood what you are trying to convey. Thanks!

haihaoshen commented 7 years ago

I meant the training set. Please use end2end mode, as it is simpler.

ravikantb commented 7 years ago

Actually, I was able to train the model with the alternating training approach using the provided script (py-R-FCN/experiments/scripts/rfcn_alt_opt_5stage_ohem.sh) without any problem. But at the end of this script it tries to test the trained model on the test set, and that step fails with the above error.

My understanding of the training and testing phases is that the ROIs for both the training and test sets are computed during the training steps only. Once training is over, the script tries to calculate mAP on the test set using these ROIs. This belief comes from the fact that 'rfcn_test.pt', which is used for testing, does not have RPN layers and the 'HAS_RPN' flag is set to False during testing, yet the ROIs have to come from somewhere. The script did not work for me, but I inserted RPN layers into 'rfcn_test.pt' and then tested the model on single images, and it worked (though not as well as I would have liked). I am training ResNet-101 right now and hope it will work.

On a side note, since your configuration is similar to mine, would you be interested in helping me with some more observations I made on my setup?

stanstarks commented 7 years ago

@ravikantb I just ran into this problem too. I turned on the debug mode of ProposalTargetLayer and found that it sampled 0 fg and 0 bg ROIs. After modifying the valid-image criteria in fast_rcnn.train.filter_roidb, I am able to train my own model without OHEM.
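One plausible reading of why 0 sampled ROIs surfaces as "invalid configuration argument": CUDA rejects a kernel launch whose grid has zero blocks, and Caffe derives the block count from the number of output elements. The sketch below mirrors that arithmetic in Python. The constant 512 for threads per block and the helper `psroi_output_count` are assumptions for illustration, not values read from this repository.

```python
# Assumed threads per block; Caffe builds commonly use 512.
CAFFE_CUDA_NUM_THREADS = 512


def caffe_get_blocks(n):
    """Ceiling division, as in Caffe's CAFFE_GET_BLOCKS(n)."""
    return (n + CAFFE_CUDA_NUM_THREADS - 1) // CAFFE_CUDA_NUM_THREADS


def psroi_output_count(num_rois, output_dim, group_size):
    """Hypothetical helper: element count of a PSROI pooling output
    shaped (num_rois, output_dim, group_size, group_size)."""
    return num_rois * output_dim * group_size * group_size


if __name__ == "__main__":
    for num_rois in (128, 0):
        count = psroi_output_count(num_rois, output_dim=21, group_size=7)
        print(num_rois, "rois ->", caffe_get_blocks(count), "blocks")
```

With zero ROIs the element count is 0, so the computed grid size is 0, and a launch with that grid would fail before the kernel runs. Under that reading, the CUDA error is a symptom and the empty ROI batch is the thing to track down.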

ivansong1988 commented 7 years ago

@stanstarks I have run into the same problem. What exactly did you change in the valid-image criteria?

Timonzimm commented 7 years ago

Has anyone figured out what @stanstarks means by modifying the image criteria in filter_roidb? I run into the same issue when trying to use my own dataset (fg num: 0 and bg num: 0).
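The thread does not say what was actually changed, so the following is only a diagnostic sketch, not the fix described above. It assumes each roidb entry carries a `max_overlaps` array (the best IoU of each ROI with any ground-truth box) and that an ROI counts as foreground or background by simple thresholds. The function names and the threshold defaults are illustrative assumptions; use the values from your own config.

```python
import numpy as np


def count_fg_bg(max_overlaps, fg_thresh=0.5, bg_thresh_hi=0.5, bg_thresh_lo=0.1):
    """Count ROIs that would qualify as foreground / background.

    Thresholds are illustrative assumptions, not values read from py-R-FCN.
    """
    overlaps = np.asarray(max_overlaps, dtype=np.float64)
    fg = np.sum(overlaps >= fg_thresh)
    bg = np.sum((overlaps < bg_thresh_hi) & (overlaps >= bg_thresh_lo))
    return int(fg), int(bg)


def entries_without_samples(roidb, **thresholds):
    """Indices of roidb entries with neither fg nor bg candidates."""
    return [
        i for i, entry in enumerate(roidb)
        if count_fg_bg(entry["max_overlaps"], **thresholds) == (0, 0)
    ]


if __name__ == "__main__":
    roidb = [
        {"max_overlaps": [0.7, 0.2]},   # one fg, one bg
        {"max_overlaps": [0.05, 0.0]},  # all below the bg lower bound
        {"max_overlaps": []},           # no ROIs at all
    ]
    print(entries_without_samples(roidb))
```

Entries returned by `entries_without_samples` are the ones that would give a sampler nothing to pick from. Whether to drop them, relax the lower background threshold, or fix their annotations depends on the dataset.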

junx1992 commented 6 years ago

Looking forward to a reply!