jmozah opened 9 years ago
Last I checked, there wasn't a publicly available version of their train_val.prototxt. Lots of people have asked for it:
https://gist.github.com/ksimonyan/fd8800eeb36e276cd6f9#comment-1430126
https://gist.github.com/ksimonyan/211839e770f7b538e2d8#comment-1346808
https://gist.github.com/ksimonyan/3785162f95cd2d5fee77#comment-1316301
I think they probably just don't have it anymore. If you want to put together a version that trains successfully on multiple datasets, then we can test it and get it added to DIGITS.
Look at the bottom of this link: @karpathy has posted a link there https://gist.github.com/ksimonyan/211839e770f7b538e2d8#file-readme-md
I will try and see if I can successfully train a version.
Hi @jmozah
Were you able to train VGG successfully? I think training using the pretrained model works but training from scratch does not converge.
If anyone has successfully trained VGG16 or VGG19 from scratch, can you share your solver and train_val files?
No... The network failed after 1 epoch... Will check it next week and update.
@jmozah Any success?
No... not yet
I use a train_val that I updated from an old one. It works with the VGG 19 layers (with a very small batch). https://gist.github.com/groar/d455ebe671b2f1807659
I used it for fine-tuning, but never tried to train it from scratch. I could try.
Update on this:
@graphific uploaded a train_val.prototxt in the comments for this gist. I tried it on a 20-class subset of ImageNet (which should be easier to solve than the full imagenet dataset) and it totally failed to train (whereas AlexNet and GoogLeNet converge quickly every time).
So, still no luck here :-/
It would probably help to add Xavier weight initialization for this kind of deep network. With the default weight initialization the odds of hitting a vanishing gradient in the first layers are high.
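In Caffe that change is a one-line weight_filler swap per learned layer. A minimal sketch of what it looks like on a convolution layer (the layer name and dimensions here are illustrative, not taken from any particular gist in this thread):

```
layer {
  name: "conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  convolution_param {
    num_output: 64
    kernel_size: 3
    pad: 1
    # "xavier" scales the init variance by fan-in, so no hand-tuned std is needed
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}
```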
Hi guys. I don't know if you still have problems getting VGGNet to converge, but for me initializing the weights did the trick, as @gheinrich suggested. However, I used the standard initialization as it is done in AlexNet.
Thanks! Can you post your .prototxt? Did you use Gaussian initialization? Xavier or MSRA initializations should perform better (and you don't have to specify the standard deviation of the distribution with these). Some toy examples there.
You can find my .prototxt here. Yes, I used Gaussian. I trained on about 100,000 images (80% train, 20% val) at 64x64 pixels with a batch size of 100. I used standard SGD, gamma and LR. The dataset is private; I don't know if it works on ImageNet, but I guess so. Note that the last output is 2 due to a binary classification problem; for ImageNet the fc8 layer should have an output of 1000.
I just noticed that I used the VGGNet from BMVC 2014. Sorry for that. I will give feedback after I have tried it with the 16-layer network on the same dataset.
As @gheinrich suggested, the VGGNet with 16 layers converges with the "xavier" weight initialization. You can find my train_val.prototxt file here. Note that I didn't train on the ImageNet dataset, but I had faced the same convergence problem and was able to fix it with the "xavier" weight initialization. Parameters: batch: 100, image: 64x64, SGD: 6%, gamma: 0.5, LR: 0.05. The last output is 2 due to a binary classification problem; for ImageNet the fc8 layer should have an output of 1000.
Thanks for the update. That is nicely in line with the VGG paper:
Quote:
The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and 10^-2 variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
Hi, I tried the train_val.prototxt posted by @lfrdm and it works, thanks. I added the lr_mult=10/20 and decay_mult=1/0 params for the weights/biases to the fc8 layer. I was now wondering why these params are missing in the train_val.prototxt, and whether setting them to the same values as, e.g., in CaffeNet, as I have done for fc8, makes sense.
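For context, lr_mult=10/20 and decay_mult=1/0 on the weights/biases corresponds to a pair of param blocks like the following (a sketch following CaffeNet's fine-tuning convention; the surrounding layer fields are illustrative):

```
layer {
  name: "fc8"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8"
  param { lr_mult: 10 decay_mult: 1 }   # weights: learn 10x faster than the base LR
  param { lr_mult: 20 decay_mult: 0 }   # biases: 20x the base LR, no weight decay
  inner_product_param {
    num_output: 1000   # 2 for a binary problem, 1000 for ImageNet
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}
```

The higher multipliers are the usual trick when fine-tuning: the freshly initialized last layer gets a larger learning rate than the pretrained layers.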
@igorbb you're right, in the train_val.prototxt, in all the pooling layers, the "pool: MAX" parameter is repeated twice. It must be a typo. After correcting this it seems to work.
On 23/03/16 at 00:36, igorbb wrote:
Hey @GiuliaP https://github.com/GiuliaP I am getting a parser error with @lfrdm https://github.com/lfrdm 's version. Can you share your gist?
Hi, I'm new to DIGITS and I'm experimenting with some datasets. When I tried the train_val.prototxt posted by @lfrdm with the changes mentioned by @GiuliaP (removing the repeated pool: MAX) I got this error message. Am I going wrong somewhere? AlexNet and GoogLeNet seem to be working fine.
ERROR: Check failed: error == cudaSuccess (2 vs. 0) out of memory
relu2_2 needs backward computation.
conv2_2 needs backward computation.
relu2_1 needs backward computation.
conv2_1 needs backward computation.
pool1 needs backward computation.
relu1_2 needs backward computation.
conv1_2 needs backward computation.
relu1_1 needs backward computation.
conv1_1 needs backward computation.
label_data_1_split does not need backward computation.
data does not need backward computation.
This network produces output accuracy
This network produces output loss
Network initialization done.
Solver scaffolding done.
Starting Optimization
Solving
Learning Rate Policy: step
Iteration 0, Testing net (#0)
Check failed: error == cudaSuccess (2 vs. 0) out of memory
You have to reduce the batch size (both train and test/val): as it says, the GPU is out of memory.
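The batch size lives in the data layers of the train_val.prototxt, one for each phase. A sketch of where to change it (the source path is illustrative; DIGITS fills it in for you):

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "train_lmdb"   # illustrative path; set by DIGITS
    backend: LMDB
    batch_size: 16         # reduce this until the network fits in GPU memory
  }
}
```

The TEST-phase data layer has its own batch_size and must be reduced as well, since the out-of-memory error above occurs at "Testing net (#0)".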
@GiuliaP Reduced it and works well now. Thank you.
Did it converge?
yes it did. I ran it for 10 epochs on a data set consisting of 10k color images with a batch size of 10. It took an hour to complete and gave me a validation accuracy of 92%.
Hi, I'm trying to use VGG in DIGITS. When I tried to create the model, I get the following error:
ERROR: Layer 'loss' references bottom 'label' at the TEST stage however this blob is not included at that stage. Please consider using an include directive to limit the scope of this layer.
I just copied the train_val.prototxt provided by lfrdm to custom network and deleted the duplicated pool: MAX. Any idea? Thanks in advance, M
@mizadyya Read the documentation on how custom networks in DIGITS work by clicking on the blue question mark above the box.
You probably want to add something like this to your loss layer:
exclude { stage: "deploy" }
@lukeyeager I also needed to add softmax layer to the end, in addition to softmax with loss. Now it's running fine. Thanks
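Putting those two fixes together, the tail of the network would look roughly like this (a sketch; DIGITS uses the "deploy" stage for inference, where no label blob exists, so the loss is excluded there and a plain Softmax produces the predictions instead):

```
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
  exclude { stage: "deploy" }   # labels don't exist at inference time
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "fc8"
  top: "softmax"
  include { stage: "deploy" }   # probabilities for inference only
}
```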
How much memory does it consume... Fits in 4gb card?
On 07-Jul-2016, at 9:15 AM, Ishant Mrinal Haloi notifications@github.com wrote:
I have tested this in Imagenet, it converges https://github.com/n3011/VGG_19_layers_Network
I couldn't make it work with batches as large as 5 256x256 images on a K520 with 4GB... and it also takes 5 days for 10 epochs on 18k images (fine-tuning)... maybe something is wrong with my EC2 instance? GPU utilization is constantly at 99%; memory peaked near 100% during initialization but quickly dropped to 60%, although larger batches made it fail for lack of memory (I ended up using batches of 3)...
Also can't train VGG-16. Maybe it's because small batch size or solver settings(I use default DIGITS settings)? My dataset is from this kaggle competition: https://www.kaggle.com/c/dogs-vs-cats Here is my network definition: https://gist.github.com/mrgloom/fec835c5570e739eff8c18a343bdd7db
Seems that it was a small-batch problem; I successfully trained VGG-16 with batch size 24 and batch accumulation 2, so as I understand my effective batch size was 48?
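Yes: in Caffe, batch accumulation is the solver's iter_size setting. Gradients from iter_size forward/backward passes are accumulated before each weight update, so the effective batch size is batch_size x iter_size (here 24 x 2 = 48), while GPU memory only has to hold one mini-batch at a time. A sketch of the relevant solver.prototxt fragment (every value except iter_size is illustrative):

```
# solver.prototxt (fragment)
net: "train_val.prototxt"
iter_size: 2           # accumulate gradients over 2 mini-batches per update
base_lr: 0.01          # illustrative solver settings, not from this thread
lr_policy: "step"
gamma: 0.1
stepsize: 10000
momentum: 0.9
weight_decay: 0.0005
solver_mode: GPU
```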
Here is the models and logs downloaded from DIGITS: https://github.com/mrgloom/kaggle-dogs-vs-cats-solution/tree/master/learning_from_scratch/Models/VGG-16 https://github.com/mrgloom/kaggle-dogs-vs-cats-solution/tree/master/learning_from_scratch/Models/VGG-19
Here is my prototxt, seems to work correctly.
Hi, I'm fairly new to DIGITS and to Caffe, and I have been trying to fine-tune VGG for the past few weeks without results. I used the prototxt posted by @lfrdm, setting the lr_mult parameters of the last layer to the values suggested by @GiuliaP and the lr_mult of the rest of the layers to 0. However, when running it in DIGITS it does not converge: it goes from 20% accuracy to 55% and stays like that during the whole training. I've tried several learning rates, from 0.01 to 0.0005, without success. My dataset consists of 8500 images for training and 1700 for validation, split into 5 classes. Could anyone give me a hand on this?
Hi @Elviish since your question isn't related to getting VGG to load in DIGITS but how to train it, can you post this question on the DIGITS users list (https://groups.google.com/forum/#!forum/digits-users).
Hi @lfrdm, I was looking for train_val files for vgg from bmvc 2014. I see that you have two commits for that file. Is the older one for bmvc version?
@lfrdm any chance you could post your prototxt file for VGG again, seems to be down, Thanks!
Echoing a request for this prototxt file for VGG.. can't seem to find one!
Similar to LeNet, AlexNet, GoogLeNet... it would be good if VGG net were also added as one of the default networks to select from.