NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Add VGG-16 net as one of the default networks #159

Open jmozah opened 9 years ago

jmozah commented 9 years ago

Similar to LeNet, AlexNet, GoogLeNet... it would be good if VGG net were also added as one of the default networks to select from.

lukeyeager commented 9 years ago

Last I checked, there wasn't a publicly available version of their train_val.prototxt. Lots of people have asked for it: https://gist.github.com/ksimonyan/fd8800eeb36e276cd6f9#comment-1430126 https://gist.github.com/ksimonyan/211839e770f7b538e2d8#comment-1346808 https://gist.github.com/ksimonyan/3785162f95cd2d5fee77#comment-1316301

I think they probably just don't have it anymore. If you want to put together a version that trains successfully on multiple datasets, then we can test it and get it added to DIGITS.

jmozah commented 9 years ago

Look at the bottom of this link... @karpathy has posted a link there: https://gist.github.com/ksimonyan/211839e770f7b538e2d8#file-readme-md

I will try and see if I can successfully train a version.

serafett commented 9 years ago

Hi @jmozah

Were you able to train VGG successfully? I think training using the pretrained model works but training from scratch does not converge.

If anyone has successfully trained VGG16 or VGG19 from scratch, can you share your solver and train_val files?

jmozah commented 9 years ago

No... The network failed after 1 epoch... Will check it next week and update.

saeedizadi commented 9 years ago

@jmozah Any success?

jmozah commented 9 years ago

No... not yet

groar commented 9 years ago

I use a train_val that I updated from an old one. It works with the 19-layer VGG (with a very small batch size). https://gist.github.com/groar/d455ebe671b2f1807659

I used it for fine-tuning, but never tried to train it from scratch. I could try.

lukeyeager commented 8 years ago

Update on this:

@graphific uploaded a train_val.prototxt in the comments for this gist. I tried it on a 20-class subset of ImageNet (which should be easier to solve than the full ImageNet dataset) and it totally failed to train (whereas AlexNet and GoogLeNet converge quickly every time).

[Image: training curve showing VGG failing to converge]

So, still no luck here :-/

gheinrich commented 8 years ago

It would probably help to add Xavier weight initialization for this kind of deep network. With the default weight initialization the odds of hitting a vanishing gradient in the first layers are high.
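For anyone trying this, a minimal sketch of what that change looks like in a Caffe train_val.prototxt (the layer and blob names are just examples; "msra" is the other common filler for deep ReLU nets):

    layer {
      name: "conv1_1"
      type: "Convolution"
      bottom: "data"
      top: "conv1_1"
      convolution_param {
        num_output: 64
        kernel_size: 3
        pad: 1
        # "xavier" scales the initial weights by fan-in, so no std needs to be specified
        weight_filler { type: "xavier" }
        bias_filler { type: "constant" value: 0 }
      }
    }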

lfrdm commented 8 years ago

Hi guys. I don't know if you still have problems getting VGGNet to converge, but for me initializing the weights did the trick, as @gheinrich suggested. Though I used the standard initialization as it is done in AlexNet.

gheinrich commented 8 years ago

Thanks! Can you post your .prototxt? Did you use Gaussian initialization? Xavier or MSRA initializations should perform better (and you don't have to specify the standard deviation of the distribution with these). Some toy examples there.

lfrdm commented 8 years ago

You can find my .prototxt here. Yes, I used Gaussian. I trained on about 100,000 images (80% train, 20% val) at 64x64 px with a batch size of 100. I used the standard SGD, gamma and learning rate. The dataset is private; I don't know whether it works on ImageNet, but I guess so. Note that the last output is 2 due to a binary classification problem; for ImageNet the fc8 layer should have an output of 1000.

I just noticed that I used the VGGNet from BMVC 2014. Sorry for that. I will give feedback after I have tried the 16-layer network on the same dataset.

lfrdm commented 8 years ago

As @gheinrich suggested, the VGGNet with 16 layers converges with the "xavier" weight initialization. You can find my train_val.prototxt file here. Note that I didn't train on the ImageNet dataset, but I had faced the same problem with convergence and was able to fix it with the "xavier" weight initialization. Parameters: Batch: 100, Image: 64x64, SGD: 6%, Gamma: 0.5, LR: 0.05. The last output is 2 due to a binary classification problem; for ImageNet the fc8 layer should have an output of 1000.
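For reference, a rough Caffe solver sketch matching those settings, assuming "SGD: 6%" refers to the DIGITS step size as a percentage of the total training iterations (the path, max_iter and momentum values below are placeholders, not part of the original comment):

    # solver.prototxt sketch
    net: "train_val.prototxt"
    base_lr: 0.05          # LR: 0.05
    lr_policy: "step"
    gamma: 0.5             # Gamma: 0.5
    stepsize: 1500         # assumed: ~6% of max_iter, matching the "SGD: 6%" step size
    max_iter: 25000
    momentum: 0.9
    solver_mode: GPU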

gheinrich commented 8 years ago

Thanks for the update. That is nicely in line with the VGG paper:

Quote:

The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and 10^-2 variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).

GiuliaP commented 8 years ago

Hi, I tried the train_val.prototxt posted by @lfrdm and it works, thanks. I added the lr_mult=10/20 and decay_mult=1/0 params for the weights/biases to the fc8 layer. I am now wondering why these params are missing in the train_val.prototxt and whether setting them to the same values as, e.g., in CaffeNet, as I have done for fc8, makes sense.
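For anyone following along, the fc8 change described above would look roughly like this (a sketch; only the param blocks are the point, the rest mirrors a usual CaffeNet-style fc8):

    layer {
      name: "fc8"
      type: "InnerProduct"
      bottom: "fc7"
      top: "fc8"
      # boost the learning rate of the freshly initialised last layer,
      # as in the CaffeNet fine-tuning examples
      param { lr_mult: 10 decay_mult: 1 }   # weights
      param { lr_mult: 20 decay_mult: 0 }   # biases
      inner_product_param {
        num_output: 2   # 1000 for ImageNet
        weight_filler { type: "xavier" }
        bias_filler { type: "constant" value: 0 }
      }
    }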

GiuliaP commented 8 years ago

@igorbb you're right, in the train_val.prototxt, in all the pooling layers, the "pool: MAX" parameter is repeated twice. It must be a typo. After correcting this it seems to work.
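For reference, a corrected pooling layer declares pool: MAX only once (a sketch; layer and blob names are illustrative):

    layer {
      name: "pool1"
      type: "Pooling"
      bottom: "conv1_2"
      top: "pool1"
      pooling_param {
        pool: MAX        # listed once; duplicating this non-repeated field is what trips the prototxt parser
        kernel_size: 2
        stride: 2
      }
    }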

On 23/03/16 00:36, igorbb wrote:

Hey @GiuliaP, I am getting a parser error with @lfrdm's version. Can you share your gist?

hariprasadravi commented 8 years ago

Hi, I'm new to DIGITS and I'm experimenting with some datasets. When I tried the train_val.prototxt posted by @lfrdm with the changes mentioned by @GiuliaP (removing the repeated pool: MAX) I got this error message. Am I going wrong somewhere? AlexNet and GoogLeNet seem to be working fine.

ERROR: Check failed: error == cudaSuccess (2 vs. 0) out of memory

    relu2_2 needs backward computation.
    conv2_2 needs backward computation.
    relu2_1 needs backward computation.
    conv2_1 needs backward computation.
    pool1 needs backward computation.
    relu1_2 needs backward computation.
    conv1_2 needs backward computation.
    relu1_1 needs backward computation.
    conv1_1 needs backward computation.
    label_data_1_split does not need backward computation.
    data does not need backward computation.
    This network produces output accuracy
    This network produces output loss
    Network initialization done.
    Solver scaffolding done.
    Starting Optimization
    Solving
    Learning Rate Policy: step
    Iteration 0, Testing net (#0)
    Check failed: error == cudaSuccess (2 vs. 0) out of memory

GiuliaP commented 8 years ago

You have to reduce the batch size (both train and test/val): as it says, the GPU is out of memory.
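A note for other readers: the batch size can usually be lowered in the batch size field of the DIGITS model form; if your train_val.prototxt defines its own data layers, it is the batch_size field there, roughly (a sketch; the source path is a placeholder):

    layer {
      name: "data"
      type: "Data"
      top: "data"
      top: "label"
      include { phase: TRAIN }
      data_param {
        source: "train_lmdb"   # placeholder path
        backend: LMDB
        batch_size: 10         # lower this until the network fits in GPU memory
      }
    }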

hariprasadravi commented 8 years ago

@GiuliaP Reduced it and works well now. Thank you.

jmozah commented 8 years ago

Did it converge?

hariprasadravi commented 8 years ago

Yes, it did. I ran it for 10 epochs on a dataset consisting of 10k color images with a batch size of 10. It took an hour to complete and gave me a validation accuracy of 92%.

ghost commented 8 years ago

Hi, I'm trying to use VGG in DIGITS. When I try to create the model, I get the following error:

ERROR: Layer 'loss' references bottom 'label' at the TEST stage however this blob is not included at that stage. Please consider using an include directive to limit the scope of this layer.

I just copied the train_val.prototxt provided by @lfrdm into the custom network field and deleted the duplicated pool: MAX. Any idea? Thanks in advance, M

lukeyeager commented 8 years ago

@mizadyya Read the documentation on how custom networks in DIGITS work by clicking on the blue question mark above the box.

You probably want to add something like this to your loss layer:

  exclude { stage: "deploy" }

Example: https://github.com/NVIDIA/DIGITS/blob/digits-4.0/digits/standard-networks/caffe/lenet.prototxt#L162-L184

ghost commented 8 years ago

@lukeyeager I also needed to add a Softmax layer at the end, in addition to SoftmaxWithLoss. Now it's running fine. Thanks.
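Putting both fixes together, the tail of the network ends up looking roughly like the DIGITS LeNet example linked above (a sketch, assuming the last fully-connected layer is called fc8):

    layer {
      name: "loss"
      type: "SoftmaxWithLoss"
      bottom: "fc8"
      bottom: "label"
      top: "loss"
      # "label" does not exist at deploy time, so keep this layer out of that stage
      exclude { stage: "deploy" }
    }
    layer {
      name: "accuracy"
      type: "Accuracy"
      bottom: "fc8"
      bottom: "label"
      top: "accuracy"
      include { stage: "val" }
    }
    layer {
      name: "softmax"
      type: "Softmax"
      bottom: "fc8"
      top: "softmax"
      # plain Softmax provides class probabilities for inference
      include { stage: "deploy" }
    }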

jmozah commented 8 years ago

How much memory does it consume... Does it fit in a 4GB card?

On 07-Jul-2016, at 9:15 AM, Ishant Mrinal Haloi notifications@github.com wrote:

I have tested this on ImageNet, it converges: https://github.com/n3011/VGG_19_layers_Network

Motherboard commented 8 years ago

I couldn't make it work with batches as big as 5 256x256 images on a K520 with 4GB... And it also takes 5 days for 10 epochs on 18k images (finetuning)... maybe something is wrong with my EC2? GPU utilization is 99% constantly, memory peaked during initialization to near 100%, but quickly dropped to 60%... although larger batches made it fail for lack of memory (ended up using batches of 3)...

mrgloom commented 8 years ago

Also can't train VGG-16. Maybe it's because of the small batch size or the solver settings (I use the default DIGITS settings)? My dataset is from this Kaggle competition: https://www.kaggle.com/c/dogs-vs-cats Here is my network definition: https://gist.github.com/mrgloom/fec835c5570e739eff8c18a343bdd7db

mrgloom commented 8 years ago

Seems that was a small-batch problem. I successfully trained VGG-16 with batch size 24 and batch accumulation 2, so as I understand it my effective batch size was 48? (See the solver note below.)

Here are the models and logs downloaded from DIGITS: https://github.com/mrgloom/kaggle-dogs-vs-cats-solution/tree/master/learning_from_scratch/Models/VGG-16 https://github.com/mrgloom/kaggle-dogs-vs-cats-solution/tree/master/learning_from_scratch/Models/VGG-19
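For other readers: batch accumulation corresponds to Caffe's iter_size solver option, which accumulates gradients over several forward/backward passes before each weight update, so the effective batch is batch_size x iter_size (here 24 x 2 = 48). A minimal solver excerpt:

    # solver.prototxt excerpt (other settings omitted)
    iter_size: 2    # accumulate gradients over 2 mini-batches of 24 -> effective batch 48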

HolmesShuan commented 8 years ago

Here is my prototxt; it seems to work correctly.

eamadord commented 8 years ago

Hi, I'm fairly new to DIGITS and to Caffe, and I have been trying to fine-tune VGG for the past few weeks without results. I used the prototxt posted by @lfrdm, setting the lr_mult parameters of the last layer to the values suggested by @GiuliaP and the lr_mult of the rest of the layers to 0. However, when running it in DIGITS it does not converge: it goes from 20% accuracy to 55% and stays there for the whole training. I've tried several learning rates, from 0.01 to 0.0005, without success. My dataset consists of 8500 images for training and 1700 for validation, split into 5 classes. Could anyone give me a hand with this?

gheinrich commented 8 years ago

Hi @Elviish, since your question isn't about getting VGG to load in DIGITS but about how to train it, can you post it on the DIGITS users list (https://groups.google.com/forum/#!forum/digits-users)?

aytackanaci commented 7 years ago

Hi @lfrdm, I was looking for train_val files for the VGG net from BMVC 2014. I see that you have two commits for that file. Is the older one the BMVC version?

aaron276h commented 7 years ago

@lfrdm any chance you could post your prototxt file for VGG again? The link seems to be down. Thanks!

gaving commented 7 years ago

Echoing the request for this prototxt file for VGG... can't seem to find one!