fengfu-chris / caffe-twns

Implementation of Ternary Weight Networks In Caffe
https://arxiv.org/abs/1605.04711

question about training CIFAR10 #19

Open namedBen opened 6 years ago

namedBen commented 6 years ago

Hello! When I train CIFAR-10 with the network structure and initial learning rate given in the paper, training fails: the loss explodes and goes straight to nan. Your VGG7 reference architecture is 2×(128-C3) + MP2 + 2×(256-C3) + MP2 + 2×(512-C3) + MP2 + 1024-FC + Softmax. I would like to ask two questions:

  1. Before the 1024-FC layer the feature map has size batch × 512 × 4 × 4. How does this 1024-FC get from 8192 dimensions down to 10?
  2. Also, with the BPWNs architecture (2×1024-FC) − 10-SVM and your reference base_lr = 0.1, the training loss is nan. What could be causing this?

Thanks!

fengfu-chris commented 6 years ago

@namedBen My analysis:

  1. The input to the 1024-FC layer is the previous layer's 512×4×4 output flattened into a row vector (8192 dimensions), and its output is 1024 nodes. This is followed by another fully connected layer with input size 1024 and output size equal to the number of classes, here 10, and then a Softmax layer is applied (a minimal sketch follows this list).
  2. When the loss explodes, consider lowering the learning rate. For example, on MNIST, start from lr = 0.01.
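
A minimal PyTorch sketch of point 1, assuming a recent PyTorch with `nn.Flatten` (layer names are illustrative, not taken from this repo):

```python
import torch
import torch.nn as nn

# Rough sketch of the classifier head: flatten 512 x 4 x 4 = 8192 features,
# map 8192 -> 1024 (the 1024-FC layer), then 1024 -> 10 classes, then Softmax.
head = nn.Sequential(
    nn.Flatten(),            # (N, 512, 4, 4) -> (N, 8192)
    nn.Linear(8192, 1024),   # 1024-FC
    nn.Linear(1024, 10),     # 10 output classes for CIFAR-10
)

feats = torch.randn(4, 512, 4, 4)          # dummy feature maps from the last conv stage
probs = torch.softmax(head(feats), dim=1)  # shape (4, 10); each row sums to 1
```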
namedBen commented 6 years ago

First of all, thank you for your reply. On the first point, that is how I implemented it as well. On the second point, I am currently training CIFAR-10 with VGG7, but when I train full precision weight networks with the training strategy given in your paper, using initial learning rate = 0.1 and optimizer = SGD, the loss explodes (already within epoch 1). How exactly did you implement the FPWNs to reach 92.88?

fengfu-chris commented 6 years ago

Could you paste your main training parameters and the log output?

namedBen commented 6 years ago

The network model definition is as follows:

    (module): VGG(
      (layer): Sequential(
        (0): BasicBlock( (conv1): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) )
        (1): BasicBlock( (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True) )
        (2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
        (3): BasicBlock( (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True) )
        (4): BasicBlock( (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True) )
        (5): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
        (6): BasicBlock( (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True) )
        (7): BasicBlock( (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True) )
        (8): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
      )
      (classifier1): Linear(in_features=8192, out_features=1024)
      (classifier3): Linear(in_features=1024, out_features=10)
    )

The training log is as follows:

    Training [epoch:1, iter:1, load:10/50000] LR:1.0e-01 Loss: 2.297 | Acc: 30.000%
    Training [epoch:1, iter:2, load:20/50000] LR:1.0e-01 Loss: 88.866 | Acc: 20.000%
    Training [epoch:1, iter:3, load:30/50000] LR:1.0e-01 Loss: 205.516 | Acc: 16.667%
    Training [epoch:1, iter:4, load:40/50000] LR:1.0e-01 Loss: 8440.584 | Acc: 12.500%
    Training [epoch:1, iter:5, load:50/50000] LR:1.0e-01 Loss: 214870.594 | Acc: 10.000%
    Training [epoch:1, iter:6, load:60/50000] LR:1.0e-01 Loss: 195956608.000 | Acc: 10.000%
    Training [epoch:1, iter:7, load:70/50000] LR:1.0e-01 Loss: 110850178285568.000 | Acc: 8.571%
    Training [epoch:1, iter:8, load:80/50000] LR:1.0e-01 Loss: 16265785521874871328440320.000 | Acc: 7.500%
    Training [epoch:1, iter:9, load:90/50000] LR:1.0e-01 Loss: nan | Acc: 7.778%
    Training [epoch:1, iter:10, load:100/50000] LR:1.0e-01 Loss: nan | Acc: 9.000%
    Training [epoch:1, iter:11, load:110/50000] LR:1.0e-01 Loss: nan | Acc: 8.182%
    Training [epoch:1, iter:12, load:120/50000] LR:1.0e-01 Loss: nan | Acc: 8.333%
    Training [epoch:1, iter:13, load:130/50000] LR:1.0e-01 Loss: nan | Acc: 8.462%

Training hyperparameters: SGD with base_lr = 0.1, momentum = 0.9, weight_decay = 1e-4; batch_size = 10.
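
For reference, a minimal sketch of how these hyperparameters map onto a standard PyTorch optimizer (the `model` here is only a stand-in, not code from the thread):

```python
import torch
import torch.nn as nn

model = nn.Linear(8192, 10)  # stand-in; in the thread this would be the VGG7 model above

# Sketch of the reported optimizer settings.
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,            # base_lr
                            momentum=0.9,
                            weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()              # standard loss in front of a Softmax classifier
```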

fengfu-chris commented 6 years ago

Is this implemented in PyTorch? There are a few differences from the original Caffe repo:

  1. I use a batch size of 100 (vs. 10);
  2. my conv layers have a bias term (vs. no bias);
  3. the BatchNorm implementation is different (this is the key point!!!) (see the sketch after this list).
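
For illustration only, a rough PyTorch sketch of what points 2 and 3 could look like; this is an assumption about the Caffe-style layout, not code from this repo. Caffe typically splits batch normalization into a `BatchNorm` layer that only normalizes and a separate `Scale` layer with learnable per-channel scale and shift, whereas PyTorch's `BatchNorm2d` fuses both.

```python
import torch
import torch.nn as nn

class CaffeLikeBlock(nn.Module):
    """Illustrative conv block: conv with a bias term (point 2), and batch norm
    split Caffe-style into normalization only (BatchNorm2d with affine=False)
    plus a separate learnable scale/shift, mirroring Caffe's Scale layer (point 3).
    Caffe's handling of the running statistics also differs; not modelled here."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=True)  # point 2
        self.bn = nn.BatchNorm2d(out_ch, eps=1e-5, affine=False)   # normalization only
        self.gamma = nn.Parameter(torch.ones(1, out_ch, 1, 1))     # Scale layer: scale
        self.beta = nn.Parameter(torch.zeros(1, out_ch, 1, 1))     # Scale layer: shift

    def forward(self, x):
        return self.bn(self.conv(x)) * self.gamma + self.beta      # activation omitted

# Point 1 is just the data loader setting, e.g. (cifar10_train is hypothetical):
# train_loader = torch.utils.data.DataLoader(cifar10_train, batch_size=100, shuffle=True)
```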
namedBen commented 6 years ago

Yes, it is PyTorch.

  1. I have tried batch sizes of both 100 and 10; both give nan.
  2. I did consider giving the conv layers a bias at first, but then decided the quantization should drop the bias so that the effect of quantizing the weights can be evaluated accurately.
  3. I really had not noticed that one, thanks for the reminder! I will retrain either in Caffe or with a modified PyTorch BN layer and see how it goes.

Thanks again for your answers! (finger heart)