dmlc / cxxnet (moved forward to https://github.com/dmlc/mxnet)

problems when training ImageNet #241

Open · gclouding opened this issue 9 years ago

gclouding commented 9 years ago

I installed cxxnet successfully and it runs fine on the MNIST dataset. When I run cxxnet on the ImageNet dataset it also works, but it can hang unexpectedly and at random points: sometimes while loading mean_224.bin, sometimes after training some number of batches, as shown below. No error is printed; the batch count and elapsed time simply stop updating, and the process stays that way for more than several hours, so I can only kill it with Ctrl+C. The machine has enough memory: about 5 GB is used by cxxnet and 2 GB is still free.

[screenshot of the stalled training output]

I used kaiming.conf as shown below when running cxxnet.

```
# Configuration for ImageNet
# Acknowledgement:
# Ref: He, Kaiming, and Jian Sun. "Convolutional Neural Networks at Constrained Time Cost." CVPR 2015
# J' model in the paper above

data = train
iter = imgrec
image_rec = "train.bin"
image_mean = "mean_224.bin"
rand_crop=1
rand_mirror=1
min_crop_size=192
max_crop_size=224
max_aspect_ratio=0.3
iter = threadbuffer
iter = end

eval = test
iter = imgrec
image_rec = "test.bin"
image_mean = "mean_224.bin"
iter = end

# Stage 1
netconfig=start
layer[0->1] = conv:conv1
kernel_size = 7
stride = 2
nchannel = 64
layer[1->2] = relu:relu1
layer[2->3] = max_pooling
kernel_size = 3

# Stage 2
layer[3->4] = conv:conv2
nchannel = 128
kernel_size = 2
stride = 3
layer[4->5] = relu:relu2

layer[5->6] = conv:conv3
nchannel = 128
kernel_size = 2
pad = 1
layer[6->7] = relu:relu3

layer[7->8] = conv:conv4
nchannel = 128
kernel_size = 2
layer[8->9] = relu:relu4

layer[9->10] = conv:conv5
nchannel = 128
kernel_size = 2
pad = 1
layer[10->11] = relu:relu5

layer[11->12] = max_pooling:pool1
kernel_size = 3

# Stage 3
layer[12->13] = conv:conv6
nchannel = 256
kernel_size = 2
stride = 2
layer[13->14] = relu:relu6

layer[14->15] = conv:conv7
nchannel = 256
kernel_size = 2
pad = 1
layer[15->16] = relu:relu7

layer[16->17] = conv:conv8
nchannel = 256
kernel_size = 2
layer[17->18] = relu:relu8

layer[18->19] = conv:conv9
nchannel = 256
kernel_size = 2
pad = 1
layer[19->20] = relu:relu9

layer[20->21] = max_pooling:pool2
kernel_size = 3

# Stage 4
layer[21->22] = conv:conv10
nchannel = 2304
kernel_size = 2
stride = 3
layer[22->23] = relu:relu10

layer[23->24] = conv:conv11
nchannel = 256
kernel_size = 2
pad = 1
layer[24->25] = relu:relu11

# Stage 5
layer[25->26,27,28,29] = split:split1
layer[26->30] = max_pooling:pool3
kernel_size = 1
stride = 1
layer[27->31] = max_pooling:pool4
kernel_size = 2
stride = 2
layer[28->32] = max_pooling:pool5
kernel_size = 3
stride = 3
layer[29->33] = max_pooling:pool6
kernel_size = 6
stride = 6

layer[30->34] = flatten:f1
layer[31->35] = flatten:f2
layer[32->36] = flatten:f3
layer[33->37] = flatten:f4
layer[34,35,36,37->38] = concat:concat1

# Stage 6
layer[38->39] = fullc:fc1
nhidden = 4096
layer[39->40] = relu:relu12
layer[40->40] = dropout
threshold = 0.5

layer[40->41] = fullc:fc2
nhidden = 4096
layer[41->42] = relu:relu13
layer[42->42] = dropout
threshold = 0.5

layer[42->43] = fullc:fc3
nhidden = 1000
layer[43->43] = softmax:softmax1
netconfig=end

# evaluation metric
metric = rec@1
metric = rec@5

max_round = 100
num_round = 100

# input shape, not including batch
input_shape = 3,224,224

batch_size = 256

# global parameters in any section outside netconfig and iter
momentum = 0.9
wmat:lr = 0.01
wmat:wd = 0.0005
bias:wd = 0.000
bias:lr = 0.02

# all the learning rate schedule settings start with lr
lr:schedule = factor
lr:gamma = 0.1
lr:step = 300000

save_model=1
model_dir=models
print_step=1

# random config
random_type = xavier

# new line
dev = gpu:0,1
dev = cpu
```
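For reference, the `factor` schedule configured above multiplies the learning rate by `lr:gamma` every `lr:step` updates. A minimal sketch of that arithmetic (assuming, as the field names suggest, that the factor is applied once per `lr:step` weight updates):

```python
def factor_schedule(base_lr, gamma, step, num_update):
    """Learning rate after num_update weight updates: base_lr * gamma ** (num_update // step)."""
    return base_lr * gamma ** (num_update // step)

# With the values from kaiming.conf: wmat:lr = 0.01, lr:gamma = 0.1, lr:step = 300000
print(factor_schedule(0.01, 0.1, 300000, 0))        # 0.01
print(factor_schedule(0.01, 0.1, 300000, 300000))   # ~0.001
print(factor_schedule(0.01, 0.1, 300000, 600000))   # ~0.0001
```

At batch_size = 256, 300000 updates is roughly 60 passes over ImageNet-1k (~1.28M images), so this config decays the rate fairly late in the 100-round run.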

antinucleon commented 9 years ago

Sorry, we are going to deprecate the cxxnet project and fully transition to MXNet in the next few days (or next week). We are still working on some final docs and final debugging/features. If you like, you may take a look at https://github.com/dmlc/mxnet/blob/master/example/imagenet/README.md
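For anyone migrating, here is a rough sketch of an equivalent training-data iterator in MXNet's Python API, based on the `iter = imgrec` section of the config above. Parameter names follow `mx.io.ImageRecordIter`; the file names are just the ones from kaiming.conf, and exact argument support may vary by MXNet version:

```python
import mxnet as mx

# Approximate MXNet counterpart of the cxxnet imgrec + threadbuffer iterators;
# ImageRecordIter does its own threaded decoding and prefetching.
train_iter = mx.io.ImageRecordIter(
    path_imgrec="train.bin",     # record file, as in the cxxnet config
    mean_img="mean_224.bin",     # precomputed mean image
    data_shape=(3, 224, 224),    # input_shape, not including batch
    batch_size=256,
    rand_crop=True,
    rand_mirror=True,
    min_crop_size=192,
    max_crop_size=224,
    max_aspect_ratio=0.3,
)

for batch in train_iter:
    pass  # feed batch.data / batch.label into the training loop
```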