Maybe you should take the fc8 features like this: net.blobs['fc8'].data[4].copy()
@jiangdong123 Thank you for your advice! I'll try it. This is my previous code; I take the output of fc8 like this (for a batch of BATCH_SIZE test images):
net.set_input_arrays(
    data4D.astype(np.float32), data4DL.astype(np.float32))
pred = net.forward()
for i in range(0, BATCH_SIZE):
    for c in range(0, 28):
        pred_normal[i][c] = pred['fc8'][i][c][0][0]
print pred_normal
Is there any mistake?
The loss variation looks very strange. What causes the loss to change periodically?
Probably your training data is not randomized.
Sergio
@jiangdong123
As @sguada said, the training data is not randomized. I'll randomize them before training.
@sguada @jiangdong123 The reason is that the labels are interpreted improperly! For example, the following two groups of 28-dim labels:
label_1 = { 189, 116, 165, 259, 95, 144, 122, 151, 88, 125, 218, 160, 68, 32, 95, 110, 165, 266, 123, 32, 151, 182, 189, 284, 294, 218, 173, 157 }
label_2 = { 64, 71, 91, 115, 126, 105, 24, 51, 92, 144, 170, 197, 114, 132, 188, 138, 97, 103, 148, 201, 20, 29, 30, 39, 68, 99, 34, 22 }
are interpreted as:
label_1 = { 189, 0, 0, 0, 116, 0, 0, 0, 165, 0, 0, 0, 3, 1, 0, 0, 95, 0, 0, 0, 144, 0, 0, 0, 122, 0, 0, 0 }
label_2 = { 64, 0, 0, 0, 71, 0, 0, 0, 91, 0, 0, 0, 115, 0, 0, 0, 126, 0, 0, 0, 105, 0, 0, 0, 24, 0, 0, 0 }
It seems that CAFFE reads the input labels byte by byte. As a result, 259 is read as 3,1,0,0. On a little-endian machine, 259 is stored as the bytes 3,1,0,0 in memory.
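To illustrate the byte interpretation (a small sketch, not from the original thread):

import struct

# A 32-bit int label such as 259 occupies 4 bytes in little-endian order.
raw = struct.pack('<i', 259)
print(list(bytearray(raw)))   # [3, 1, 0, 0] -- exactly what appears when the buffer is read byte by byte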
Previously, my labels were stored to LMDB in this way:
int datum_size = sizeof(int)*28;
data_file.read(str_buffer, datum_size);
...
datum.set_data(str_buffer, datum_size);
datum.SerializeToString(&value);
...
mdb_data.mv_data = reinterpret_cast<void*>(&value[0]);
mdb_put(mdb_txn, mdb_dbi, &mdb_key, &mdb_data, 0);
CAFFE uses a C++ template: template <typename Dtype>. How can I specify the Dtype to be int?
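For what it's worth, one way around the byte-encoded data field is to store each 28-dim label vector in the Datum's float_data field instead. A minimal sketch, assuming the Python lmdb package and pycaffe; the path and key format here are arbitrary placeholders:

import lmdb
import numpy as np
from caffe.proto import caffe_pb2

labels = np.zeros((22000, 28), dtype=np.float32)  # placeholder for the real 28-dim labels

env = lmdb.open('label_lmdb', map_size=1 << 30)
with env.begin(write=True) as txn:
    for i, vec in enumerate(labels):
        datum = caffe_pb2.Datum()
        datum.channels, datum.height, datum.width = 28, 1, 1
        datum.float_data.extend(vec.tolist())       # stored as floats, not raw bytes
        txn.put('{:08d}'.format(i).encode('ascii'), datum.SerializeToString())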
I have corrected the labels, changed the input type to float and randomized the training samples, but the problem is still there. One period == 2400 iterations. A period therefore processes 2400*30 = 72000 images. There are 22000 training images, so one period is equivalent to 72000/22000 ≈ 3.3 epochs.
When you shuffle the training data, did you make sure the labels stay aligned?
Can you increase the batch size? Also try to increase the dropout.
Sergio
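As a side note (not from the original thread): a simple way to make sure shuffled labels stay aligned is to apply one shared permutation to both arrays before writing the LMDBs, e.g. with numpy:

import numpy as np

N = 22000                                              # number of training samples mentioned above
images = np.zeros((N, 3, 220, 220), dtype=np.float32)  # placeholder for the real image array
labels = np.zeros((N, 28), dtype=np.float32)           # placeholder for the 28-dim labels

perm = np.random.permutation(N)
images, labels = images[perm], labels[perm]            # one shared permutation keeps labels aligned with images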
Thank you @sguada.
@mender05 you could also try https://github.com/shelhamer/caffe/tree/accum-grad to allow a bigger effective batch size by doing several iterations before updating the gradients.
@sguada I have tried this branch. But what parameters should be set to enable a bigger batch size? After batch_size was increased from 30 to 35, it ran out of memory:
F1126 16:43:02.337970 7332 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
In the solver.prototxt add
iter_size: 2
That means it will do 2 iterations of batch_size: 30 before updating the weights, so effectively you would be using a batch_size of 60. You can change your batch_size and iter_size to define the desired effective batch_size.
It is so strange. After increasing the batch_size from 30 to 60, the loss variation pattern changed, but it is still periodic.
There must be something weird with your data; the loss decreases very quickly and then oscillates periodically. Could you shuffle your data again?
Sergio
I use the snapshot at the 2000th iteration to predict; the outputs are all the same:
array([[ 0.49006659, 0.48892561, 0.49674234, 0.52244973, 0.52458155,
0.52957731, 0.46845111, 0.47450158, 0.49067837, 0.52837992,
0.53714836, 0.54056102, 0.52498746, 0.50657398, 0.53844237,
0.5057267 , 0.42278934, 0.42133904, 0.50450838, 0.5381543 ,
0.45289528, 0.42029274, 0.37055418, 0.36709356, 0.41887969,
0.44862145, 0.32116845, 0.36128747],
[ 0.49006659, 0.48892561, 0.49674234, 0.52244973, 0.52458155,
0.52957731, 0.46845111, 0.47450158, 0.49067837, 0.52837992,
0.53714836, 0.54056102, 0.52498746, 0.50657398, 0.53844237,
0.5057267 , 0.42278934, 0.42133904, 0.50450838, 0.5381543 ,
0.45289528, 0.42029274, 0.37055418, 0.36709356, 0.41887969,
0.44862145, 0.32116845, 0.36128747],
[ 0.49006659, 0.48892561, 0.49674234, 0.52244973, 0.52458155,
0.52957731, 0.46845111, 0.47450158, 0.49067837, 0.52837992,
0.53714836, 0.54056102, 0.52498746, 0.50657398, 0.53844237,
0.5057267 , 0.42278934, 0.42133904, 0.50450838, 0.5381543 ,
0.45289528, 0.42029274, 0.37055418, 0.36709356, 0.41887969,
0.44862145, 0.32116845, 0.36128747],
Outputs of the final model at the 36500th iteration:
array([[ 0.482418 , 0.48542902, 0.49439543, 0.52315784, 0.52507049,
0.52752018, 0.47199821, 0.47462174, 0.49217641, 0.52927047,
0.54133612, 0.54410964, 0.52102458, 0.50839245, 0.53855455,
0.5059635 , 0.41948465, 0.4194364 , 0.50593352, 0.53848571,
0.44772175, 0.41696107, 0.36593205, 0.36593369, 0.41697961,
0.44766867, 0.31933263, 0.36117038],
[ 0.482418 , 0.48542902, 0.49439543, 0.52315784, 0.52507049,
0.52752018, 0.47199821, 0.47462174, 0.49217641, 0.52927047,
0.54133612, 0.54410964, 0.52102458, 0.50839245, 0.53855455,
0.5059635 , 0.41948465, 0.4194364 , 0.50593352, 0.53848571,
0.44772175, 0.41696107, 0.36593205, 0.36593369, 0.41697961,
0.44766867, 0.31933263, 0.36117038],
[ 0.482418 , 0.48542902, 0.49439543, 0.52315784, 0.52507049,
0.52752018, 0.47199821, 0.47462174, 0.49217641, 0.52927047,
0.54133612, 0.54410964, 0.52102458, 0.50839245, 0.53855455,
0.5059635 , 0.41948465, 0.4194364 , 0.50593352, 0.53848571,
0.44772175, 0.41696107, 0.36593205, 0.36593369, 0.41697961,
0.44766867, 0.31933263, 0.36117038],
Hi @mender05, do you mind showing some code for how you make predictions on the test data and get the array in your last post?
Hi, I have the same problem. I am using regression for video processing, and therefore I used 9 consecutive frames as input to the network. I changed convert_imageset.cpp to store the data as 9 frames in each blob, reading the data in train_val.prototxt as:
name: "CaffeNet"
layers {
name: "data"
type: DATA
top: "data"
top: "label"
data_param {
source: "examples/project/train_lmdb"
backend: LMDB
batch_size: 256
}
transform_param {
crop_size: 227
mean_file: "examples/project//train_mean.binaryproto"
mirror: true
}
include: { phase: TRAIN }
}
layers {
name: "data"
type: DATA
top: "data"
top: "label"
data_param {
source: "examples/project//val_lmdb"
backend: LMDB
batch_size: 50
}
transform_param {
crop_size: 227
mean_file: "examples/project/train_mean.binaryproto"
mirror: false
}
include: { phase: TEST }
}
and changed the accuracy layer to EUCLIDEAN_LOSS in train_val.prototxt
layers {
  name: "fc8"
  type: INNER_PRODUCT
  bottom: "fc7"
  top: "fc8"
  blobs_lr: 1
  blobs_lr: 2
  weight_decay: 1
  weight_decay: 0
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layers {
  name: "loss"
  type: EUCLIDEAN_LOSS
  bottom: "fc8"
  bottom: "label"
  top: "loss"
}
for deploying I used:
input: "data"
input_dim: 10
input_dim: 9
input_dim: 227
input_dim: 227
layers {
name: "conv1"
type: CONVOLUTION
bottom: "data"
top: "conv1"
convolution_param {
num_output: 96
kernel_size: 11
stride: 4
}
}
<rest the same>
layers {
name: "fc8"
type: INNER_PRODUCT
bottom: "fc7"
top: "fc8"
inner_product_param {
num_output: 1
}
}
I have
base_lr: 0.001
batch_size: 256 for train
batch_size: 50 for val
The rest is the same as the ImageNet network. I have the same loss behavior as @mender05: it decreased dramatically at first and then fluctuated until the end. I have not shuffled the data, and the labels are integers from 1 to 100. To test, I used the Matlab interface, i.e. I read 9 images, concatenated them together and used
scores = matcaffe_demo(imgFrames, 1);
As I am cropping the images, the result is a score vector of length 10, all entries having the same value, e.g. 71.4674, regardless of the input images. I also tried different snapshots of the network; the value changed a bit but is still the same for all crops and all images.
@mender05, were you able to solve your problem? Do you still get the same output for all images?
@sguada, am I doing all the steps right for the regression? I am going to shuffle my data, but I don't know whether the problem is due to shuffling or something else!
A possible explanation is that the model is not learning much; it probably got trapped in a local minimum which is similar to random weights.
Try changing the way you initialize the weights: change gaussian to xavier for the convolutional layers.
Thanks @sguada,
I changed the weight filler from gaussian to xavier, but it gives me nan loss even with a learning rate of 0.001. I've read that people decreased the lr to overcome nan loss; however, I am afraid that if I decrease the lr below 0.001 my network won't learn at all.
I will work on shuffling data and see if it changes anything.
BTW, I have about 2700 inputs (each has 9 images). Considering 10 crops for each input, the network is only trained with about 27000 inputs. Do you think that could be the reason for getting trapped in local minima?
Don't worry about decreasing the learning rate; it is relative to the magnitude of the loss, which in the case of Euclidean loss can be huge. And yes, having so few training inputs will cause overfitting problems.
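To illustrate why the loss scale matters (rough numbers assumed, not from the thread): Caffe's Euclidean loss is half the summed squared difference (averaged over the batch), so with 28-dim labels whose entries are in the hundreds, an untrained prediction near zero already gives a per-sample loss in the hundreds of thousands:

import numpy as np

label = np.full(28, 150.0)                  # a 28-dim label with entries around 150 (assumed)
pred = np.zeros(28)                         # an untrained network predicting roughly 0
loss = 0.5 * np.sum((pred - label) ** 2)    # per-sample Euclidean loss
print(loss)                                 # 315000.0 -- gradients are correspondingly large, so a small lr is reasonable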
All questions about usage, installation, code, and applications should be searched for and asked on the caffe-users mailing list.
@mender05 have you ever solved this problem? I am also running into it. I do not think it's related to hyperparameters.
@OnlySang I met the same problem recently while using AlexNet to train a 2-category classifier. When I used the model to test my images with the Python interface, I always got the same output. Finally, I set the output num of fc7 to 1000, and it became normal. I don't fully understand why, but I hope it's useful to you!
@OnlySang, I had the same problem. I decreased the learning rate and shuffled the data, and the problem was solved.
@wizardcsy Binary classification using AlexNet? I feel you are making things complicated; a smaller model may fit it.
@mollahosseini I tried what you have tried, but it didn't work. Thanks for your advice.
@OnlySang I agree with you. I do not think it's related to hyperparameters either.
@wizardcsy @mollahosseini I have not solved the problem, but after I changed the training and test datasets, the problem disappeared. In my experience, it is difficult to directly regress an image to a pose vector. Besides, you may try a simpler network, which is easier to train.
@StevenLOL This is my prediction code:
import numpy as np

###################
NUMBER = 1000
CHANNEL = 3
HEIGHT = 220
WIDTH = 220
###################
# read test image #
###################
...
# test[number, channel, height, width]
...
#############################
# predict using caffe model #
#############################
# make sure that caffe is on the python path
CAFFE_ROOT = '/home/mender/caffe-master/'
import sys
sys.path.insert(0, CAFFE_ROOT + 'python')
import caffe
# set path to test model file and trained model
MODEL_FILE = './deeppose_test.prototxt'
TRAINED_MODEL = './caffenet_train_iter_36500.caffemodel'
net = caffe.Net(MODEL_FILE, TRAINED_MODEL)
#net.set_phase_test()
data4D = np.ones([1, CHANNEL, HEIGHT, WIDTH])
data4DL = np.zeros([1, 14, 1, 1])
pred_normal = np.zeros([NUMBER, 14])
for n in range(0, NUMBER):
    data4D[0] = test[n]
    data4DL[0][0][0][0] = n
    net.set_input_arrays(
        data4D.astype(np.float32), data4DL.astype(np.float32))
    pred = net.forward()
    for c in range(0, 14):
        pred_normal[n][c] = pred['fc8'][0][c][0][0]
np.save('prediction_36500it.npy', pred_normal)
@mender05 When I use Theano and Lasagne (which you can find on GitHub), the regression converges. The main architecture of the network is the same, as well as the training pipeline. So why do different implementations give different results?
Hi, @mender05 thank you for posting the code.
@mender05 I have the same problem, and I checked the filter weights of a middle layer. It turns out that the filter weights are all 0. Do you know why?
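For anyone who wants to run the same check, a quick pycaffe sketch for inspecting layer weights (the file names here are placeholders):

import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'snapshot.caffemodel', caffe.TEST)
for name, blobs in net.params.items():
    w = blobs[0].data          # blobs[0] holds the weights, blobs[1] the biases (if any)
    print('{}: shape={} max|w|={:.4g} mean|w|={:.4g}'.format(
        name, w.shape, np.abs(w).max(), np.abs(w).mean()))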
Hi, I am also getting the same issue (filter weights are all 0). Did you find a solution? @mender05 can you share your full train.prototxt?
@mender05 did you ever find a solution? I'm having the same problem with periodicity and constant output...
Could you share the code that prepares the train data and test data? I also have the same problem. Thanks.
@mender05 could you share the code that prepares the train data and test data? I also have the same problem. Thanks.
@mender05 have you successfully implemented DeepPose? Could you share the code for data preparation?
@sguada what do you mean by "Don't worry about decreasing the learning rate, it is relative to the magnitude of the loss"? I find that a smaller lr leads to better convergence under some conditions, but theoretically a small lr may get stuck in a local minimum, couldn't it?
@mender05 Have you solved this problem? I'm doing regression with Caffe, and I suffer from the same problem as you: no matter what the input is, the output is always the same value. The only possibility I can think of is that the weights and biases of the network are 0.
Anyone who has solved this problem, please help.
@mender05, @sguada
did you manage to solve the problem by modifying the prototxts (train & test)? If yes, can you please share them?
The net seems similar to AlexNet, but there are subtle variations, and I run into the problems mentioned earlier by others as well. Effectively stuck! Any help would be greatly appreciated. Thanks :-)
I also have this issue when I use the C++ command-line interface for prediction.
@ginobilinie @kshalini
I've been doing regression from images and fixed the problem by scaling the pixel values down by 255 and subtracting the dataset mean (so the pixel values are now between [-1, 1]), and also by scaling the labels down so they were between [0, 1]. I also set all my new layer weights (I was transfer learning from AlexNet) to be initialized using the 'type: xavier' parameter.
Hope this helps!
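A minimal version of that preprocessing, assuming uint8 images and a precomputed dataset mean (the names below are illustrative, not from the thread):

import numpy as np

def preprocess(img_uint8, mean_img, label, label_max):
    # Scale pixels to [0, 1], then subtract the (already scaled) dataset mean -> roughly [-1, 1].
    x = img_uint8.astype(np.float32) / 255.0 - mean_img
    # Scale labels to [0, 1], e.g. divide joint coordinates by the image size.
    y = np.asarray(label, dtype=np.float32) / label_max
    return x, y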
@JoeMWatson
thanks for the post, but I didn't quite follow fully. My specific questions are: a) do you use LMDB or HDF5 for inputs? b) did you use the same train_val.prototxt as mentioned by @mender05? If not, can you please share yours for reference. c) finally, can you also share the few lines of Python code that interpret the output labels you get from the net?
Thanks
@JoeMWatson
Thanks. In fact, I've already scaled the labels to [0,1] and the input image data to [-1,1], but I still find that the predicted output value is the same. I analyzed the trained model and the test data in each blob (doing a forward pass), and I found that the bias dominates the output value, and the layer before the last layer usually goes to almost 0.
Hi, I have solved my problem. In my case, the problem came from the initialization of the network. I changed the weight filler from gaussian to xavier, and set the bias filler to constant 0. Then the problem was solved.
@ginobilinie Thanks, that did the trick.
Another solution here: I realized there was a lack of non-linearity in my model, so I added some more FC layers with ReLU activations and dropout, and then it performed better.
@ginobilinie I have the same problem. The learned weights are zero everywhere and the output is constant. I guess the bias dominates the net. How did you solve this problem? (I already changed the weight filler to xavier and set the bias filler to constant 0.) Shall I disable bias_term?
@lood339 In my case, when I set the bias to 0, then the training is normal... What about yours?
@ginobilinie When I set the bias to 0, it has the same problem. Then I changed the weight_decay to a small number (0.0005), and training became normal. I think that if the weight_decay is large (like 0.5), all the weights eventually become zero in my case. I made another modification that helped: I set the learning rate of the convolutional layers to 0 because I transfer weights from a pre-trained model, so the weights in the convolutional layers won't change during training.
@lood339 Good. I always set the weight_decay very small. If you just want to fine-tune some layers, you should set the learning rate of other layers to be 0.
@ginobilinie But why doesn't std: 0.01 work? Why is the loss periodic?
I am trying to use Caffe to implement DeepPose, proposed in this paper: http://arxiv.org/abs/1312.4659. DeepPose has 3 stages, and each stage is almost the same as AlexNet (DeepPose changes the loss layer of AlexNet to a Euclidean loss). It is in fact a regression problem.
The train.prototxt is:
The solve.prototxt is:
After training completed, I use the Python interface to do prediction on the test set. The test.prototxt is:
But the output is very strange. Dumping the output of the "fc8" layer, I find that all the images produce the same output:
In fact, no matter what the inputs are, the outputs are always the same values as above. What causes this problem?