NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

About the loss between caffe and nvcaffe #496

Closed LCLL closed 6 years ago

LCLL commented 6 years ago

Hi, I am testing NVCaffe on my project. I used 8 GPUs for image-recognition training and found that NVCaffe is almost 10x faster than the original Caffe. However, I found that the loss of NVCaffe is not similar to that of Caffe.

caffe around Iteration 10000:

```
I0325 11:09:58.038851  3644 solver.cpp:258]     Train net output #0: loss = 0.0244867 (* 1 = 0.0244867 loss)
I0325 11:09:58.038877  3644 sgd_solver.cpp:112] Iteration 10000, lr = 0.005
I0325 11:10:27.458719  3644 solver.cpp:239]     Iteration 10050 (1.6996 iter/s, 29.4188s/50 iters), loss = 0.00117858
I0325 11:10:27.458811  3644 solver.cpp:258]     Train net output #0: loss = 0.00117854 (* 1 = 0.00117854 loss)
I0325 11:10:27.458863  3644 sgd_solver.cpp:112] Iteration 10050, lr = 0.005
I0325 11:10:29.678782  3698 blocking_queue.cpp:49] Waiting for data
I0325 11:10:58.097903  3644 solver.cpp:239]     Iteration 10100 (1.63196 iter/s, 30.6379s/50 iters), loss = 0.013431
I0325 11:10:58.097993  3644 solver.cpp:258]     Train net output #0: loss = 0.013431 (* 1 = 0.013431 loss)
I0325 11:10:58.098022  3644 sgd_solver.cpp:112] Iteration 10100, lr = 0.005
I0325 11:11:29.092669  3644 solver.cpp:239]     Iteration 10150 (1.61324 iter/s, 30.9935s/50 iters), loss = 0.00049985
I0325 11:11:29.093052  3644 solver.cpp:258]     Train net output #0: loss = 0.00049982 (* 1 = 0.00049982 loss)
I0325 11:11:29.093111  3644 sgd_solver.cpp:112] Iteration 10150, lr = 0.005
I0325 11:11:58.154436  3644 solver.cpp:239]     Iteration 10200 (1.72056 iter/s, 29.0603s/50 iters), loss = 0.00185934
I0325 11:11:58.154525  3644 solver.cpp:258]     Train net output #0: loss = 0.00185931 (* 1 = 0.00185931 loss)
```

NVCaffe around Iteration 10000:

```
I0411 11:21:44.974851 35133 solver.cpp:350]     Iteration 10000 (0.877172 iter/s, 57.0013s/50 iter), loss = 0.387026
I0411 11:21:44.974885 35133 solver.cpp:374]     Train net output #0: loss = 0.387027 (* 1 = 0.387027 loss)
I0411 11:21:44.974895 35133 sgd_solver.cpp:172] Iteration 10000, lr = 0.005, m = 0.9, wd = 0.0005, gs = 1
I0411 11:21:48.221772 35133 solver.cpp:350]     Iteration 10050 (15.3998 iter/s, 3.24679s/50 iter), loss = 0.384699
I0411 11:21:48.221846 35133 solver.cpp:374]     Train net output #0: loss = 0.384699 (* 1 = 0.384699 loss)
I0411 11:21:48.221863 35133 sgd_solver.cpp:172] Iteration 10050, lr = 0.005, m = 0.9, wd = 0.0005, gs = 1
I0411 11:21:51.362139 35133 solver.cpp:350]     Iteration 10100 (15.9224 iter/s, 3.14024s/50 iter), loss = 1.12345
I0411 11:21:51.362215 35133 solver.cpp:374]     Train net output #0: loss = 1.12345 (* 1 = 1.12345 loss)
I0411 11:21:51.362233 35133 sgd_solver.cpp:172] Iteration 10100, lr = 0.005, m = 0.9, wd = 0.0005, gs = 1
I0411 11:21:54.840152 35133 solver.cpp:350]     Iteration 10150 (14.3766 iter/s, 3.47788s/50 iter), loss = 0.412616
I0411 11:21:54.840212 35133 solver.cpp:374]     Train net output #0: loss = 0.412616 (* 1 = 0.412616 loss)
I0411 11:21:54.840224 35133 sgd_solver.cpp:172] Iteration 10150, lr = 0.005, m = 0.9, wd = 0.0005, gs = 1
I0411 11:21:58.699890 35133 solver.cpp:350]     Iteration 10200 (12.9548 iter/s, 3.85958s/50 iter), loss = 1.11116
```

I used the same train_val and solver files but got a different result. I wonder if I missed something when using NVCaffe?

Thanks.

drnikolaev commented 6 years ago

Hi @LCLL - it's hard to say without seeing the prototxt files. Could you attach them? And the complete NVCaffe log, please.

LCLL commented 6 years ago

@drnikolaev Can you give me some hints or suggestions about this problem? Thanks.

drnikolaev commented 6 years ago

@LCLL not yet, still digging...

LCLL commented 6 years ago

@drnikolaev Can you reproduce the results? Or is there anything I need to provide to help with your digging? :)

LCLL commented 6 years ago

@drnikolaev Any progress? Is there something I can help with?

drnikolaev commented 6 years ago

@LCLL May I have the complete log attached, please?

drnikolaev commented 6 years ago

@LCLL Thank you. You are running batch size 1 per GPU for NVCaffe but 8 per GPU for Caffe. Please re-try with everything identical and compare.
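To see why the two runs differ, the arithmetic can be sketched as follows (a minimal illustration assuming 8 GPUs and `batch_size: 8` in both prototxts, as described in this thread):

```python
# Effective (global) batch size comparison between BVLC Caffe and NVCaffe,
# assuming 8 GPUs and batch_size: 8 in both training prototxts.
num_gpus = 8
prototxt_batch = 8

# BVLC Caffe treats batch_size as per-GPU, so the effective batch
# is multiplied by the number of GPUs.
caffe_effective = prototxt_batch * num_gpus    # 8 * 8 = 64

# NVCaffe treats batch_size as the global size across all GPUs,
# so each GPU only processes batch_size / num_gpus samples.
nvcaffe_effective = prototxt_batch             # 8
nvcaffe_per_gpu = prototxt_batch // num_gpus   # 8 // 8 = 1

print(caffe_effective, nvcaffe_effective, nvcaffe_per_gpu)
```

An effective batch of 8 instead of 64 means far noisier gradients at the same learning rate, which is consistent with the higher, jumpier NVCaffe loss in the logs above.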

LCLL commented 6 years ago

@drnikolaev I set batch_size to 8 in both training prototxts. Is there any difference between NVCaffe and Caffe for this parameter?

drnikolaev commented 6 years ago

@LCLL Yes, there is a difference. In NVCaffe, batch_size is the global size across all GPUs, so to match 8 per GPU on 8 GPUs you should set 64.
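As a sketch, the data layer in the NVCaffe train_val prototxt would look something like this (an illustrative fragment, not the actual file from this issue; layer names and the data source are placeholders):

```
# NVCaffe: batch_size is the GLOBAL size across all GPUs.
# With 8 GPUs, batch_size: 64 gives 8 samples per GPU,
# matching BVLC Caffe's per-GPU batch_size: 8.
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "examples/my_dataset_train_lmdb"  # placeholder path
    batch_size: 64   # would be 8 in the BVLC Caffe prototxt
    backend: LMDB
  }
}
```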

LCLL commented 6 years ago

OK, got it. Thanks for your help:)