TheLegendAli / DeepLab-Context


DeepLabV2-ResNet101 - loss going up during training #13

Closed JustinLiang closed 7 years ago

JustinLiang commented 7 years ago

I set up the DeepLabV2-ResNet101 model from the bitbucket code (so not using this python implementation), and during training I notice that the loss is going up. As I run the training for longer, the loss approaches 350. I was wondering if anyone has any idea what could be causing this. Here is my run_pascal.sh script; I did not modify anything past the ## Training #1 (on train_aug) comment:

#!/bin/sh

## MODIFY PATH for YOUR SETTING
ROOT_DIR=/deeplab

CAFFE_DIR=../code
CAFFE_BIN=${CAFFE_DIR}/.build_release/tools/caffe.bin

EXP=voc12

if [ "${EXP}" = "voc12" ]; then
    NUM_LABELS=21
    DATA_ROOT=${ROOT_DIR}/data/VOCdevkit/VOC2012
else
    NUM_LABELS=0
    echo "Wrong exp name"
fi

## Specify which model to train
########### voc12 ################
NET_ID=deeplabv2_resnet101

## Variables used for weakly or semi-supervised training
#TRAIN_SET_SUFFIX=
TRAIN_SET_SUFFIX=_aug

TRAIN_SET_STRONG=train
#TRAIN_SET_STRONG=train200
#TRAIN_SET_STRONG=train500
#TRAIN_SET_STRONG=train1000
#TRAIN_SET_STRONG=train750

TRAIN_SET_WEAK_LEN=0 #5000

DEV_ID=0

#####

## Create dirs

CONFIG_DIR=${EXP}/config/${NET_ID}
MODEL_DIR=${EXP}/model/${NET_ID}
mkdir -p ${MODEL_DIR}
LOG_DIR=${EXP}/log/${NET_ID}
mkdir -p ${LOG_DIR}
export GLOG_log_dir=${LOG_DIR}

## Run

RUN_TRAIN=1
RUN_TEST=0
RUN_TRAIN2=0
RUN_TEST2=0
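
For reference, a minimal sketch for pulling the reported loss out of the glog files that the script writes to LOG_DIR; the log filename is an assumption (glog usually creates a caffe.bin.INFO symlink, but the exact name depends on your setup):

## Hedged sketch: print the most recent loss values the solver reported.
## Assumes glog created a caffe.bin.INFO symlink in ${LOG_DIR}; adjust the
## filename if your glog build names its files differently.
grep "Iteration .*, loss = " ${LOG_DIR}/caffe.bin.INFO | tail -n 20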

Furthermore, I have the SegmentationClassAug ground truth images from your dropbox link (https://www.dropbox.com/s/oeu149j8qtbs1x0/SegmentationClassAug.zip?dl=0) in the data/VOCdevkit/VOC2012/ folder.
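
As a quick sanity check (a hedged sketch; the list-file path and the expectation that labels resolve under /SegmentationClassAug/ are assumptions based on the DeepLab voc12 lists), you can confirm the augmented labels sit where the train_aug list files expect them:

## Hedged sanity check: the train_aug list files pair each JPEG with a label
## under /SegmentationClassAug/, resolved against DATA_ROOT.
DATA_ROOT=/deeplab/data/VOCdevkit/VOC2012
ls ${DATA_ROOT}/SegmentationClassAug | wc -l   # count the augmented label PNGs
head -n 3 voc12/list/train_aug.txt             # assumed list location; lines should reference /SegmentationClassAug/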

Here is what the training looks like:

I1127 01:09:18.232293 31928 net.cpp:270] This network produces output accuracy
I1127 01:09:18.232307 31928 net.cpp:270] This network produces output accuracy_res05
I1127 01:09:18.232322 31928 net.cpp:270] This network produces output accuracy_res075
I1127 01:09:18.232336 31928 net.cpp:270] This network produces output accuracy_res1
I1127 01:09:18.348408 31928 net.cpp:283] Network initialization done.
I1127 01:09:18.355520 31928 solver.cpp:60] Solver scaffolding done.
I1127 01:09:18.399452 31928 caffe.cpp:129] Finetuning from voc12/model/deeplabv2_resnet101/init.caffemodel
I1127 01:09:19.118443 31928 net.cpp:816] Ignoring source layer fc1_coco
I1127 01:09:19.118538 31928 net.cpp:816] Ignoring source layer fc1_coco_fc1_coco_0_split
I1127 01:09:19.132285 31928 caffe.cpp:219] Starting Optimization
I1127 01:09:19.132372 31928 solver.cpp:280] Solving deeplabv2_resnet101
I1127 01:09:19.132383 31928 solver.cpp:281] Learning Rate Policy: poly
I1127 01:09:25.117411 31928 solver.cpp:229] Iteration 0, loss = 261.803
I1127 01:09:25.117547 31928 solver.cpp:245]     Train net output #0: accuracy = 0.0291829
I1127 01:09:25.117589 31928 solver.cpp:245]     Train net output #1: accuracy = 0.0872093
I1127 01:09:25.117614 31928 solver.cpp:245]     Train net output #2: accuracy = 0.66929
I1127 01:09:25.117645 31928 solver.cpp:245]     Train net output #3: accuracy_res05 = 0.046683
I1127 01:09:25.117672 31928 solver.cpp:245]     Train net output #4: accuracy_res05 = 0.140741
I1127 01:09:25.117697 31928 solver.cpp:245]     Train net output #5: accuracy_res05 = 0.669494
I1127 01:09:25.117720 31928 solver.cpp:245]     Train net output #6: accuracy_res075 = 0.035668
I1127 01:09:25.117766 31928 solver.cpp:245]     Train net output #7: accuracy_res075 = 0.106589
I1127 01:09:25.117786 31928 solver.cpp:245]     Train net output #8: accuracy_res075 = 0.669965
I1127 01:09:25.117815 31928 solver.cpp:245]     Train net output #9: accuracy_res1 = 0.0486381
I1127 01:09:25.117841 31928 solver.cpp:245]     Train net output #10: accuracy_res1 = 0.145349
I1127 01:09:25.117866 31928 solver.cpp:245]     Train net output #11: accuracy_res1 = 0.627653
I1127 01:09:25.117985 31928 sgd_solver.cpp:106] Iteration 0, lr = 0.00025
I1127 01:10:54.876021 31928 solver.cpp:229] Iteration 20, loss = 336.749
I1127 01:10:54.877192 31928 solver.cpp:245]     Train net output #0: accuracy = 0
I1127 01:10:54.877221 31928 solver.cpp:245]     Train net output #1: accuracy = 0
I1127 01:10:54.877240 31928 solver.cpp:245]     Train net output #2: accuracy = 0.857143
I1127 01:10:54.877261 31928 solver.cpp:245]     Train net output #3: accuracy_res05 = 0
I1127 01:10:54.877290 31928 solver.cpp:245]     Train net output #4: accuracy_res05 = 0
I1127 01:10:54.877307 31928 solver.cpp:245]     Train net output #5: accuracy_res05 = 0.857143
I1127 01:10:54.877341 31928 solver.cpp:245]     Train net output #6: accuracy_res075 = 0
I1127 01:10:54.877368 31928 solver.cpp:245]     Train net output #7: accuracy_res075 = 0
I1127 01:10:54.877393 31928 solver.cpp:245]     Train net output #8: accuracy_res075 = 0.857143
I1127 01:10:54.877419 31928 solver.cpp:245]     Train net output #9: accuracy_res1 = 0
I1127 01:10:54.877435 31928 solver.cpp:245]     Train net output #10: accuracy_res1 = 0
I1127 01:10:54.877454 31928 solver.cpp:245]     Train net output #11: accuracy_res1 = 0.857143
I1127 01:10:54.877490 31928 sgd_solver.cpp:106] Iteration 20, lr = 0.000249775
JustinLiang commented 7 years ago

It appears that there was some sort of dependency issue with the server I was running it on.

mjohn123 commented 7 years ago

Hi JustinLiang, I am also running ResNet-101. It runs well during training without errors. However, when I start running the testing, it fails with an out-of-memory error:

I0101 00:03:58.796654 15999 caffe.cpp:252] Running for 1449 iterations.
F0101 00:03:59.323478 15999 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x7fdd0dd67daa  (unknown)
    @     0x7fdd0dd67ce4  (unknown)
    @     0x7fdd0dd676e6  (unknown)
    @     0x7fdd0dd6a687  (unknown)
    @     0x7fdd0e3cadb8  caffe::SyncedMemory::to_gpu()
    @     0x7fdd0e3c9e89  caffe::SyncedMemory::mutable_gpu_data()
    @     0x7fdd0e392f62  caffe::Blob<>::mutable_gpu_data()
    @     0x7fdd0e47c008  caffe::BaseConvolutionLayer<>::forward_gpu_gemm()
    @     0x7fdd0e59c5c6  caffe::ConvolutionLayer<>::Forward_gpu()
    @     0x7fdd0e3a4dc2  caffe::Net<>::ForwardFromTo()
    @     0x7fdd0e3a4ed7  caffe::Net<>::ForwardPrefilled()
    @           0x4074f7  test()
    @           0x405c88  main
    @     0x7fdd0d378f45  (unknown)
    @           0x406327  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

How much memory do we need to run the testing model? I do not know why training can run but testing cannot. I am using a Titan X Pascal with 12 GB.
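
One way to see how close the forward pass gets to the 12 GB limit is to watch GPU memory while the test binary runs; a minimal sketch, assuming nvidia-smi is available:

## Hedged sketch: poll GPU memory usage once per second while caffe.bin runs
## in test mode, to see the peak relative to the card's 12 GB.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv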

amlarraz commented 7 years ago

+1 any solution?

JustinLiang commented 7 years ago

mjohn123, are your training and testing images the same size? What size are they?

mjohn123 commented 7 years ago

@JustinLiang, I am using the original DeepLab code. You are right, the training size is 321 but the testing size is 531. Can I reduce it? The original images are larger than 500.
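
If the test-time input size is set through a crop_size field in the test prototxt (as in the DeepLab prototxts), one hedged way to shrink it is a simple sed edit; the path and values below are assumptions to adapt to your own files:

## Hedged sketch: reduce the test-time crop from 531 to 321 in the test
## prototxt. The path is an assumption; note that images larger than the new
## crop will be cut down at test time, which can hurt the reported accuracy.
sed -i 's/crop_size: 531/crop_size: 321/g' voc12/config/deeplabv2_resnet101/test.prototxt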

lee2430 commented 7 years ago

@JustinLiang I've run into the same 'out of memory' problem. I've also tried forwarding a single image of very small size, and the error was still thrown. Is there any solution?

amlarraz commented 7 years ago

@lee2430 I got it to work on a machine with four Titan X GPUs (12 GB each), but that is too expensive. I recommend you try it in TensorFlow, which manages memory better...

lee2430 commented 7 years ago

@amlarraz Thanks for your advice. I only saw your comment from 20 days ago today, but I've already been learning TensorFlow as you suggested, since I'm a newbie to it :). Glad to know I'm on the right path.

realwecan commented 7 years ago

@mjohn123 @amlarraz @JustinLiang @lee2430 Sorry for interrupting, guys. I have run into the same "out of memory" issue. I'm just wondering if any of you were able to run the tests with ResNet-101 on a machine with one TITAN X GPU. In addition, is it OK for me to simply reduce the test image dimension from 531 (in the test prototxt) to 321 (the same as in the training prototxt)? If I reduce the image dimension, is there anything else I should also change? Thanks!

lee2430 commented 7 years ago

@realwecan Hi, I tried a TensorFlow version and it now works smoothly: https://github.com/DrSleep/tensorflow-deeplab-resnet
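
For anyone who wants to try that route, a hedged sketch of getting the port set up (the requirements file and script name are assumptions based on that repository's layout and may have changed):

## Hedged sketch: clone the TensorFlow port mentioned above; check its README
## for the exact training/evaluation commands.
git clone https://github.com/DrSleep/tensorflow-deeplab-resnet.git
cd tensorflow-deeplab-resnet
pip install -r requirements.txt   # assumed to exist in the repo
python train.py --help            # assumed entry point; lists the training options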

cslxiao commented 7 years ago

Sorry to bother you @JustinLiang. I am wondering why the training log prints 12 output values while the network only produces 4 outputs. Why are there two 0s before the true accuracy?