Boyiliee / AOD-Net

AOD-Net (Pytorch & Caffe)

issue with training AOD-Net #5

Closed: YogeshShitole closed this issue 5 years ago

YogeshShitole commented 6 years ago

Hi @Boyiliee, while training AOD-Net on the NYU2 database, the loss is not converging.

I prepared a training script and am training the network on the NYU2 database with 27,256 images, as mentioned in your paper. I used your test_template.prototxt as Training.prototxt, and I train for 150000 iterations, roughly (27,567 / batch size 8) * 40 epochs. The prototxt uses a Euclidean loss layer. The loss does not converge over the course of training; it stays at around 15000 to 16000 until training completes.

I also ran inference with the trained model_iter_150000.caffemodel instead of your AOD_Net.caffemodel, but it produces a completely white image instead of a dehazed image.
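The iteration estimate above can be checked with a quick calculation. This is only an arithmetic sketch using the figures quoted in the post (27,567 images, batch size 8, 40 epochs, 150000 iterations); the variable names are illustrative.

```python
# Figures quoted in the post; the variable names are illustrative.
num_images = 27567
batch_size = 8
epochs = 40
max_iter = 150000

iters_per_epoch = num_images / float(batch_size)
iters_for_40_epochs = iters_per_epoch * epochs
epochs_at_max_iter = max_iter / iters_per_epoch

print(iters_for_40_epochs)   # iterations needed for 40 epochs
print(epochs_at_max_iter)    # epochs actually covered by max_iter
```

With these figures, 40 epochs correspond to 137835 iterations, so 150000 iterations cover a little over 43 epochs.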

Below is my solver.prototxt content:

net: "Training.prototxt"
base_lr: 0.001
lr_policy: "fixed"
display: 20
max_iter: 150000
momentum: 0.9
weight_decay: 0.0001
snapshot: 15000
snapshot_prefix: "models/model"
solver_mode: GPU
type: "SGD"

Training.prototxt is the same as test_template.prototxt, with the following modifications: for the data/label layers, input_dim: 1 was changed to input_dim: {batchSize}, and for each conv layer I added a weight_filler to convolution_param: weight_filler {{ type: "gaussian" }}
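The doubled braces in the weight_filler line suggest that the template is filled in with Python's str.format, where {{ and }} produce literal braces and {batchSize} is substituted. A minimal sketch under that assumption; the two-line template fragment below is made up for illustration and is not the repository's prototxt.

```python
# Hypothetical template fragment; not the actual train_template.prototxt.
template = (
    'input_dim: {batchSize}\n'
    'weight_filler {{ type: "gaussian" }}\n'
)

# str.format replaces {batchSize} and turns {{ }} into single braces.
rendered = template.format(batchSize=8)
print(rendered)
```

If the template were written with single braces around the filler, str.format would try to treat them as a placeholder and fail, which is why the escaping is needed.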

My train script

import os
import numpy as np
from pylab import *
import re
import random
import cv2
print cv2.__version__
import ntpath
ntpath.basename('a/b/c')

def path_leaf(path):
    head, tail = ntpath.split(path)
    return tail or ntpath.basename(head)

Train_DIR = '../data/AODtrain/training/'
Label_DIR = '../data/AODtrain/original/'

# Network Training parameters for input Image data
height = 480
width = 640
batch = 8  # batch size

import sys
sys.path.append("/home/ubuntu/Tools/caffe/python/")
import caffe

def EditFcnProto(templateFile, height, width, batch_size):
    with open(templateFile, 'r') as ft:
        template =

        print templateFile

        outFile = 'Training.prototxt'
        with open(outFile, 'w') as fd:

def createBatch(img_dir, label_dir, batch_size):
    batchdata = []
    labelbatchdata = []
    for i in range(batch_size):
        fname = random.choice(os.listdir(img_dir))
        imagepath = Train_DIR + fname

        print fname

        # print imagepath
        labelpath = label_dir + fname.split('_')[0] + '_' + fname.split('_')[1] + '.jpg'
        # print labelpath

        npstore =
        labelstore =

        data = npstore
        data = data.transpose((2, 0, 1))
        label = labelstore
        label = label.transpose((2, 0, 1))
    return batchdata, labelbatchdata
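For reference, a batch for a Caffe input blob is normally an array of shape (N, C, H, W). The sketch below shows one way to assemble such a batch from height x width x channel images, including the division by 255 discussed later in this thread. It uses random arrays in place of image files, and the helper name make_batch is hypothetical, not part of the script above.

```python
import numpy as np

def make_batch(images):
    """Stack HxWxC uint8 images into an NxCxHxW float32 array in [0, 1]."""
    batch = []
    for img in images:
        chw = img.astype(np.float32) / 255.0   # scale 0..255 to 0..1
        chw = chw.transpose((2, 0, 1))          # HWC -> CHW
        batch.append(chw)
    return np.stack(batch, axis=0)

# Random stand-ins for eight 480x640 colour images (assumption: 3 channels).
rng = np.random.RandomState(0)
images = [rng.randint(0, 256, size=(480, 640, 3)).astype(np.uint8)
          for _ in range(8)]
batch = make_batch(images)
print(batch.shape, batch.dtype)
```

The same helper would be applied to both the hazy inputs and the clean labels so that the two sides of the loss share one value range.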

def train():
    caffe.set_mode_gpu()
    caffe.set_device(0)

    # training.prototxt is same as test_template.prototxt with only modification for data/label layer is
    # input_dim: 1 -----> input_dim: {batchSize}
    templateFile = 'train_template.prototxt'
    EditFcnProto(templateFile, height, width, batch)

    solver = caffe.SGDSolver('solver.prototxt')
    # solver = caffe.get_solver('solver.prototxt')

    niter = 150000
    train_loss = np.zeros(niter)

    f = open('loss.txt', 'w')

    for it in range(niter):
        batchdata, labelbatchdata = createBatch(Train_DIR, Label_DIR, batch)
        solver.net.blobs['data'].data[...] = batchdata
        solver.net.blobs['label'].data[...] = labelbatchdata
        train_loss[it] = solver.net.blobs['loss'].data
        f.write('{0: d} '.format(it))
        f.write('{0: f}\n'.format(train_loss[it]))


if __name__ == '__main__':
    train()

snapshot of output

I0530 15:41:02.473088 2152 sgd_solver.cpp:112] Iteration 51940, lr = 0.001
I0530 15:41:07.138154 2152 solver.cpp:239] Iteration 51960 (4.28732 iter/s, 4.66492s/20 iters), loss = 157563
I0530 15:41:07.138191 2152 solver.cpp:258] Train net output #0: loss = 157563 (* 1 = 157563 loss)
I0530 15:41:07.138197 2152 sgd_solver.cpp:112] Iteration 51960, lr = 0.001
I0530 15:41:11.762470 2152 solver.cpp:239] Iteration 51980 (4.32514 iter/s, 4.62412s/20 iters), loss = 156092
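To follow the loss over time, the values can be pulled out of log lines like the ones above. This is a small sketch that assumes the log keeps the "Iteration N (...), loss = X" format shown in the snapshot; the regular expression is written for that format only.

```python
import re

# Three lines copied from the snapshot above.
log = """I0530 15:41:07.138154 2152 solver.cpp:239] Iteration 51960 (4.28732 iter/s, 4.66492s/20 iters), loss = 157563
I0530 15:41:07.138197 2152 sgd_solver.cpp:112] Iteration 51960, lr = 0.001
I0530 15:41:11.762470 2152 solver.cpp:239] Iteration 51980 (4.32514 iter/s, 4.62412s/20 iters), loss = 156092"""

# Match only the lines that report a loss; the learning-rate lines are skipped.
pattern = re.compile(r"Iteration (\d+) \(.*?\), loss = ([0-9.eE+-]+)")
losses = [(int(m.group(1)), float(m.group(2))) for m in pattern.finditer(log)]

for iteration, loss in losses:
    print(iteration, loss)
```

Plotting or printing these pairs makes it easier to see whether the loss is trending down or only fluctuating.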

I am confused about what is going wrong here: the loss is not converging. Please tell me what I am doing wrong and suggest how to proceed. I am trying to reproduce your paper as a learning exercise.


asfix commented 6 years ago

I am also interested in this issue. I hope the authors reply and share guidelines for training.

asfix commented 6 years ago

@YogeshShitole did you solve the problem? I think the problem is related either to passing pixel values higher than 1 to the network or to displaying the output without rescaling. That is, the pixel values must be in the range 0 to 1; the algorithm then amplifies them when the image is displayed. If you see only white images, the values must be divided by 255.
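The effect of the pixel range on a Euclidean loss, and the rescaling needed before display, can be illustrated with NumPy alone. This sketch assumes the usual Caffe definition of the Euclidean loss (sum of squared differences divided by 2N); the arrays are random stand-ins rather than NYU2 images, and the function name is illustrative.

```python
import numpy as np

def euclidean_loss(pred, target):
    """Sum of squared differences over the batch, divided by 2N."""
    n = pred.shape[0]
    return np.sum((pred - target) ** 2) / (2.0 * n)

rng = np.random.RandomState(0)
hazy = rng.randint(0, 256, size=(2, 3, 4, 4)).astype(np.float64)
clean = rng.randint(0, 256, size=(2, 3, 4, 4)).astype(np.float64)

loss_255 = euclidean_loss(hazy, clean)                     # pixels in 0..255
loss_unit = euclidean_loss(hazy / 255.0, clean / 255.0)    # pixels in 0..1
ratio = loss_255 / loss_unit                               # close to 255 ** 2
print(ratio)

# An image stored in 0..1 is scaled back to 0..255 before saving as 8-bit.
output = hazy / 255.0
as_uint8 = np.clip(output * 255.0, 0, 255).round().astype(np.uint8)
```

The same pair of images gives a loss about 65025 times larger when the pixels are left in the 0 to 255 range, which is one reason unscaled inputs can produce very large loss values.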

Boyiliee commented 6 years ago

Sorry for the late reply. Thanks for your interest in AOD-Net. @asfix Yes, you are right.

asfix commented 6 years ago

Hi @Boyiliee, could you please share your email address? Thank you.

Boyiliee commented 6 years ago

You can find it on my website, along with more details about AOD-Net. Thanks.