BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Caffe hang when creating data layer #3965

Open royitaqi opened 8 years ago

royitaqi commented 8 years ago

I created the simplest possible net to learn the division "/" function (the inputs are A and B, and the label is A/B). However, when I try to run the trainer, it hangs forever. If I do killall caffe, I can see from the stack trace that it's waiting on a BlockingQueue. I searched around and found it mentioned somewhere (I didn't note down the source) that this can be caused by the training and testing phases sharing the same lmdb, so I copied the same data into separate training and testing folders, but the problem persists.

I'm wondering why it hangs, and how I should debug this problem.

Here is the console output:

[tw-mbp-rshi playgit]$ caffe train --solver solver.prototxt
I0408 21:57:11.489527 1949106944 caffe.cpp:178] Use CPU.
I0408 21:57:11.493430 1949106944 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 2
base_lr: 0.01
display: 1
max_iter: 100
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5
snapshot_prefix: "snapshot"
solver_mode: CPU
net: "net.prototxt"
I0408 21:57:11.494869 1949106944 solver.cpp:91] Creating training net from net file: net.prototxt
I0408 21:57:11.495998 1949106944 net.cpp:313] The NetState phase (0) differed from the phase (1) specified by a rule in layer testing
I0408 21:57:11.496026 1949106944 net.cpp:313] The NetState phase (0) differed from the phase (1) specified by a rule in layer testing_label
I0408 21:57:11.496052 1949106944 net.cpp:49] Initializing net from parameters:
state {
  phase: TRAIN
}
layer {
  name: "training"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training"
    backend: LMDB
  }
}
layer {
  name: "training_label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training_label"
    batch_size: 1
    backend: LMDB
  }
}
layer {
  name: "full"
  type: "InnerProduct"
  bottom: "data"
  top: "full"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "full"
  bottom: "label"
  top: "loss"
}
I0408 21:57:11.496322 1949106944 layer_factory.hpp:77] Creating layer training
I0408 21:57:11.503118 1949106944 net.cpp:91] Creating Layer training
I0408 21:57:11.503237 1949106944 net.cpp:399] training -> data
I0408 21:57:11.504497 186691584 db_lmdb.cpp:38] Opened lmdb training
*** Aborted at 1460178183 (unix time) try "date -d @1460178183" if you are using GNU date ***
PC: @     0x7fff8f110136 __psynch_cvwait
*** SIGTERM (@0x7fff8f110136) received by PID 6373 (TID 0x7fff742d0300) stack trace: ***
    @     0x7fff89d17f1a _sigtramp
    @     0x7fff5850c620 (unknown)
    @        0x10784869b boost::condition_variable::wait()
    @        0x107849687 caffe::BlockingQueue<>::peek()
    @        0x1077b6f46 caffe::DataLayer<>::DataLayerSetUp()
    @        0x1077a640e caffe::BasePrefetchingDataLayer<>::LayerSetUp()
    @        0x1078148e7 caffe::Net<>::Init()
    @        0x107813385 caffe::Net<>::Net()
    @        0x10782f090 caffe::Solver<>::InitTrainNet()
    @        0x10782e3e7 caffe::Solver<>::Init()
    @        0x10782e0de caffe::Solver<>::Solver()
    @        0x10783e8a8 caffe::SGDSolver<>::SGDSolver()
    @        0x107844182 caffe::Creator_SGDSolver<>()
    @        0x1076f3137 train()
    @        0x1076f5721 main
    @     0x7fff90c165c9 start
Terminated: 15
[tw-mbp-rshi playgit]$

Here is my solver.prototxt:

# The train/test net protocol buffer definition
net: "net.prototxt"

# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 1

# Carry out testing every 500 training iterations.
test_interval: 2

# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005

# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75

# Display every 100 iterations
display: 1

# The maximum number of iterations
max_iter: 100

# snapshot intermediate results
snapshot: 5
snapshot_prefix: "snapshot"

# solver mode: CPU or GPU
solver_mode: CPU

Here is my net.prototxt:


layer {
  name: "training"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training"
    backend: LMDB
  }
}

layer {
  name: "testing"
  type: "Data"
  top: "data"
  include {
    phase: TEST
  }
  data_param {
    source: "testing"
    backend: LMDB
  }
}

layer {
  name: "training_label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training_label"
    batch_size: 1
    backend: LMDB
  }
}

layer {
  name: "testing_label"
  type: "Data"
  top: "label"
  include {
    phase: TEST
  }
  data_param {
    source: "testing_label"
    batch_size: 1
    backend: LMDB
  }
}

layer {
  name: "full"
  type: "InnerProduct"
  # learning rate and decay multipliers for the weights
  param { lr_mult: 1 decay_mult: 1 }
  # learning rate and decay multipliers for the biases
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  bottom: "data"
  top: "full"
}

layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "full"
  bottom: "label"
  top: "loss"
}

Here is how I generated the training and label data:

import numpy as np
import lmdb
import caffe
import random

N = 100

# Let's pretend this is interesting data
X = np.zeros((N, 2, 1, 1), dtype=np.float)
y = np.zeros(N, dtype=np.float)

random.seed(0)

for i in range(0, N):
    X[i,0,0,0] = random.uniform(8, 10)
    X[i,1,0,0] = random.uniform(6, 8)
    y[i] = X[i,0,0,0] / X[i,1,0,0]

with lmdb.open('training', map_size=int(1e12)) as db:
    with db.begin(write=True) as transaction:
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = X.shape[1]
            datum.height = X.shape[2]
            datum.width = X.shape[3]
            datum.data = X[i].tobytes()
            str_id = '{:08}'.format(i)
            #
            # The encode is only essential in Python 3
            transaction.put(str_id.encode('ascii'), datum.SerializeToString())

with lmdb.open('label', map_size=int(1e12)) as db:
    with db.begin(write=True) as transaction:
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = 1
            datum.height = 1
            datum.width = 1
            datum.data = y[i].tobytes()
            str_id = '{:08}'.format(i)
            #
            # The encode is only essential in Python 3
            transaction.put(str_id.encode('ascii'), datum.SerializeToString())
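
To rule out problems with the lmdb contents themselves, a quick read-back check along these lines can help (just a sketch, assuming the databases sit in the working directory under the names used above):

import lmdb
import numpy as np
import caffe

# Open the database read-only and decode each entry back into a Datum.
with lmdb.open('training', readonly=True) as db:
    with db.begin() as transaction:
        for key, value in transaction.cursor():
            datum = caffe.proto.caffe_pb2.Datum()
            datum.ParseFromString(value)
            # The writer stored raw float64 bytes, so decode them the same way.
            data = np.frombuffer(datum.data, dtype=np.float64)
            print(key, datum.channels, datum.height, datum.width, data)
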
royitaqi commented 8 years ago

The problem appears to be solved by adding "batch_size: 1" to both the training and testing data layers.

But I'm still not sure why adding this prevents the hang. Any insight from you guys would be helpful!
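
For reference, the corrected "training" data layer then reads as follows (the "testing" layer gets the same batch_size line); batch_size: 1 simply matches the value the label layers already use:

layer {
  name: "training"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "training"
    batch_size: 1
    backend: LMDB
  }
}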

seanbell commented 8 years ago

batch_size should always be specified. I'm not sure what it means to have a net without a batch_size specified.

That being said, a better user interface would be for caffe to raise an error instead of hanging. Feel free to PR this change.
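
Until such a check exists in caffe itself, a small pre-flight script along these lines can catch the problem before training starts (just a sketch, assuming a pycaffe install and that the net definition is in net.prototxt):

import sys
from google.protobuf import text_format
from caffe.proto import caffe_pb2

# Parse the net definition into a NetParameter message.
net = caffe_pb2.NetParameter()
with open('net.prototxt') as f:
    text_format.Merge(f.read(), net)

# Flag any Data layer whose data_param has no batch_size set.
ok = True
for layer in net.layer:
    if layer.type == 'Data' and not layer.data_param.HasField('batch_size'):
        print('Data layer "%s" has no batch_size set' % layer.name)
        ok = False

sys.exit(0 if ok else 1)

Running it before caffe train would turn the silent hang into an explicit message.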

royitaqi commented 8 years ago

@seanbell Sorry, I'm a first-timer on GitHub: what does "PR" mean?

seanbell commented 8 years ago

It means to create a Pull Request.

Some docs: https://help.github.com/articles/using-pull-requests/ https://help.github.com/articles/creating-a-pull-request/

cdluminate commented 8 years ago

Can caffe report the reason (e.g. a missing batch_size) when parameters required by caffe.proto are missing? @seanbell