royitaqi opened this issue 8 years ago
The problem appeared to be solved by adding `batch_size: 1` to both the training and testing data layers.
But I'm still not sure why adding this prevents the hang. Any insight from you guys would be helpful!
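For reference, here is a minimal sketch of what that fix looks like in the net definition; the layer names and lmdb paths are illustrative, not taken from my actual prototxt:

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "division_train_lmdb"  # illustrative path to the training lmdb
    backend: LMDB
    batch_size: 1                  # the field that was missing and caused the hang
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TEST }
  data_param {
    source: "division_test_lmdb"   # separate lmdb copy for the TEST phase
    backend: LMDB
    batch_size: 1
  }
}
```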
`batch_size` should always be specified. I'm not sure what it means to have a net without a `batch_size` specified.
That being said, a better user interface would be to have Caffe raise an error instead of hanging. Feel free to PR this change.
@seanbell Sorry, I'm a first-timer on GitHub: what is a "PR"?
It means to create a Pull Request.
Some docs: https://help.github.com/articles/using-pull-requests/ https://help.github.com/articles/creating-a-pull-request/
Could Caffe report the reason (e.g. a missing `batch_size`) when parameters required by caffe.proto are missing? @seanbell
I created the simplest possible net to learn the division ("/") function (the inputs are A and B, the label is A/B). However, when I try to run the trainer, it hangs forever. If I do `killall caffe`, I see that it's waiting for `BlockingQueue`. I searched around and found it mentioned (didn't note down the source) that this might be caused by the training and testing phases sharing the same lmdb, so I copied the same data into separate training and testing folders, but the problem persists. I'm wondering why it hangs, and how I should debug this problem.
Here is the console output:
Here is my `solver.prototxt`:

Here is my `net.prototxt`:

Here is how I generated the training and label data: