facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

MNIST tutorial does not train in Jupyter Notebook, runs fine from terminal #2272

Open MatthewInkawhich opened 6 years ago

MatthewInkawhich commented 6 years ago

Machine: MacBook Pro 2015
OS: macOS High Sierra
Jupyter: Version 4.4.0
Caffe2: Conda CPU-only install
Python: Version 2.7.14 Anaconda

When I run the MNIST.ipynb tutorial in a Jupyter Notebook, my model does not train. My training accuracy either starts at 1.0, or it starts as expected (~0.10) and immediately goes to 1.0. Meanwhile, when I print the loss I see "nan".

When I copied the code into a Python script and ran it from the terminal (Mac), it trained as intended.

As a check, I also copied the code from a training script that I wrote (and know works) into a single cell in a Jupyter Notebook. Again, when run from the Notebook, training accuracy goes to 1.0 while the loss is "nan". I am not encountering this issue on my other Mac laptop, on which Caffe2 was last compiled in January.
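Since the same code behaves differently in the Notebook and in the terminal, one thing that may be worth comparing is which interpreter each of them actually runs. The sketch below is only a suggestion and not something confirmed in this thread; `describe_environment` is a hypothetical helper name. Running it once in a notebook cell and once from the terminal shows whether both use the same Python executable and search path.

```python
from __future__ import print_function

import platform
import sys


def describe_environment():
    """Collect basic facts about the interpreter that is running this code.

    Hypothetical helper for illustration; not part of Caffe2 or Jupyter.
    """
    return {
        "executable": sys.executable,            # path to the Python binary
        "python_version": platform.python_version(),
        "path_head": sys.path[:5],               # first few import locations
    }


# Print one "key = value" line per entry so two runs are easy to diff.
for key, value in sorted(describe_environment().items()):
    print(key, "=", value)
```

If the two outputs differ, the Notebook kernel is probably picking up a different environment than the terminal. If they match, the cause lies elsewhere.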

Here is the training loop in my script:

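# Assumed from the output format under Python 2.7: print() is used as a function
from __future__ import print_function
from caffe2.python import workspace

# train_model and training_iters are defined earlier in the script

# Initialize and create the training network
workspace.RunNetOnce(train_model.param_init_net)
workspace.CreateNet(train_model.net)

# Run training, reporting loss and accuracy after every iteration
for i in range(training_iters):
    workspace.RunNet(train_model.net)
    print("iter:", i)
    print("loss:", workspace.FetchBlob('loss'))
    print("accuracy:", workspace.FetchBlob('accuracy'))
    print("\n")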

Snippet of the output when the script is run from the terminal:

iter: 0
loss: 2.3173385
accuracy: 0.08

iter: 1
loss: 2.2902696
accuracy: 0.17

iter: 2
loss: 2.3026712
accuracy: 0.13

iter: 3
loss: 2.3027816
accuracy: 0.13

...

Snippet of the output when the same code is run in a Jupyter Notebook:

iter: 0
loss: 2.3038619
accuracy: 0.1

iter: 1
loss: nan
accuracy: 1.0

iter: 2
loss: nan
accuracy: 1.0

iter: 3
loss: nan
accuracy: 1.0

...
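To pin down the exact iteration at which the loss stops being a finite number, a small check like the following could be used. It is a minimal sketch, not part of the tutorial; `first_non_finite` is a hypothetical helper, and the loss values are copied from the Notebook output above.

```python
from __future__ import print_function

import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for index, value in enumerate(losses):
        value = float(value)
        if math.isnan(value) or math.isinf(value):
            return index
    return None


# Loss values copied from the Notebook output above.
notebook_losses = [2.3038619, float("nan"), float("nan"), float("nan")]
print(first_non_finite(notebook_losses))  # prints 1
```

In a training loop, the same test could be applied to the fetched loss after each iteration so that training stops as soon as the loss becomes "nan", which makes it easier to inspect the workspace at that point.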

Here are the graphs from the MNIST tutorial Notebook:

[Two screenshots of the training graphs from the MNIST tutorial Notebook]

Has anyone else encountered this, or does anyone have a solution?

Thanks, Matt

henryguyu commented 6 years ago

It fails for me at "workspace.RunNetOnce(train_model.param_init_net)" with the error "[enforce fail at db.h:206] db. Cannot open db: /home/henry/caffe2_notebooks/tutorial_data/mnist/mnist-train-nchw-lmdb of type lmdb Error from operator: ..."

MatthewInkawhich commented 6 years ago

@henryguyu Perhaps this happened when you pasted the error into your comment, but check your path. The MNIST db should be located at:

/home/henry/caffe2_notebooks/tutorial_data/mnist/mnist-train-nchw-lmdb

Notice "mnist", not "minist", and "mnist-train-nchw-lmdb", not "minist-train-nchw-lmdb".
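A "Cannot open db" error can be checked before running the net by testing whether the database directory exists. The sketch below is an illustration under assumptions: `check_lmdb_dir` is a hypothetical helper, and it assumes the usual LMDB layout in which the database directory contains a file named data.mdb.

```python
from __future__ import print_function

import os


def check_lmdb_dir(path):
    """Return a short message saying whether path looks like an LMDB directory.

    Assumes the common LMDB layout: a directory holding a data.mdb file.
    """
    if not os.path.isdir(path):
        return "missing directory: " + path
    if not os.path.isfile(os.path.join(path, "data.mdb")):
        return "no data.mdb in: " + path
    return "ok"


# Path taken from the error message in this thread.
db_path = "/home/henry/caffe2_notebooks/tutorial_data/mnist/mnist-train-nchw-lmdb"
print(check_lmdb_dir(db_path))
```

Anything other than "ok" suggests that the path is misspelled or that the tutorial data has not been downloaded to that location.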

henryguyu commented 6 years ago

@MatthewInkawhich That was just my typo. Thanks anyway. I tried it in a Jupyter Notebook and will now try it from the terminal.