NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Error when training any Torch model #908

Closed jmortizs closed 8 years ago

jmortizs commented 8 years ago

I made a fresh installation of Ubuntu and DIGITS. Before this, Caffe and Torch both worked fine for me, but now when I try to train a Torch model (LeNet/AlexNet/custom) I get the following error:

ERROR: /usr/share/lua/5.1/threads/threads.lua:183: [thread 3 callback] /usr/share/digits/tools/torch/utils.lua:232: error loading module 'lightningmdb' from file '/usr/lib/lua/5.1/lightningmdb.so'

Last thing I tried:

~$ sudo luarocks install lightningmdb
Installing https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/lightningmdb-0.9.18.2-1.src.rock...
Using https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/lightningmdb-0.9.18.2-1.src.rock... switching to 'build' mode

Error: Could not find library file for LMDB
  No file liblmdb.a in /usr/lib
  No file liblmdb.so in /usr/lib
  No file matching liblmdb.so.* in /usr/lib
You may have to install LMDB in your system and/or pass LMDB_DIR or LMDB_LIBDIR to the luarocks command.
Example: luarocks install lightningmdb LMDB_DIR=/usr/local
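
The luarocks message above reports that it looked only in /usr/lib. As a rough aid, the sketch below lists which directories actually contain an LMDB library, so that the right one can be passed as LMDB_LIBDIR. The candidate directory list and the helper name `find_lmdb_libdirs` are assumptions made for illustration; they are not part of luarocks or DIGITS.

```python
import glob
import os

# Hypothetical candidate list; adjust it for your distribution.
CANDIDATE_DIRS = [
    "/usr/lib",
    "/usr/lib/x86_64-linux-gnu",
    "/usr/local/lib",
]


def find_lmdb_libdirs(candidates=CANDIDATE_DIRS):
    """Return the directories that contain liblmdb.a or liblmdb.so*."""
    found = []
    for directory in candidates:
        # Same file name patterns that the luarocks error message lists.
        static_libs = glob.glob(os.path.join(directory, "liblmdb.a"))
        shared_libs = glob.glob(os.path.join(directory, "liblmdb.so*"))
        if static_libs or shared_libs:
            found.append(directory)
    return found


if __name__ == "__main__":
    dirs = find_lmdb_libdirs()
    if not dirs:
        print("No LMDB library found in the candidate directories")
    for directory in dirs:
        print("LMDB_LIBDIR candidate:", directory)
```

If nothing is printed as a candidate, LMDB itself is probably not installed in any of the listed locations, which matches the "You may have to install LMDB in your system" hint.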

gheinrich commented 8 years ago

Hello, you might find it easier to install Torch through .deb packages. If you wish to install from source, please follow the steps outlined in our documentation there. Note for example how the recommended command to install lightningmdb differs from the command you quoted in your bug report.

jmortizs commented 8 years ago

Thank you for your response, but I have been dealing with this for a week and I think I fucked up everything. Can you please tell me how to remove everything related to DIGITS, even Caffe/Torch? Then I'll try a fresh reinstall.

gheinrich commented 8 years ago

Sorry, I don't know what you've done, as you don't seem to have followed the instructions, so it's difficult to say how to undo it.

lukeyeager commented 8 years ago

Greg, I've been seeing some funny Torch errors with TravisCI on my repo. Maybe they're related? Maybe something changed in Torch recently?

https://travis-ci.org/lukeyeager/DIGITS/jobs/144023101

The errors look related to LMDB. I'll try to find some time to dig into this today.

cicero19 commented 8 years ago

I am having trouble running the text-classification-example. The error I am getting is:

ERROR: /home/smhml/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/smhml/digits/tools/torch/utils.lua:203: error loading module 'lightningmdb' from file '/home/smhml/torch/install/lib/lua/5.1/lightningmdb.so':

this is despite installing lightningmdb with this command and following the install instructions exactly:

luarocks install lightningmdb LMDB_INCDIR=/usr/include LMDB_LIBDIR=/usr/lib/x86_64-linux-gnu

gheinrich commented 8 years ago

Hi @cicero19 are you also seeing:

/home/travis/torch/install/lib/lua/5.1/lightningmdb.so: undefined symbol: mdb_txn_id

This seems to be a regression in lightningmdb 0.9.18.2-1.

Try doing:

luarocks install lightningmdb 0.9.18.1-1 LMDB_INCDIR=/usr/include LMDB_LIBDIR=/usr/lib/x86_64-linux-gnu

cc @shmul
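
An "undefined symbol: mdb_txn_id" message means that lightningmdb.so refers to a function that the LMDB shared library found at load time does not export. A minimal sketch for checking this directly with Python's `ctypes` follows. The helper name `has_symbol` is hypothetical, and the library path is an assumption based on the LMDB_LIBDIR used in this thread; it may differ on your system.

```python
import ctypes


def has_symbol(lib_path, symbol):
    """Return True if the shared library at lib_path exports `symbol`."""
    # CDLL raises OSError if the library itself cannot be loaded.
    lib = ctypes.CDLL(lib_path)
    # Attribute lookup on a CDLL object resolves the symbol by name and
    # raises AttributeError when it is missing, so hasattr() is enough.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    path = "/usr/lib/x86_64-linux-gnu/liblmdb.so"  # assumed location
    try:
        print("mdb_txn_id exported:", has_symbol(path, "mdb_txn_id"))
    except OSError as exc:
        print("could not load", path, "->", exc)
```

If this prints False, the installed LMDB does not provide the function that this lightningmdb release expects, which would be consistent with pinning the older rock as suggested above.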

jmortizs commented 8 years ago

This is unbelievable... I just removed and reinstalled Ubuntu 14.04 and CUDA/DIGITS through .deb packages, following the instructions at https://github.com/NVIDIA/DIGITS/blob/master/docs/InstallCuda.md and https://github.com/NVIDIA/DIGITS/blob/master/docs/UbuntuInstall.md, only to get the same error that showed up at first:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/share/digits/digits/model/views.py", line 164, in models_visualize_network
    ret = fw.get_network_visualization(flask.request.form['custom_network'])
  File "/usr/share/digits/digits/frameworks/torch_framework.py", line 150, in get_network_visualization
    raise NetworkVisualizationError(''.join(unrecognized_output))
NetworkVisualizationError: u"/usr/share/lua/5.1/hdf5/init.lua:15 Unable to find the HDF5 lib we were built against - trying to find it elsewhere\t\n/usr/bin/luajit: /usr/share/lua/5.1/trepl/init.lua:384: /usr/share/lua/5.1/trepl/init.lua:384: /usr/share/lua/5.1/hdf5/ffi.lua:29: libhdf5.so: cannot open shared object file: No such file or directory\nstack traceback:\n\t[C]: in function 'error'\n\t/usr/share/lua/5.1/trepl/init.lua:384: in function 'require'\n\t/usr/share/digits/tools/torch/main.lua:173: in main chunk\n\t[C]: in function 'dofile'\n\t/usr/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk\n\t[C]: at 0x00406670\n"

That happens when you try to visualize/train any Torch model, as I commented here: https://groups.google.com/forum/#!topic/digits-users/JkM18e4HVkM
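
The traceback above ends with "libhdf5.so: cannot open shared object file", meaning the dynamic loader cannot find a library under that exact name. The sketch below reproduces that check outside Torch. It assumes the Torch hdf5 binding asks the loader for the same name shown in the message; `can_load` is a hypothetical helper name.

```python
import ctypes
import ctypes.util


def can_load(lib_name):
    """Return True if the dynamic loader can open lib_name."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # The exact name from the error message.
    print("libhdf5.so loadable:", can_load("libhdf5.so"))
    # What the system's library search resolves "hdf5" to, if anything.
    print("find_library('hdf5'):", ctypes.util.find_library("hdf5"))
```

If the first line prints False while the second shows a versioned file name, an HDF5 library is installed but not reachable under the unversioned name that the error message mentions.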

lukeyeager commented 8 years ago

We'll look into it.

In the meantime, you may find nvidia-docker to be a helpful solution if you can't figure out how to get your machine back to a clean state: https://github.com/NVIDIA/nvidia-docker/wiki/Installation https://github.com/NVIDIA/nvidia-docker/wiki/DIGITS

cicero19 commented 8 years ago

@gheinrich that seemed to solve the problem, but now I'm running into an illegal memory access issue:

ERROR: cuda runtime error (77) : an illegal memory access was encountered at /home/smhml/torch/extra/cutorch/lib/THC/generic/THCStorage.c:30

gheinrich commented 8 years ago

@cicero19 did you set mean subtraction to none when creating the model?

shmul commented 8 years ago

Hi,

I just replied to the lightningmdb thread - https://github.com/shmul/lightningmdb/issues/13 .

cicero19 commented 8 years ago

@gheinrich indeed setting mean subtraction to none solves this error. Thanks! Great addition to digits.

jmortizs commented 8 years ago

Solved one error, got a new one. Right now:

2016-07-16 10:53:32 [INFO ] Loading mean tensor from /usr/share/digits/digits/jobs/20160711-160447-22cb/mean.jpg file
2016-07-16 10:53:32 [INFO ] Loading label definitions from /usr/share/digits/digits/jobs/20160711-160447-22cb/labels.txt file
2016-07-16 10:53:32 [INFO ] found 2 categories
2016-07-16 10:53:32 [INFO ] creating data readers
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] found 2376 images in train db/usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] found 792 images in train db/usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] Loading network definition from /usr/share/digits/digits/jobs/20160716-105331-96dc/model
Using CuDNN backend
2016-07-16 10:53:34 [INFO ] Train batch size is 10 and validation batch size is 10
2016-07-16 10:53:34 [INFO ] Network definition:
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> output]
(1): cudnn.SpatialConvolution(1 -> 60, 3x3)
(2): cudnn.ReLU
(3): cudnn.SpatialMaxPooling(2x2, 2,2)
(4): cudnn.SpatialConvolution(60 -> 120, 3x3)
(5): cudnn.ReLU
(6): cudnn.SpatialMaxPooling(2x2, 2,2)
(7): nn.View(-1)
(8): nn.Linear(211680 -> 100)
(9): nn.Threshold
(10): nn.Dropout(0.500000)
(11): nn.Linear(100 -> 2)
(12): nn.LogSoftMax
}
2016-07-16 10:53:34 [INFO ] Network definition ends
2016-07-16 10:53:34 [INFO ] switching to CUDA
2016-07-16 10:53:35 [INFO ] initializing the parameters for learning rate policy: step
2016-07-16 10:53:35 [INFO ] initializing the parameters for Optimizer
2016-07-16 10:53:35 [INFO ] During training. details will be logged after every 297 images
2016-07-16 10:53:35 [INFO ] Training epochs to be completed for each validation : 1
2016-07-16 10:53:35 [INFO ] Training epochs to be completed before taking a snapshot : 1
2016-07-16 10:53:35 [INFO ] While logging, epoch value will be rounded to 3 significant digits
2016-07-16 10:53:35 [INFO ] Model weights will be saved as snapshot_<EPOCH>_Weights.t7
2016-07-16 10:53:35 [INFO ] started training the model
2016-07-16 10:53:35 [FAIL] /usr/share/lua/5.1/nn/Container.lua:67:
In 8 module of nn.Sequential:
/usr/share/lua/5.1/nn/Linear.lua:66: size mismatch at /tmp/buildd/torch7-0.9.98/extra/cutorch/lib/THC/THCTensorMathBlas.cu:90
stack traceback:
[C]: in function 'addmm'
/usr/share/lua/5.1/nn/Linear.lua:66: in function </usr/share/lua/5.1/nn/Linear.lua:53>
[C]: in function 'xpcall'
/usr/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/usr/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/usr/share/digits/tools/torch/main.lua:625: in function 'Validation'
/usr/share/digits/tools/torch/main.lua:782: in main chunk
[C]: in function 'dofile'
/usr/share/digits/tools/torch/wrapper.lua:25: in function </usr/share/digits/tools/torch/wrapper.lua:25>
[C]: in function 'xpcall'
/usr/share/digits/tools/torch/wrapper.lua:25: in main chunk
[C]: in function 'dofile'
/usr/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
DIGITS Lua Error
stack traceback:
[C]: in function 'error'
/usr/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/usr/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/usr/share/digits/tools/torch/main.lua:625: in function 'Validation'
/usr/share/digits/tools/torch/main.lua:782: in main chunk
[C]: in function 'dofile'
/usr/share/digits/tools/torch/wrapper.lua:25: in function </usr/share/digits/tools/torch/wrapper.lua:25>
[C]: in function 'xpcall'
/usr/share/digits/tools/torch/wrapper.lua:25: in main chunk
[C]: in function 'dofile'
/usr/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

gheinrich commented 8 years ago

@jmortizs it looks like you didn't use exactly the data from the tutorial. If you change the feature length you need to change the dimensions of the linear layers accordingly.
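
The size mismatch in module 8 of the log above can be checked by hand. The sketch below computes how many features per image reach the first nn.Linear layer. It assumes the 3x3 convolutions use stride 1 with no padding and the 2x2 pooling layers use stride 2 and round down, which is what the layer printout suggests but is not confirmed in this thread. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Output length of a pooling layer along one dimension (rounding down)."""
    return (size - kernel) // stride + 1


def flattened_size(height, width, channels_out):
    """Features per image after two (3x3 conv -> 2x2 max pool) stages."""
    h, w = height, width
    for _ in range(2):
        h, w = conv_out(h, 3), conv_out(w, 3)
        h, w = pool_out(h, 2, 2), pool_out(w, 2, 2)
    return channels_out * h * w


if __name__ == "__main__":
    # 42x42 input images and 120 output channels, as reported in the log.
    print(flattened_size(42, 42, 120))
```

Under these assumptions a 42x42 input yields 120 x 9 x 9 = 9720 features per image, whereas nn.Linear(211680 -> 100) expects 211680 = 120 x 42 x 42. The first linear layer's input size would therefore have to change to match the actual image dimensions.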

jmortizs commented 8 years ago

Sorry, this has stressed me out and I didn't see that... now I'm feeling foolish. Everything is running fine now. Thank you.