jmortizs closed this issue 8 years ago
Hello, you might find it easier to install Torch through .deb packages. If you wish to install from source, please follow the steps outlined in our documentation. Note, for example, how the recommended command to install lightningmdb differs from the command you quoted in your bug report.
Thank you for your response, but I've been dealing with this for a week and I think I fucked up everything. Can you please tell me how to remove everything related to DIGITS, even Caffe/Torch? Then I'll try a fresh reinstall.
Sorry, I don't know what you've done, as you don't seem to have followed the instructions, so it's difficult to say how to undo it.
Greg, I've been seeing some funny Torch errors with TravisCI on my repo. Maybe they're related? Maybe something changed in Torch recently?
https://travis-ci.org/lukeyeager/DIGITS/jobs/144023101
The errors look related to LMDB. I'll try to find some time to dig into this today.
I am having trouble running the text-classification-example. The error I am getting is:
ERROR: /home/smhml/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/smhml/digits/tools/torch/utils.lua:203: error loading module 'lightningmdb' from file '/home/smhml/torch/install/lib/lua/5.1/lightningmdb.so':
this is despite installing lightningmdb with this command and following the install instructions exactly:
luarocks install lightningmdb LMDB_INCDIR=/usr/include LMDB_LIBDIR=/usr/lib/x86_64-linux-gnu
Hi @cicero19 are you also seeing:
/home/travis/torch/install/lib/lua/5.1/lightningmdb.so: undefined symbol: mdb_txn_id
This seems to be a regression in lightningmdb 0.9.18.2-1.
Try doing:
luarocks install lightningmdb 0.9.18.1-1 LMDB_INCDIR=/usr/include LMDB_LIBDIR=/usr/lib/x86_64-linux-gnu
cc @shmul
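For anyone hitting the same `undefined symbol` error, one way to check whether the LMDB library on the machine actually exports the symbol that lightningmdb.so is asking for is sketched below. This is only a diagnostic sketch in Python and is not part of DIGITS or lightningmdb. The helper name `exports_symbol` is made up for illustration, and the script assumes the dynamic loader can locate the library under the name `lmdb`.

```python
import ctypes
import ctypes.util


def exports_symbol(library, symbol):
    """Return True/False if `library` loads and does/doesn't export `symbol`.

    Returns None when the library itself cannot be loaded.
    """
    try:
        handle = ctypes.CDLL(library)
    except OSError:
        return None
    # ctypes looks symbols up on attribute access; a missing symbol raises
    # AttributeError, which hasattr() turns into False.
    return hasattr(handle, symbol)


if __name__ == "__main__":
    # find_library returns a file name for the library, or None if not found.
    name = ctypes.util.find_library("lmdb")
    if name is None:
        print("liblmdb was not found by the dynamic loader")
    else:
        print(name, "exports mdb_txn_id:", exports_symbol(name, "mdb_txn_id"))
```

If this reports `False`, the installed liblmdb does not provide `mdb_txn_id`, which would be consistent with the load error above and with pinning lightningmdb to 0.9.18.1-1 as a workaround.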
This is unbelievable... I just removed and reinstalled Ubuntu 14.04 and CUDA/DIGITS through .deb packages, following the instructions here https://github.com/NVIDIA/DIGITS/blob/master/docs/InstallCuda.md and https://github.com/NVIDIA/DIGITS/blob/master/docs/UbuntuInstall.md only to get the same error that showed up at first:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/lib/python2.7/dist-packages/flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/share/digits/digits/model/views.py", line 164, in models_visualize_network
ret = fw.get_network_visualization(flask.request.form['custom_network'])
File "/usr/share/digits/digits/frameworks/torch_framework.py", line 150, in get_network_visualization
raise NetworkVisualizationError(''.join(unrecognized_output))
NetworkVisualizationError: u"/usr/share/lua/5.1/hdf5/init.lua:15 Unable to find the HDF5 lib we were built against - trying to find it elsewhere\t\n/usr/bin/luajit: /usr/share/lua/5.1/trepl/init.lua:384: /usr/share/lua/5.1/trepl/init.lua:384: /usr/share/lua/5.1/hdf5/ffi.lua:29: libhdf5.so: cannot open shared object file: No such file or directory\nstack traceback:\n\t[C]: in function 'error'\n\t/usr/share/lua/5.1/trepl/init.lua:384: in function 'require'\n\t/usr/share/digits/tools/torch/main.lua:173: in main chunk\n\t[C]: in function 'dofile'\n\t/usr/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk\n\t[C]: at 0x00406670\n"
That happens when you try to visualize or train any Torch model, as I commented here: https://groups.google.com/forum/#!topic/digits-users/JkM18e4HVkM
We'll look into it.
In the meantime, you may find nvidia-docker to be a helpful solution if you can't figure out how to get your machine back to a clean state: https://github.com/NVIDIA/nvidia-docker/wiki/Installation https://github.com/NVIDIA/nvidia-docker/wiki/DIGITS
@gheinrich that seemed to solve the problem, but now I'm running into an illegal memory access issue:
ERROR: cuda runtime error (77) : an illegal memory access was encountered at /home/smhml/torch/extra/cutorch/lib/THC/generic/THCStorage.c:30
@cicero19 did you set mean subtraction to none when creating the model?
Hi,
I just replied to the lightningmdb thread - https://github.com/shmul/lightningmdb/issues/13 .
@gheinrich indeed setting mean subtraction to none solves this error. Thanks! Great addition to digits.
Solved one error, got a new one... Right now:
2016-07-16 10:53:32 [INFO ] Loading mean tensor from /usr/share/digits/digits/jobs/20160711-160447-22cb/mean.jpg file
2016-07-16 10:53:32 [INFO ] Loading label definitions from /usr/share/digits/digits/jobs/20160711-160447-22cb/labels.txt file
2016-07-16 10:53:32 [INFO ] found 2 categories
2016-07-16 10:53:32 [INFO ] creating data readers
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] found 2376 images in train db/usr/share/digits/digits/jobs/20160711-160447-22cb/train_db
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] opening LMDB database: /usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] Image channels are 1, Image width is 42 and Image height is 42
2016-07-16 10:53:33 [INFO ] found 792 images in train db/usr/share/digits/digits/jobs/20160711-160447-22cb/val_db
2016-07-16 10:53:33 [INFO ] Loading network definition from /usr/share/digits/digits/jobs/20160716-105331-96dc/model
Using CuDNN backend
2016-07-16 10:53:34 [INFO ] Train batch size is 10 and validation batch size is 10
2016-07-16 10:53:34 [INFO ] Network definition:
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> output]
(1): cudnn.SpatialConvolution(1 -> 60, 3x3)
(2): cudnn.ReLU
(3): cudnn.SpatialMaxPooling(2x2, 2,2)
(4): cudnn.SpatialConvolution(60 -> 120, 3x3)
(5): cudnn.ReLU
(6): cudnn.SpatialMaxPooling(2x2, 2,2)
(7): nn.View(-1)
(8): nn.Linear(211680 -> 100)
(9): nn.Threshold
(10): nn.Dropout(0.500000)
(11): nn.Linear(100 -> 2)
(12): nn.LogSoftMax
}
2016-07-16 10:53:34 [INFO ] Network definition ends
2016-07-16 10:53:34 [INFO ] switching to CUDA
2016-07-16 10:53:35 [INFO ] initializing the parameters for learning rate policy: step
2016-07-16 10:53:35 [INFO ] initializing the parameters for Optimizer
2016-07-16 10:53:35 [INFO ] During training. details will be logged after every 297 images
2016-07-16 10:53:35 [INFO ] Training epochs to be completed for each validation : 1
2016-07-16 10:53:35 [INFO ] Training epochs to be completed before taking a snapshot : 1
2016-07-16 10:53:35 [INFO ] While logging, epoch value will be rounded to 3 significant digits
2016-07-16 10:53:35 [INFO ] Model weights will be saved as snapshot_<EPOCH>_Weights.t7
2016-07-16 10:53:35 [INFO ] started training the model
2016-07-16 10:53:35 [FAIL] /usr/share/lua/5.1/nn/Container.lua:67:
In 8 module of nn.Sequential:
/usr/share/lua/5.1/nn/Linear.lua:66: size mismatch at /tmp/buildd/torch7-0.9.98/extra/cutorch/lib/THC/THCTensorMathBlas.cu:90
stack traceback:
[C]: in function 'addmm'
/usr/share/lua/5.1/nn/Linear.lua:66: in function </usr/share/lua/5.1/nn/Linear.lua:53>
[C]: in function 'xpcall'
/usr/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/usr/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/usr/share/digits/tools/torch/main.lua:625: in function 'Validation'
/usr/share/digits/tools/torch/main.lua:782: in main chunk
[C]: in function 'dofile'
/usr/share/digits/tools/torch/wrapper.lua:25: in function </usr/share/digits/tools/torch/wrapper.lua:25>
[C]: in function 'xpcall'
/usr/share/digits/tools/torch/wrapper.lua:25: in main chunk
[C]: in function 'dofile'
/usr/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
DIGITS Lua Error
stack traceback:
[C]: in function 'error'
/usr/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/usr/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/usr/share/digits/tools/torch/main.lua:625: in function 'Validation'
/usr/share/digits/tools/torch/main.lua:782: in main chunk
[C]: in function 'dofile'
/usr/share/digits/tools/torch/wrapper.lua:25: in function </usr/share/digits/tools/torch/wrapper.lua:25>
[C]: in function 'xpcall'
/usr/share/digits/tools/torch/wrapper.lua:25: in main chunk
[C]: in function 'dofile'
/usr/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
@jmortizs it looks like you didn't use exactly the data from the tutorial. If you change the feature length you need to change the dimensions of the linear layers accordingly.
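To make the dimension check concrete, here is a rough sketch of how the input size of the first linear layer follows from the image size for the network printed above. It is an illustration in Python, not DIGITS or Torch code. It assumes the convolutions use stride 1 and no padding (the printed summary shows neither) and that pooling rounds down; the helper names `out_size` and `flattened_features` are made up for this example.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Output length of a conv/pool window along one dimension (floor mode)."""
    return (size + 2 * pad - kernel) // stride + 1


def flattened_features(height, width):
    """Per-sample feature count entering the first nn.Linear of the net above."""
    # (kernel, stride) for: conv 3x3, max-pool 2x2/2, conv 3x3, max-pool 2x2/2.
    # Assumption: convolutions use stride 1 and no padding.
    for kernel, stride in ((3, 1), (2, 2), (3, 1), (2, 2)):
        height = out_size(height, kernel, stride)
        width = out_size(width, kernel, stride)
    # The second convolution produces 120 channels.
    return 120 * height * width


if __name__ == "__main__":
    print(flattened_features(42, 42))    # 42x42 images, as reported in the log
    print(flattened_features(174, 174))  # an input size that matches the model
```

Under these assumptions, a 42x42 image yields 120 * 9 * 9 = 9720 features per sample, whereas the 211680 inputs expected by `nn.Linear(211680 -> 100)` would correspond to, for example, a 174x174 image. Either the images or the first linear layer has to change so that the two numbers agree.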
Sorry, this has stressed me out and I didn't see that... now I'm feeling foolish. Everything is running fine now. Thank you.
I made a fresh installation of Ubuntu and DIGITS. Before this, Caffe and Torch both worked fine for me, but now when I try to train a Torch model (LeNet/AlexNet/Custom) I get the following error:
ERROR: /usr/share/lua/5.1/threads/threads.lua:183: [thread 3 callback] /usr/share/digits/tools/torch/utils.lua:232: error loading module 'lightningmdb' from file '/usr/lib/lua/5.1/lightningmdb.so'
Last thing I tried:
~$ sudo luarocks install lightningmdb
Installing https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/lightningmdb-0.9.18.2-1.src.rock...
Using https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/lightningmdb-0.9.18.2-1.src.rock... switching to 'build' mode
Error: Could not find library file for LMDB
No file liblmdb.a in /usr/lib
No file liblmdb.so in /usr/lib
No file matching liblmdb.so.* in /usr/lib
You may have to install LMDB in your system and/or pass LMDB_DIR or LMDB_LIBDIR to the luarocks command.
Example: luarocks install lightningmdb LMDB_DIR=/usr/local
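The luarocks message above says it only looked in /usr/lib. A small sketch for finding which directory actually holds liblmdb, so that it can be passed as `LMDB_LIBDIR`, could look like the following. It is written in Python purely as an illustration; the function name `find_lmdb_libdirs` and the list of candidate directories are assumptions, not something luarocks or DIGITS provides.

```python
import glob
import os


def find_lmdb_libdirs(candidates):
    """Return the candidate directories that contain a liblmdb library file."""
    # Same file names that the luarocks error message reports looking for.
    patterns = ("liblmdb.a", "liblmdb.so", "liblmdb.so.*")
    found = []
    for directory in candidates:
        if any(glob.glob(os.path.join(directory, p)) for p in patterns):
            found.append(directory)
    return found


if __name__ == "__main__":
    # Hypothetical list of places to look; adjust for your distribution.
    dirs = ["/usr/lib", "/usr/lib/x86_64-linux-gnu", "/usr/local/lib"]
    for libdir in find_lmdb_libdirs(dirs):
        print("liblmdb found in", libdir, "(candidate for LMDB_LIBDIR)")
```

If nothing is printed, LMDB itself is probably not installed in any of the listed directories, which matches the hint in the luarocks output about installing LMDB first.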