benanne / kaggle-ndsb

Winning solution for the National Data Science Bowl competition on Kaggle (plankton classification)
MIT License

There is a problem when running the command "python train_convnet.py convroll4" #2

Closed: lilinghai closed this issue 9 years ago

lilinghai commented 9 years ago

How to solve it?

```
E:\kaggle-ndsb>python train_convnet.py convroll4
Traceback (most recent call last):
  File "train_convnet.py", line 23, in <module>
    import nn_plankton
  File "E:\kaggle-ndsb\nn_plankton.py", line 161, in <module>
    class BatchInterleaveLayer(nn.layers.MultipleInputsLayer):
AttributeError: 'module' object has no attribute 'MultipleInputsLayer'
```

benanne commented 9 years ago

The version of Lasagne you are using is probably too new :) This code was written when Lasagne was still under heavy development, and it has not been updated since. MultipleInputsLayer has been renamed to MergeLayer, for example.

The easiest solution is just to install commit f445b71 (probably best to do it in a virtualenv or so), the code is known to work with that version. Instructions on how to do this are in the documentation: https://github.com/benanne/kaggle-ndsb/blob/master/doc.pdf (Section 5.2, software dependencies).
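If pinning really isn't an option, a minimal sketch of a compatibility alias for this one rename might look like the following. This is only an illustration, assuming the rename is the only incompatibility you hit (it usually isn't), so the pinned commit remains the reliable route:

```python
# Hypothetical shim for newer Lasagne versions: alias the old name to the
# renamed class. This covers only the MultipleInputsLayer -> MergeLayer
# rename; other API changes in newer Lasagne will still surface.
import lasagne.layers

if not hasattr(lasagne.layers, "MultipleInputsLayer"):
    lasagne.layers.MultipleInputsLayer = lasagne.layers.MergeLayer
```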

mccallm-cgrb commented 9 years ago

Hi,

I am working for the OSU plankton lab and ran into the same error with an updated version of Lasagne, but the commit you've linked to (f445b71) seems to be either broken or non-existent.

When trying the command from the documentation, I received the following output:

```
$ pip install git+git://github.com/benanne/Lasagne.git@f445b71
Downloading/unpacking git+git://github.com/benanne/Lasagne.git@f445b71
  Cloning git://github.com/benanne/Lasagne.git (to f445b71) to /tmp/pip-Z5sdvQ-build
  Could not find a tag or branch 'f445b71', assuming commit.
```

Do you know if the link is broken or possibly a typo? Any help getting the correct version would be greatly appreciated.

Thanks, Miles

benanne commented 9 years ago

I don't see any problems with the output you pasted. It says "assuming commit" which is what it's supposed to do. What is the exact issue you're facing?

mccallm-cgrb commented 9 years ago

The command downloads Lasagne successfully, but doesn't seem to check out the specific commit mentioned. While trying to generate the solution by running the setup Python files, "train_convnet.py" throws errors about references to Lasagne, for example 'MultipleInputsLayer' having been renamed to 'MergeLayer' in later versions.

Here is the current output of "train_convnet.py" for reference:

```
Traceback (most recent call last):
  File "train_convnet.py", line 44, in <module>
    model = config.build_model()
  File "/home/planktonlab/Downloads/kaggle-ndsb-master/configurations/convroll4.py", line 79, in build_model
    l1a = Conv2DLayer(l0c, num_filters=32, filter_size=(3, 3), border_mode="same", W=nn_plankton.Conv2DOrthogonal(1.0), b=nn.init.Constant(0.1), nonlinearity=nn_plankton.leaky_relu)
  File "/home/planktonlab/Downloads/kaggle-ndsb-master/tmp_dnn.py", line 109, in __init__
    self.W = self.create_param(W, self.get_W_shape())
AttributeError: 'Conv2DDNNLayer' object has no attribute 'create_param'
```

The errors appear to be related to the version. Have you seen these before?

benanne commented 9 years ago

Yeah, from those errors it looks like it's installing a more recent version anyway. It's weird because the output you posted before says "assuming commit", so it actually looks like it's doing what it's supposed to do. Are you sure you don't have multiple instances of Lasagne installed? Maybe it's importing a different copy?
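A quick way to check which copy Python actually picks up (a sketch; the `__version__` attribute may not exist on very old commits):

```python
# Print the location (and version attribute, if present) of the Lasagne
# copy that gets imported. A path outside your virtualenv means a stale
# system-wide install is shadowing the pinned one.
import lasagne

print(lasagne.__file__)
print(getattr(lasagne, "__version__", "(no __version__ attribute)"))
```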

mccallm-cgrb commented 9 years ago

Thanks for the help! After a somewhat fresh installation of the dependencies, Theano and Lasagne seem to be working and we've moved past this error.

One more quick question:

After creating the data files and validation splits, I got stopped while executing the train_convnet.py script with an error stating Lasagne couldn't find cudnn (output below).


```
miles@CGRB-Desktop:~/Downloads/kaggle-ndsb-master$ python train_convnet.py convroll4
using default validation split: validation_split_v1.pkl

Experiment ID: convroll4-CGRB-Desktop-20150813-141142

Build model
number of parameters: 5475049
layer output shapes:
DenseLayer           (32, 121)
DropoutLayer         (32, 256)
CyclicPoolLayer      (32, 256)
DenseLayer           (128, 256)
DropoutLayer         (128, 1024)
CyclicRollLayer      (128, 1024)
DenseLayer           (128, 256)
DropoutLayer         (128, 12800)
FlattenLayer         (128, 12800)
CyclicConvRollLayer  (128, 512, 5, 5)
MaxPool2DDNNLayer    (128, 128, 5, 5)
Conv2DDNNLayer       (128, 128, 11, 11)
Conv2DDNNLayer       (128, 256, 11, 11)
Conv2DDNNLayer       (128, 256, 11, 11)
CyclicConvRollLayer  (128, 256, 11, 11)
MaxPool2DDNNLayer    (128, 64, 11, 11)
Conv2DDNNLayer       (128, 64, 23, 23)
Conv2DDNNLayer       (128, 128, 23, 23)
Conv2DDNNLayer       (128, 128, 23, 23)
CyclicConvRollLayer  (128, 128, 23, 23)
MaxPool2DDNNLayer    (128, 32, 23, 23)
Conv2DDNNLayer       (128, 32, 47, 47)
Conv2DDNNLayer       (128, 64, 47, 47)
CyclicConvRollLayer  (128, 64, 47, 47)
MaxPool2DDNNLayer    (128, 16, 47, 47)
Conv2DDNNLayer       (128, 16, 95, 95)
Conv2DDNNLayer       (128, 32, 95, 95)
CyclicSliceLayer     (128, 1, 95, 95)
InputLayer           (32, 1, 95, 95)
Traceback (most recent call last):
  File "train_convnet.py", line 70, in <module>
    train_loss = obj.get_loss()
  File "/usr/local/lib/python2.7/dist-packages/lasagne/objectives.py", line 92, in get_loss
    network_output = self.input_layer.get_output(input, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/lasagne/layers/base.py", line 129, in get_output
    layer_input = self.input_layer.get_output(input, *args, **kwargs)
  /\/\/\/\ (cut out redundant lines) /\/\/\/\
  File "/usr/local/lib/python2.7/dist-packages/lasagne/layers/base.py", line 130, in get_output
    return self.get_output_for(layer_input, *args, **kwargs)
  File "/home/miles/Downloads/kaggle-ndsb-master/tmp_dnn.py", line 137, in get_output_for
    raise RuntimeError("cudnn is not available.")
RuntimeError: cudnn is not available.
```


I followed this tutorial to install cuDNN, as it seemed simpler and more streamlined than other suggestions, but none of the three options in the guide changed the errors:
http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html

I also noticed several guides in which cudnn is built around or inside of Caffe. https://github.com/tiangolo/caffe/blob/ubuntu-tutorial-b/docs/install_apt2.md https://github.com/BVLC/caffe/wiki/Install-Caffe-on-EC2-from-scratch-%28Ubuntu,-CUDA-7,-cuDNN%29

The Deep Sea documentation doesn't mention anything about this, but is this an error you have come across regarding cuDNN? If simply adding the cuDNN files into the CUDA directories should work, I will keep trying that; otherwise I will take the Caffe route.

Thanks again.

benanne commented 9 years ago

These two notes from the Theano docs may be relevant:

So make sure you're not doing either of those :) There's also a thread on the mailing list that discusses the installation for Theano specifically: https://groups.google.com/forum/#!topic/theano-users/TKzDReD5v5I
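As a quick sanity check (a sketch, assuming a Theano version from that era with the old CUDA sandbox backend this code uses), you can ask Theano directly whether it can see cuDNN:

```python
# With Theano's old CUDA sandbox backend, dnn_available() reports whether
# cuDNN can be loaded; when it returns False, the helper's .msg attribute
# usually explains why.
from theano.sandbox.cuda import dnn

print(dnn.dnn_available())
if not dnn.dnn_available():
    print(dnn.dnn_available.msg)
```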

mccallm-cgrb commented 9 years ago

Hello again!

Thank you very much for the advice on the last few issues. Since then we have successfully trained all of the basic models except one: convroll_all_broaden_7x7_weightdecay_resume.

After train_convnet.py outputs the build model, we receive the following error:

```
Load model parameters for resuming
Traceback (most recent call last):
  File "train_convnet.py", line 114, in <module>
    resume_metadata = np.load(config.resume_path)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 370, in load
    fid = open(file, "rb")
IOError: [Errno 2] No such file or directory: 'metadata/convroll_all_broaden_7x7_weightdecay-paard-20150219-135707.pkl'
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: context is destroyed
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: context is destroyed
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: context is destroyed
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: context is destroyed
Segmentation fault (core dumped)
```

The error does note that the file doesn't exist, so I tried again after manually creating the file in metadata (there are multiple reasons this could go wrong, but it was worth a shot) and received this instead:

```
Load model parameters for resuming
Traceback (most recent call last):
  File "train_convnet.py", line 114, in <module>
    resume_metadata = np.load(config.resume_path)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 382, in load
    fid.seek(-N, 1)  # back-up
IOError: [Errno 22] Invalid argument
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: context is destroyed
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: context is destroyed
Segmentation fault (core dumped)
```

This seems like an odd place to get stuck, as the rest of the models appeared to run similarly and completed just fine. I had little luck searching for the same errors with Python... Is this something you've come across before? Thanks again in advance.

benanne commented 9 years ago

This config was resumed from a crashed training run (hence the _resume). So it needs to load up the most recent data file from this training run, which has an exact timestamp. You need to supply a valid pickle file with the model data/metadata for this to work. It looks like you are providing an invalid file.

But if you're lucky, your training run won't have crashed to begin with, so you won't need to bother with resuming anyway.
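If you do want to resume, you can at least verify beforehand that the file you point resume_path at is a loadable pickle; a sketch (the filename below is a placeholder for your own run's file):

```python
# Check that the resume metadata file unpickles at all before starting a
# long training job. np.load falls back to unpickling for non-.npy files,
# which is what train_convnet.py relies on here.
import numpy as np

resume_path = "metadata/your-earlier-run.pkl"  # hypothetical placeholder
resume_metadata = np.load(resume_path)
print(type(resume_metadata))
```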

mccallm-cgrb commented 9 years ago

Would this imply that the files associated with this specific model in the data and/or configurations folders became corrupt at some point during training? As I understand the scripts so far, the .pkl files in metadata are generated by train_convnet.py. If I need a data file with a valid timestamp, would it make sense to pull down the source and try again?

benanne commented 9 years ago

No, I think there's a misunderstanding. The _resume configs are configs we created when a specific training run crashed and we wanted to continue from the last checkpoint. There is no reason for you to reproduce this exactly if the original training run doesn't crash (which I hope it doesn't). In short, you should not be using the _resume configs at all. The metadata files they reference are not included in the repository anyway.

mccallm-cgrb commented 9 years ago

Ah, that makes much more sense and would explain quite a bit. I'm running out of time today, but that should fix our current issues. I'll post an update when it's all running.

Thanks again for all of your help.

milushev commented 9 years ago

Hi,

I am trying to run your code in a virtualenv.

I get this:

```
Experiment ID: convroll4-sunrise-20150916-083108

Build model
Traceback (most recent call last):
  File "train_convnet.py", line 44, in <module>
    model = config.build_model()
  File "/home/dimiter/Kaggle/Plankton/plankton/configurations/convroll4.py", line 108, in build_model
    l7 = nn.layers.DenseLayer(nn.layers.dropout(l6m, p=0.5), num_units=data.num_classes, nonlinearity=T.nnet.softmax, W=nn_plankton.Orthogonal(1.0))
  File "/home/dimiter/Kaggle/Plankton/plankton/local/lib/python2.7/site-packages/lasagne/layers/dense.py", line 67, in __init__
    self.W = self.create_param(W, (num_inputs, num_units), name="W")
  File "/home/dimiter/Kaggle/Plankton/plankton/local/lib/python2.7/site-packages/lasagne/layers/base.py", line 233, in create_param
    arr = param(shape)
  File "/home/dimiter/Kaggle/Plankton/plankton/local/lib/python2.7/site-packages/lasagne/init.py", line 14, in __call__
    return self.sample(shape)
  File "/home/dimiter/Kaggle/Plankton/plankton/nn_plankton.py", line 43, in sample
    u, _, v = np.linalg.svd(a, full_matrices=False)
  File "/home/dimiter/Kaggle/Plankton/plankton/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 1306, in svd
    _assertNoEmpty2d(a)
  File "/home/dimiter/Kaggle/Plankton/plankton/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 222, in _assertNoEmpty2d
    raise LinAlgError("Arrays cannot be empty")
numpy.linalg.linalg.LinAlgError: Arrays cannot be empty
```

Any ideas what causes the error?

milushev commented 9 years ago

Could it be that I have Lasagne==0.1.dev0 (installed with `pip install git+git://github.com/benanne/Lasagne.git@f445b71`)? Is that version too new?

benanne commented 9 years ago

Did you modify the config in any way? The version of Lasagne you should install for this code to work unchanged is specified in the documentation, so you can check there; I don't know the commit hash by heart. Newer versions are unlikely to work without changes to the code.

milushev commented 9 years ago

Nope, the config is not changed at all. Everything is run in a virtualenv. The Lasagne version should also be OK.

benanne commented 9 years ago

From the error it sounds like an initializer is receiving a shape with a 0 in it as input, which means that one of the layers thinks that its weight matrix is empty. This is obviously wrong. Maybe you could check the output shapes / parameter shapes for all the layers?
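A sketch for dumping every layer's output shape, using the pinned pre-0.1 Lasagne API (get_all_layers and the per-layer get_output_shape() method); `l_out` here is assumed to be the output layer your config builds:

```python
# Walk the network from the output layer and print each layer's output
# shape, mirroring the listing train_convnet.py prints at startup.
# 'l_out' is assumed to be the final layer returned by config.build_model().
import lasagne.layers

for layer in lasagne.layers.get_all_layers(l_out):
    print("%s %r" % (layer.__class__.__name__, layer.get_output_shape()))
```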