StevenReitsma / kaggle-national-data-science-bowl

National Data Science Bowl competition entry for the Best Whale Wow team from Radboud University Nijmegen. We finished 68th.

Convolutional neural network #27

Closed StevenReitsma closed 9 years ago

StevenReitsma commented 9 years ago

About time this is merged to master!

The most important part is the src/lasagne directory, which contains all the necessary training and prediction files. Modified versions of the relevant nolearn library files are placed in src/lasagne/nolearn.

augmenter.py

Rotates, scales, translates and zooms a given data set according to the given parameters.
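
For illustration, a minimal sketch of such an affine augmentation using scikit-image (the actual augmenter works on whole data sets and has more options; the function name and defaults here are made up):

    import numpy as np
    from skimage import transform

    def augment(image, rotation=0.0, scale=1.0, translation=(0, 0)):
        # Build a single affine transform from the given parameters.
        tf = transform.AffineTransform(
            scale=(scale, scale),
            rotation=np.deg2rad(rotation),
            translation=translation,
        )
        # warp() expects the *inverse* coordinate map, so pass tf.inverse.
        # cval=1.0 assumes white-background images such as the plankton data.
        return transform.warp(image, tf.inverse, mode='constant', cval=1.0)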

combiner.py

Performs a linear blend over several Kaggle submission files. This is currently used for model averaging.
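
A linear blend is just a weighted average of the per-class probabilities. A minimal sketch with pandas (this assumes the standard submission layout with an image column followed by one probability column per class; the file names are made up):

    import pandas as pd

    def blend(files, weights):
        # Read each submission, indexed by the image identifier.
        frames = [pd.read_csv(f, index_col='image') for f in files]
        # Weighted sum of the probability columns; weights should sum to 1.
        return sum(w * df for w, df in zip(weights, frames))

    blend(['submission_a.csv', 'submission_b.csv'], [0.6, 0.4]).to_csv('blended.csv')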

custom_layers.py

Contains our custom CNN layers that are not included in lasagne. There are currently only two: SliceLayer and MergeLayer. SliceLayer slices each image in the minibatch into several views: rotations and flips of the original image. The network is then trained on every slice, which increases parameter sharing and thus reduces overfitting. Features are generated for each of the slices; MergeLayer combines these features and flattens them. The output of the MergeLayer is used as input to the DenseLayers at the end of the network.
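
Conceptually (the real layers are Theano expressions inside lasagne layer classes; this NumPy sketch only shows the shape bookkeeping, and the exact set of views is illustrative):

    import numpy as np

    def slice_batch(X):
        # X: (batch, channels, h, w) -> (4 * batch, channels, h, w),
        # holding the original images plus a rotation and two flips.
        views = [
            X,
            np.rot90(X, 1, axes=(2, 3)),  # 90-degree rotation
            X[:, :, :, ::-1],             # horizontal flip
            X[:, :, ::-1, :],             # vertical flip
        ]
        return np.concatenate(views, axis=0)

    def merge_features(F, num_views=4):
        # F: (num_views * batch, num_features) -> (batch, num_views * num_features),
        # concatenating the per-view feature vectors of each image.
        return np.concatenate(np.split(F, num_views, axis=0), axis=1)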

early_stopping.py

Defines criteria for early stopping when the validation loss is no longer decreasing. Prevents overfitting.
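
The usual nolearn pattern for this is an on_epoch_finished callback that tracks the best validation loss and raises StopIteration once it has not improved for a set number of epochs. A sketch (the actual criteria in this file may differ):

    import numpy as np

    class EarlyStopping(object):
        def __init__(self, patience=20):
            self.patience = patience
            self.best_valid = np.inf
            self.best_epoch = 0

        def __call__(self, nn, train_history):
            # Called by nolearn after every epoch with the full history.
            current = train_history[-1]['valid_loss']
            epoch = train_history[-1]['epoch']
            if current < self.best_valid:
                self.best_valid = current
                self.best_epoch = epoch
            elif epoch - self.best_epoch > self.patience:
                # nolearn catches StopIteration and ends training cleanly.
                raise StopIteration()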

gen_test.py

Preprocessing for test images.

gen_train.py

Preprocessing for train images.

imageio.py

Writes images to a binary file to speed up training; HDF5 is used as the file format.
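
A minimal sketch of the HDF5 round trip with h5py (the dataset names 'X' and 'y' are made up; the file's actual layout may differ):

    import h5py
    import numpy as np

    def write_images(path, images, labels):
        # One write up front; training then reads from a single binary file.
        with h5py.File(path, 'w') as f:
            f.create_dataset('X', data=np.asarray(images, dtype=np.float32))
            f.create_dataset('y', data=np.asarray(labels, dtype=np.int32))

    def read_images(path):
        with h5py.File(path, 'r') as f:
            return f['X'][:], f['y'][:]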

iterators.py

Contains some custom iterators that nolearn lacks: ShufflingIterator (reshuffles the data each epoch), DataAugmentationBatchIterator (augments images with rotations etc.) and ScalingBatchIterator (scales images).
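
Each of these follows the same pattern: subclass nolearn's BatchIterator and override transform(). For example, a scaling iterator could look roughly like this (a sketch against the modified nolearn copy bundled in this branch):

    from nolearn.lasagne import BatchIterator

    class ScalingBatchIterator(BatchIterator):
        def transform(self, Xb, yb):
            # Scale raw pixel values to [0, 1] per minibatch.
            Xb, yb = super(ScalingBatchIterator, self).transform(Xb, yb)
            return Xb / 255.0, yb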

learning_rate.py

Adjusts the learning rate over the course of training, in the spirit of simulated annealing.
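
The common nolearn idiom for this is a callback that lowers a Theano shared variable after each epoch; a sketch with a linear schedule (the schedule in this file may differ, and this assumes the named parameter was created as a Theano shared variable):

    import numpy as np

    class AdjustVariable(object):
        def __init__(self, name, start=0.03, stop=0.0001):
            self.name = name  # e.g. 'update_learning_rate'
            self.start, self.stop = start, stop
            self.ls = None

        def __call__(self, nn, train_history):
            if self.ls is None:
                # Precompute one value per epoch, decaying from start to stop.
                self.ls = np.linspace(self.start, self.stop, nn.max_epochs)
            epoch = train_history[-1]['epoch']
            getattr(nn, self.name).set_value(np.float32(self.ls[epoch - 1]))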

modelsaver.py

Saves the model to disk every X epochs.
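
Also implemented as an on_epoch_finished callback; a sketch (the file name pattern is made up):

    import cPickle as pickle

    class ModelSaver(object):
        def __init__(self, every=10, path='model_epoch_%04d.pkl'):
            self.every = every
            self.path = path

        def __call__(self, nn, train_history):
            epoch = train_history[-1]['epoch']
            if epoch % self.every == 0:
                # Pickle the whole net; loading it back restores the weights.
                with open(self.path % epoch, 'wb') as f:
                    pickle.dump(nn, f, -1)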

params.py

Program parameters. Does not include network parameters (these are defined in train_concurrent.py).

predict.py

Predicts the class distribution for the test images using a trained network model. Uses test-time augmentation (we call it diversive prediction): every image is rotated and flipped, predictions are made for each of these views, and the results are averaged uniformly. This increases our score significantly.
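
In sketch form (predict_fn is a hypothetical stand-in for the trained net's probability output, and the set of views is illustrative):

    import numpy as np

    def predict_averaged(predict_fn, X):
        # X: (batch, channels, h, w). Build rotated/flipped views,
        # predict on each, and average the class distributions.
        views = [
            X,
            np.rot90(X, 1, axes=(2, 3)),
            np.rot90(X, 2, axes=(2, 3)),
            np.rot90(X, 3, axes=(2, 3)),
            X[:, :, :, ::-1],
            X[:, :, ::-1, :],
        ]
        preds = [predict_fn(np.ascontiguousarray(v)) for v in views]
        return np.mean(preds, axis=0)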

refit.py

Refits an already trained network on the complete data set (training + validation). This improves the score only a tiny bit; apparently 80% of the data is enough to reach the score we currently have.

train_concurrent.py

Defines the network and starts the training loop.
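
The network is declared in nolearn's declarative style; a heavily stripped-down sketch of what such a definition looks like (the layer sizes and input shape here are made up; the real network is bigger and uses the custom layers and callbacks above):

    from lasagne import layers
    from lasagne.nonlinearities import softmax
    from lasagne.updates import nesterov_momentum
    from nolearn.lasagne import NeuralNet

    net = NeuralNet(
        layers=[
            ('input', layers.InputLayer),
            ('conv1', layers.Conv2DLayer),
            ('pool1', layers.MaxPool2DLayer),
            ('hidden1', layers.DenseLayer),
            ('output', layers.DenseLayer),
        ],
        input_shape=(None, 1, 64, 64),  # hypothetical input size
        conv1_num_filters=32, conv1_filter_size=(3, 3),
        pool1_pool_size=(2, 2),
        hidden1_num_units=500,
        output_num_units=121,           # 121 plankton classes
        output_nonlinearity=softmax,
        update=nesterov_momentum,
        update_learning_rate=0.01,
        update_momentum=0.9,
        max_epochs=100,
    )
    # net.fit(X, y) starts the training loop.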

util.py

Theano utils.

visualize.py

Visualizes filters in the first layer of the network. Visualizing other layers is less interesting.
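
A sketch of the idea with matplotlib (assuming single-channel input, so the first-layer weights have shape (num_filters, 1, h, w)):

    import matplotlib.pyplot as plt

    def plot_first_layer_filters(W, cols=8):
        # W: first-layer weight tensor, shape (num_filters, 1, h, w).
        rows = (W.shape[0] + cols - 1) // cols
        for i in range(W.shape[0]):
            plt.subplot(rows, cols, i + 1)
            plt.imshow(W[i, 0], cmap='gray', interpolation='nearest')
            plt.axis('off')
        plt.show()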

StevenReitsma commented 9 years ago

TODO: remove the duplicate code shared by DataAugmentationBatchIterator and the Augmenter class.

gzuidhof commented 9 years ago

Reviewing (and grasping) this will take me a short while. Could you merge the master branch into this feature branch?

gzuidhof commented 9 years ago

Alright, so far I have skimmed most of the files under src/lasagne.

I can't run much of it out of the box, as Unix commands are used here and there. I could emulate most of them using Cygwin, but I suppose it's easier to look into installing a Linux distro on an external disk like you have.

For tonight I looked in particular at gen_train.py, gen_test.py and imageio.py.

In gen_*.py I found one unused import and some unclear variable names, but I can live with that.


The WTFs-per-minute rate in imageio.py was fairly high.

Excerpt from imageio.py

    train_subdirectories = list(set(glob.glob(os.path.join(IMAGE_SOURCE, "train", "*"))\
             ).difference(set(glob.glob(os.path.join(IMAGE_SOURCE,"train","*.*")))))

After a beer I was able to work out that in the first statement you create a unique list of train data subdirectories: you match files and folders against a glob pattern, and since this pattern yields both files and folders, you then take the set difference with a second set that contains only the files.
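
For reference, listing only the subdirectories can be done more directly, something along these lines (a rough sketch):

    import os

    def train_subdirectories(image_source):
        # Keep only the entries of train/ that are directories.
        train_dir = os.path.join(image_source, "train")
        return [os.path.join(train_dir, entry)
                for entry in sorted(os.listdir(train_dir))
                if os.path.isdir(os.path.join(train_dir, entry))]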

    numberofImages = 0
    for folder in train_subdirectories:
        for fileNameDir in os.walk(folder):
            for fileName in fileNameDir[2]:
                # Only read in the images
                if fileName[-4:] != ".jpg":
                    continue
                numberofImages += 1

Here you seem to be counting the number of images, I guess?
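
If so, the same count can be written as one expression:

    numberofImages = sum(
        fileName.endswith(".jpg")
        for folder in train_subdirectories
        for _, _, fileNames in os.walk(folder)
        for fileName in fileNames)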

Anyway, please understand I am not trying to bash your code. I understand this part is just a hack compared to the meat of this branch, as it was not the focus.

The thing is that I have already written code that does practically the same as the ImageIO._load_train_images_from_disk function above, in a saner way.

I think that with some small modifications (saving not only the patches but also the processed images, possibly to a different file) the preprocessing can be shared between this CNN approach and the K-Means/RBM.

What I would like to propose is to DRY up the codebase and have both approaches use the same basic preprocessing code. Tomorrow (Friday) I should have time to look into achieving this (perhaps you are available for questions then?).

I am interested in your thoughts on this.

gzuidhof commented 9 years ago

The src/cxxnet subfolder is part of this merge request; I don't think this was intended?

This folder contains some config files that refer to paths on your specific machine. Perhaps you could gitignore this folder and remove these files from the pull request?

You should be able to dig them back up by looking at previous versions of the deep branch.

StevenReitsma commented 9 years ago

Yeah, so like I said, large parts of the code are copied from others. We can definitely look into making some of the code more readable, but I wouldn't do this before the competition deadline. I would recommend working on the unsupervised feature extraction rather than on 'drying' the code (or whatever), since that part is still not working. While the CNN code might not be the neatest, at least it works.

You're going to have a bad time trying to actually run the code by the way, since it has a lot of (undocumented) dependencies. Might be a better idea to do that after the deadline as well, if you're still interested.