Streaming preprocessing, batch loading util

gzuidhof commented 9 years ago

Batch loading (`batchloader.py`)

Added a useful iterable class that allows you to do:

for batch in BatchReader(batchsize=1000):
    print(batch)

Each iteration reads `batchsize` patches from the file.
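
For readers who want to see the iteration pattern in isolation, here is a minimal sketch. It is not the code in `batchloader.py`: the class name `SimpleBatchReader`, the `path` argument, and the `.npy` storage format are all assumptions made for illustration. Note that this sketch yields a shorter final batch instead of dropping it.

```python
import numpy as np


class SimpleBatchReader:
    """Hypothetical stand-in for BatchReader: iterates over a .npy file in batches."""

    def __init__(self, path, batchsize=1000):
        # mmap_mode="r" keeps the data on disk; rows are read only when sliced.
        self.data = np.load(path, mmap_mode="r")
        self.batchsize = batchsize

    def __iter__(self):
        for start in range(0, len(self.data), self.batchsize):
            # Copy the slice into memory so the caller gets a plain array.
            # The final slice may hold fewer than `batchsize` rows.
            yield np.array(self.data[start:start + self.batchsize])
```

Because only one slice is copied at a time, memory use stays proportional to `batchsize` rather than to the size of the file.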

Streaming preprocessing (`preprocess.py`)

Instead of keeping all the images in memory and processing them (list to list), it now loads one image, processes it, and writes the result to file before moving on to the next.

Overview:

1. Load all image paths into memory.
2. Generate label tuples <classname (plankton type), filename, filepath>.
3. For each image:
    1. Load image from disk.
    2. Pad or stretch image into a square.
    3. Resize (probably downsize) to a common size.
    4. Extract patches from image.
    5. Flatten image from 2D to 1D.
    6. Write the results to file.
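
The per-image loop above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `preprocess.py`: all function names, the white (255) padding value, nearest-neighbour resizing, the patch size and stride, and the raw float32 output format are hypothetical choices. It only pads (no stretching) and takes already-loaded 2D arrays in place of the disk-loading step.

```python
import numpy as np


def pad_to_square(img, fill=255):
    """Pad a 2D image with a constant value so height == width."""
    h, w = img.shape
    size = max(h, w)
    out = np.full((size, size), fill, dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    out[top:top + h, left:left + w] = img
    return out


def resize_nearest(img, size):
    """Nearest-neighbour resize of a square image to (size, size)."""
    idx = np.arange(size) * img.shape[0] // size
    return img[np.ix_(idx, idx)]


def extract_patches(img, patch, stride):
    """Return patch x patch windows taken every `stride` pixels, each flattened to 1D."""
    rows = range(0, img.shape[0] - patch + 1, stride)
    cols = range(0, img.shape[1] - patch + 1, stride)
    return np.array([img[r:r + patch, c:c + patch].ravel() for r in rows for c in cols])


def preprocess_stream(images, out_file, size=32, patch=6, stride=2):
    """Process one image at a time and append its patches to out_file."""
    written = 0
    for img in images:  # `images` can be a generator that loads lazily
        square = pad_to_square(img)
        small = resize_nearest(square, size)
        patches = extract_patches(small, patch, stride).astype(np.float32)
        patches.tofile(out_file)  # raw float32 rows, appended to the open file
        written += len(patches)
    return written
```

Passing a generator as `images` means only one image and its patches are held in memory at any time, which is the point of the streaming approach.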

It takes just over a minute for the train dataset (fast enough I suppose).

Miscellaneous

Added a util class with:
- Normalization function
- Fancy progress bar function
- Simple unsupervised dataset loader (loads everything into memory, which is quite big)
- Image flattening function (2D to 1D)
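
As a rough idea of what two of these helpers might look like (the thread does not say which normalization is used, so per-feature zero mean and unit variance is an assumption, and both function names are hypothetical):

```python
import numpy as np


def normalize(data, eps=1e-8):
    """Scale each feature (column) to zero mean and unit variance.

    Assumed normalization scheme; eps avoids division by zero for
    constant columns.
    """
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return (data - mean) / (std + eps)


def flatten(img):
    """Flatten a 2D image into a 1D vector, row by row."""
    return np.asarray(img).ravel()
```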

StevenReitsma commented 9 years ago

I would definitely add the omitted samples from the BatchReader since now we're just throwing data away. Rest is OK and can be merged.

gzuidhof commented 9 years ago

To do

- BatchLoader: output the remainder on the last iteration.
- Refactor impatch.npatch.

StevenReitsma commented 9 years ago

https://github.com/StevenReitsma/ml-in-practice/blob/00055f29b036a3ef3b05f5a8bfcf4879521e106a/src/batchreader.py#L44 will fail for batch size < self.batchsize

gzuidhof commented 9 years ago

It seems to work fine; it simply takes the remainder.
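
This is consistent with how slicing behaves in Python and NumPy, assuming the linked line is a slice of the form `data[i:i + batchsize]` (the actual line is not reproduced here, and the variable names below are illustrative only):

```python
import numpy as np

data = np.arange(25)
batchsize = 10

# A slice whose end runs past the array does not raise;
# it is clamped and returns whatever elements remain.
last = data[20:20 + batchsize]
print(len(last))  # 5
```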
