StevenReitsma / kaggle-national-data-science-bowl

National Data Science Bowl competition entry by the Best Whale Wow team of Radboud University Nijmegen. We finished 68th.

Streaming preprocessing, batch loading util #9

Closed gzuidhof closed 9 years ago

gzuidhof commented 9 years ago

Batch loading (batchloader.py)

Added a useful iterable class, BatchReader, that allows you to do:

for batch in BatchReader(batchsize=1000):
    print(batch)

Each iteration reads batchsize patches from the file.
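
A minimal sketch of such an iterable, for illustration only. It is not the code in batchloader.py: it assumes the patches already sit in an in-memory NumPy array passed as a hypothetical `data` argument, whereas the real class reads them from the preprocessed file.

```python
import numpy as np


class BatchReader(object):
    """Illustrative stand-in, not the class from this PR.

    Assumption: `data` is a 2D array with one flattened patch per row.
    """

    def __init__(self, data, batchsize=1000):
        self.data = data
        self.batchsize = batchsize

    def __iter__(self):
        # Step through the rows in strides of `batchsize`.
        # The final batch may hold fewer rows than `batchsize`.
        for start in range(0, len(self.data), self.batchsize):
            yield self.data[start:start + self.batchsize]


patches = np.arange(25).reshape(5, 5)  # 5 fake patches of 5 values each

for batch in BatchReader(patches, batchsize=2):
    print(batch.shape)  # prints (2, 5), (2, 5), (1, 5)
```

Writing `__iter__` as a generator keeps only one batch in flight at a time, which is the point of batch loading.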

Streaming preprocessing (preprocess.py)

Instead of keeping all the images in memory and processing them (list to list), it now loads one image, processes it, and writes it.

Overview:

1. Load all image paths into memory.
2. Generate label tuples <classname (plankton type), filename, filepath>.
3. For each image:
    1. Load the image from disk.
    2. Pad or stretch the image into a square.
    3. Resize (probably downsize) to a common size.
    4. Patch the image.
    5. Flatten the image from 2D to 1D.
    6. Write the results to file.

It takes just over a minute for the train dataset (fast enough I suppose).
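
The per-image loop in the overview can be sketched roughly as follows. This is not the code in preprocess.py. All helper names are hypothetical, and the sketch assumes 2D grayscale uint8 arrays, padding with white (255), nearest-neighbour resizing, non-overlapping patches, and raw bytes as the output format. None of these choices are confirmed by the PR.

```python
import io
import numpy as np


def pad_to_square(img, fill=255):
    """Pad the shorter side with `fill` so the image becomes square."""
    h, w = img.shape
    size = max(h, w)
    out = np.full((size, size), fill, dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    out[top:top + h, left:left + w] = img
    return out


def resize_nearest(img, size):
    """Nearest-neighbour resize of a square image to (size, size)."""
    idx = np.arange(size) * img.shape[0] // size
    return img[np.ix_(idx, idx)]


def extract_patches(img, patch):
    """Cut non-overlapping (patch, patch) tiles and flatten each to 1D."""
    n = img.shape[0] // patch
    tiles = img[:n * patch, :n * patch].reshape(n, patch, n, patch)
    return tiles.swapaxes(1, 2).reshape(n * n, patch * patch)


def preprocess_stream(images, out, size=8, patch=4):
    """Handle one image at a time and write its patches before the next."""
    count = 0
    for img in images:
        rows = extract_patches(resize_nearest(pad_to_square(img), size), patch)
        out.write(rows.astype(np.uint8).tobytes())
        count += len(rows)
    return count


# Two fake images of different shapes stand in for files loaded from disk.
fake_images = (np.zeros((h, w), dtype=np.uint8) for h, w in [(10, 6), (5, 12)])
buf = io.BytesIO()
n_patches = preprocess_stream(fake_images, buf)
print(n_patches)  # 8: two images, four 4x4 patches each
```

Because `images` is consumed lazily and each result is written immediately, memory use stays at roughly one image regardless of dataset size.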

Miscellaneous

StevenReitsma commented 9 years ago

I would definitely add the omitted samples from the BatchReader since now we're just throwing data away. Rest is OK and can be merged.

gzuidhof commented 9 years ago

To do

- BatchLoader: output the remainder on the last iteration.
- Refactor impatch.npatch.

StevenReitsma commented 9 years ago

https://github.com/StevenReitsma/ml-in-practice/blob/00055f29b036a3ef3b05f5a8bfcf4879521e106a/src/batchreader.py#L44 will fail for batch size < self.batchsize

gzuidhof commented 9 years ago

It seems to work fine; it simply takes the remainder.
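
One way this behaviour can arise, assuming the reader slices a NumPy array (the linked line is not reproduced here, so this is a guess at the mechanism): a slice whose stop index runs past the end is clamped instead of raising an error. The variable names below are illustrative.

```python
import numpy as np

data = np.arange(10).reshape(5, 2)  # 5 samples of 2 values each
batchsize = 2
start = 4                           # offset of the last, incomplete batch

# The stop index (6) is past the end of the array, so the slice is clamped.
batch = data[start:start + batchsize]
print(batch.shape)  # (1, 2)
```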