amaiya / ktrain

ktrain is a Python library that makes deep learning and AI more accessible and easier to apply
Apache License 2.0

Cannot get learner from iterator zip object #17

Closed gkaissis closed 4 years ago

gkaissis commented 5 years ago

get_learner fails when the training data is a zip of iterators such as when it is used for image segmentation tasks (while augmenting images and masks together).
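A minimal sketch of why a bare zip is a poor fit here, assuming the learner wants more than plain iteration (for example a length, indexing, or a batch_size attribute). FakeFlow is a hypothetical stand-in for the Keras directory iterators, not the real class, and the exact check that get_learner performs is not shown in this thread.

```python
class FakeFlow:
    """Hypothetical stand-in for a Keras directory iterator (not the real class)."""

    def __init__(self, batches, batch_size):
        self.batches = batches
        self.batch_size = batch_size
        self.n = len(batches) * batch_size
        self._pos = 0

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, idx):
        return self.batches[idx]

    def __iter__(self):
        return self

    def __next__(self):
        # Cycle through the batches indefinitely
        batch = self.batches[self._pos % len(self.batches)]
        self._pos += 1
        return batch


image_gen = FakeFlow(["img0", "img1"], batch_size=4)
mask_gen = FakeFlow(["mask0", "mask1"], batch_size=4)

pairs = zip(image_gen, mask_gen)

# Iteration still works: each step yields one (image_batch, mask_batch) pair
first_pair = next(pairs)

# The zip object itself exposes neither the metadata nor the sequence protocol
has_batch_size = hasattr(pairs, "batch_size")
has_len = hasattr(pairs, "__len__")
has_getitem = hasattr(pairs, "__getitem__")
print(first_pair, has_batch_size, has_len, has_getitem)
```

The individual generators carry batch_size, n, and indexing, but zip only keeps the ability to produce the next pair.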

EDIT:

It works by hacking together a custom Iterator class, but it's not a particularly elegant hack...

image_gen and mask_gen below are keras.preprocessing.image.ImageDataGenerator.flow_from_directory() objects.


class Iterator():

    def __init__(self, image_gen, mask_gen):
        self.image_gen = image_gen
        self.mask_gen = mask_gen
        self.batch_size = image_gen.batch_size
        self.target_size = image_gen.target_size
        self.color_mode = image_gen.color_mode
        self.class_mode = image_gen.class_mode
        self.n = image_gen.n
        self.seed = image_gen.seed
        self.total_batches_seen = image_gen.total_batches_seen

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.image_gen), next(self.mask_gen)

    def __getitem__(self, key):
        return self.image_gen[key], self.mask_gen[key]

Any ideas how we could make this more elegant?
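One possible tidy-up, sketched under assumptions: instead of copying each attribute by hand, the wrapper can forward unknown attribute lookups to the image generator via __getattr__. PairedIterator and ListFlow are hypothetical names used only for illustration, and whether get_learner accepts such a wrapper depends on how ktrain checks the data type, which this thread does not spell out.

```python
class ListFlow:
    """Hypothetical stand-in for a Keras directory iterator."""

    def __init__(self, batches, batch_size, seed):
        self.batches = batches
        self.batch_size = batch_size
        self.seed = seed
        self.n = len(batches) * batch_size
        self._pos = 0

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, key):
        return self.batches[key]

    def __next__(self):
        batch = self.batches[self._pos % len(self.batches)]
        self._pos += 1
        return batch


class PairedIterator:
    """Pairs two generators and forwards unknown attributes to the first one."""

    def __init__(self, image_gen, mask_gen):
        self.image_gen = image_gen
        self.mask_gen = mask_gen

    def __getattr__(self, name):
        # Only called when normal lookup fails, so batch_size, n, seed, etc.
        # are read from image_gen without being copied one by one.
        if name in ("image_gen", "mask_gen"):
            raise AttributeError(name)  # avoid recursion before __init__ has run
        return getattr(self.image_gen, name)

    def __len__(self):
        return len(self.image_gen)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.image_gen), next(self.mask_gen)

    def __getitem__(self, key):
        return self.image_gen[key], self.mask_gen[key]


paired = PairedIterator(
    ListFlow(["img0", "img1"], batch_size=4, seed=1),
    ListFlow(["mask0", "mask1"], batch_size=4, seed=1),
)
print(paired.batch_size, paired.n, len(paired), paired[1])
```

Both generators still need the same seed and batch size so that images and masks stay aligned; the wrapper does not enforce that.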

amaiya commented 5 years ago

Great question and something that's been on my radar. An elegant solution would probably be to create a flexible layer of abstraction around datasets within ktrain itself to better support such scenarios.

I'll plan on looking into this and may use image segmentation as the "guinea pig" task. Thanks for letting me know.

gkaissis commented 5 years ago

Since I detect a whiff of FAST.AI in ktrain, I suppose you are referring to something along the lines of a Dataset (or its PyTorch equivalent)? That would allow us to separate the learner from its data.

amaiya commented 5 years ago

Yes, exactly.

Generally, I've rightly or wrongly tried to limit the number of new object abstractions added on top of Keras, but a data abstraction layer would seem to make sense here and help avoid some ugly situations.

gkaissis commented 5 years ago

I think this is part of a larger software engineering discussion. Abstractions are completely justified if they are clear to the user. This is where I feel fast.ai took a wrong turn. It takes the PyTorch concept of separating the data, the way the data is fed to the model and the model itself and convolutes it into an unintuitive, poorly documented "DataBunch", full of unclear abbreviations inside function calls (freeze_wd? bs? detach? denorm?), opaque, unintuitive docstrings (Get one batch from the data loader of ds_type. Optionally detach and denorm.), and behind-the-scenes interventions (setting dropout, setting weight decay, choosing training policies etc.), which are neither communicated clearly to the user, nor justified. Therefore, every time one needs to fulfil some specific goal with the library which is even slightly outside its core scope (e.g. medical image segmentation with single channel images, a 3D CNN etc.), it becomes necessary to start writing custom classes and tinkering around in the engine room to remove the "state of the art choices" made by the library, thus completely reversing the premise of "it just works".

You have gone a very commendable way by just thinly wrapping Keras, one of the most transparent and well-written API specs ever created, and this carries over 1:1 to ktrain. It fundamentally serves to assist the experienced practitioner, looking for something beyond the basic "model.fit" idea to do state-of-the-art deep learning by providing helpful functions sitting atop inbuilt keras features, which is in my opinion a friendlier way to work, even for a beginner coming from Keras directly and looking to "up the game".

Thus, regarding the data abstraction layer, it could disappear in the background, kinda like the current ArrayLearner does (basically create an ArrayToArray learner for generative modelling of any kind, segmentation, image restoration etc.), thus keeping the training process centred on the learner object instead of splitting it up between the data and its augmentations, the way it is loaded (in memory vs. streamed), and the processing algorithm.

amaiya commented 5 years ago

Great points and I liked the ArrayToArray example.

Some problems might, in fact, require multiple input arrays. A fake news detector, for instance, might require an array for the text, a 1-hot-encoded array for the news source, and a 1-hot-encoded array for the author, each of which is fed to its own embedding layer. These kinds of custom datasets and models should be supported within ktrain in the way you described, as seamlessly and painlessly as possible.
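To make the multi-array idea concrete, here is a small sketch of how such inputs might be assembled. The source names, author names, token IDs, and labels are all made up for illustration, and plain Python lists stand in for the arrays a real model would consume.

```python
def one_hot(index, size):
    """Return a list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical vocabularies for the two categorical inputs
sources = ["wire", "blog", "tabloid"]
authors = ["alice", "bob"]

# Hypothetical articles; `tokens` are already-encoded word IDs
articles = [
    {"tokens": [12, 7, 3], "source": "blog", "author": "alice", "fake": 0},
    {"tokens": [5, 9, 1], "source": "tabloid", "author": "bob", "fake": 1},
]

# One array per model input, all indexed by the same sample position
x_text = [a["tokens"] for a in articles]
x_source = [one_hot(sources.index(a["source"]), len(sources)) for a in articles]
x_author = [one_hot(authors.index(a["author"]), len(authors)) for a in articles]
y = [a["fake"] for a in articles]

# Three inputs and one target, grouped the way a multi-input model expects
train_data = ([x_text, x_source, x_author], y)
print(x_source, x_author)
```

The point is only the shape of the data: several parallel input arrays that share one sample axis, paired with the targets.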

amaiya commented 5 years ago

@gkaissis Better and more flexible support for alternative data formats (such as those used in your image segmentation example) is expected in ktrain ~v0.6.0~ v0.9.0. As of right now, I anticipate that this can be handled with some refactoring and extensions to the Learner hierarchy in a way that is mostly seamless to users or, as you said, in a way that "disappears in the background".

gkaissis commented 5 years ago

Sounds amazing. As I am under extreme pressure from work at the moment, would it be OK if I contributed some documentation/docstring typo fixes, maybe a tutorial or something for the time being? I aim to contribute more actively as soon as I have more time (at the moment I am working 7 days a week...).

amaiya commented 5 years ago

Help is always welcome - feel free as you have time. But, if you're working 7 days a week right now, I'd recommend waiting until things ease up, as your life sounds pretty stressful right now. :) Good luck!

rousseau commented 4 years ago

Hello, I think my question is somewhat related to the topic of this issue. I am trying to use multiple inputs and multiple outputs with ktrain. I get the ValueError "data must be tuple of numpy.ndarrays or an instance of Iterator".

Is there currently a way to use ktrain with multiple inputs / outputs (train_data = ([x1, x2], [y1, y2, y3]))?

amaiya commented 4 years ago

@rousseau Yes, your question is very much related to this same issue.

Support for this was originally planned for an earlier release but was delayed. It is now expected in the next major update (i.e., ktrain-v0.9).

These updates should be able to support both the image segmentation example above and your multi-input/multi-output use case. This issue will be updated when it is released (or when a pre-release is made for testing). Thanks.

rousseau commented 4 years ago

Thank you very much for your quick answer. Can't wait for version 0.9! :-) In the meantime, could the use of a generator work?
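For reference, a generator for the multi-input/multi-output case could look roughly like the sketch below. The function name multi_io_batches and the toy data are hypothetical. Keras itself can train from generators that yield (inputs, targets) tuples indefinitely, given a steps-per-epoch count, but whether ktrain's get_learner accepts a plain generator is not established in this thread.

```python
import math


def multi_io_batches(inputs, outputs, batch_size):
    """Yield ([x1_batch, ...], [y1_batch, ...]) tuples forever."""
    n = len(inputs[0])
    if any(len(a) != n for a in list(inputs) + list(outputs)):
        raise ValueError("all inputs and outputs must have the same number of samples")
    while True:  # loop indefinitely; the caller decides how many steps make an epoch
        for start in range(0, n, batch_size):
            stop = start + batch_size
            yield ([x[start:stop] for x in inputs],
                   [y[start:stop] for y in outputs])


# Toy data: two inputs and three outputs over five samples
x1 = [0, 1, 2, 3, 4]
x2 = [0, 10, 20, 30, 40]
y1 = [0, 1, 0, 1, 0]
y2 = [1, 1, 0, 0, 1]
y3 = [2, 2, 2, 2, 2]

batch_size = 2
gen = multi_io_batches([x1, x2], [y1, y2, y3], batch_size)
steps_per_epoch = math.ceil(len(x1) / batch_size)

first = next(gen)
print(steps_per_epoch, first)
```

Because the generator never stops on its own, the number of steps per epoch has to be passed to the training call separately.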

amaiya commented 4 years ago

@rousseau In ktrain v0.8.2 (just released), there is a minimalistic patch in the form of an abstract class ktrain.Dataset. If ktrain.Dataset is subclassed and the subclass wraps your dataset and implements the required methods, this may satisfy some use cases related to custom datasets and models in the interim. Here is a toy example of how to use it to support arbitrary multi-array inputs:

# imports
import math
import numpy as np
import ktrain
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

# sequence wrapper for custom dataset
class MultiArrayDataset(ktrain.Dataset):
    def __init__(self, x, y, batch_size=32):
        if not isinstance(x, np.ndarray) or not isinstance(y, np.ndarray):
            raise ValueError('x and y must be numpy arrays')
        if x.ndim != 3:
            raise ValueError('x must have 3 dimensions')
        super().__init__(batch_size=batch_size)
        # x has shape (n_inputs, n_samples, n_features)
        self.x, self.y = x, y
        self.indices = np.arange(self.x.shape[1])
        self.n_inputs = x.shape[0]

    def __len__(self):
        # number of batches per epoch
        return math.ceil(self.x.shape[1] / self.batch_size)

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        # one batch per model input, all using the same sample indices
        batch_x = [self.x[i][inds] for i in range(self.n_inputs)]
        batch_y = self.y[inds]
        return tuple(batch_x), batch_y

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

    def nsamples(self):
        return self.x.shape[1]

    def get_y(self):
        return self.y

# data
BS = 6
x1, x2 = np.random.randn(100, 336), np.random.randn(100, 336)
y = np.random.randn(100, 1)
x = np.array([x1, x2])
print(x[1].shape)
myseq = MultiArrayDataset(x, y, batch_size=BS)

# model
input1 = Input(shape=(336,))
input2 = Input(shape=(336,))
merged = Concatenate()([input1, input2])
hidden = Dense(2)(merged)
output = Dense(1)(hidden)
model = Model(inputs=[input1, input2], outputs=output)
model.compile(
    optimizer='adam',
    loss='mean_squared_error',
    metrics=['mae']  # accuracy is not meaningful for a regression target
)

# train with ktrain
learner = ktrain.get_learner(model, train_data=myseq, val_data=myseq, batch_size=BS)
learner.fit_onecycle(0.001, 1)
learner.view_top_losses()

amaiya commented 4 years ago

ktrain v0.9.0 was just released and includes an example notebook showing how to use custom dataset formats for custom models in ktrain. The basic idea is to subclass ktrain.Dataset (as illustrated above) to create a Sequence wrapper, which is what some others have done. Eventually, it would be nice to do this in a more seamless way as discussed previously in this thread, but I'm holding off on this for now, as it's a larger undertaking.
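Applied to the image segmentation case that opened this issue, the same pattern might look like the sketch below. BaseDataset is a hypothetical stand-in for ktrain.Dataset so that the snippet runs without ktrain installed, SegmentationDataset is an illustrative name, and strings stand in for image and mask arrays. The method names mirror the toy example above; the example notebook may do this differently.

```python
import math
import random


class BaseDataset:
    """Hypothetical stand-in for ktrain.Dataset (assumption, not the real class)."""

    def __init__(self, batch_size=32):
        self.batch_size = batch_size


class SegmentationDataset(BaseDataset):
    """Index-based wrapper that keeps images and masks aligned across shuffles."""

    def __init__(self, images, masks, batch_size=32, seed=None):
        if len(images) != len(masks):
            raise ValueError("images and masks must have the same length")
        super().__init__(batch_size=batch_size)
        self.images, self.masks = images, masks
        self.indices = list(range(len(images)))
        self.rng = random.Random(seed)

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.images) / self.batch_size)

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        # the same indices select both images and masks, so pairs never drift apart
        return [self.images[i] for i in inds], [self.masks[i] for i in inds]

    def on_epoch_end(self):
        self.rng.shuffle(self.indices)

    def nsamples(self):
        return len(self.images)

    def get_y(self):
        return self.masks


images = [f"img{i}" for i in range(5)]
masks = [f"mask{i}" for i in range(5)]
ds = SegmentationDataset(images, masks, batch_size=2, seed=0)
print(len(ds), ds[0])
```

Shuffling a shared index list, rather than relying on two generators with matching seeds, is the design choice that keeps each image with its mask.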