Closed gkaissis closed 4 years ago
Great question and something that's been on my radar. An elegant solution would probably be to create a flexible layer of abstraction around datasets within ktrain itself to better support such scenarios.
I'll plan on looking into this and may use image segmentation as the "guinea pig" task. Thanks for letting me know.
Since I detect a whiff of FAST.AI in ktrain, I suppose you are referring to something along the lines of a Dataset (or its PyTorch equivalent)? That would allow us to separate the learner from its data.
Yes, exactly.
Generally, I've rightly or wrongly tried to limit the amount of new object abstractions added on top of Keras, but a data abstraction layer would seem to make sense here and help avoid some ugly situations.
I think this is part of a larger software engineering discussion. Abstractions are completely justified if they are clear to the user. This is where I feel fast.ai took a wrong turn. It takes the PyTorch concept of separating the data, the way the data is fed to the model and the model itself and convolutes it into an unintuitive, poorly documented "DataBunch", full of unclear abbreviations inside function calls (freeze_wd? bs? detach? denorm?), opaque, unintuitive docstrings ("Get one batch from the data loader of ds_type. Optionally detach and denorm."), and behind-the-scenes interventions (setting dropout, setting weight decay, choosing training policies etc.), which are neither communicated clearly to the user, nor justified.
Therefore, every time one needs to fulfil some specific goal with the library which is even slightly outside its core scope (e.g. medical image segmentation with single channel images, a 3D CNN etc.), it becomes necessary to start writing custom classes and tinkering around in the engine room to remove the "state of the art choices" made by the library, thus completely reversing the premise of "it just works".
You have gone a very commendable way by just thinly wrapping Keras, one of the most transparent and well-written API specs ever created, and this carries over 1:1 to ktrain. It fundamentally serves to assist the experienced practitioner, looking for something beyond the basic "model.fit" idea to do state-of-the-art deep learning by providing helpful functions sitting atop inbuilt keras features, which is in my opinion a friendlier way to work, even for a beginner coming from Keras directly and looking to "up the game".
Thus, regarding the data abstraction layer, it could disappear into the background, much like the current ArrayLearner does (basically, create an ArrayToArray learner for generative modelling of any kind, segmentation, image restoration etc.), thus keeping the training process centred on the learner object instead of splitting it up between the data and its augmentations, the way it is loaded (in memory vs. streamed), and the processing algorithm.
Great points and I liked the ArrayToArray example.
Some problems might, in fact, require multiple input arrays. A fake news detector, for instance, might require an array for the text, a 1-hot-encoded array for the news source, and a 1-hot-encoded array for the author, each of which is fed to its own embedding layer. These kinds of custom datasets and models should be supported within ktrain in the way you described, as seamlessly and painlessly as possible.
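To make the multi-array idea concrete, here is a minimal sketch of how the three input arrays for such a detector might be prepared with NumPy. The `one_hot` helper, the toy token ids, and the source and author names are all hypothetical and only illustrate the shapes involved; nothing here is part of ktrain.

```python
import numpy as np


def one_hot(labels, vocab):
    """Return a (n_samples, len(vocab)) one-hot array for the given labels."""
    index = {name: i for i, name in enumerate(vocab)}
    encoded = np.zeros((len(labels), len(vocab)), dtype=np.float32)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1.0
    return encoded


# Hypothetical toy data: padded token ids per article, plus source and author.
text_ids = np.array([[4, 8, 15, 0], [16, 23, 42, 7], [3, 1, 0, 0]])
sources = ["site_a", "site_b", "site_a"]
authors = ["alice", "bob", "carol"]

source_1hot = one_hot(sources, vocab=["site_a", "site_b"])
author_1hot = one_hot(authors, vocab=["alice", "bob", "carol"])

# One array per model input, all sharing the same first (sample) dimension.
inputs = [text_ids, source_1hot, author_1hot]
```

The arrays differ in width but agree on the number of samples, which is the property a data abstraction layer would need to preserve when batching.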
@gkaissis Better and more flexible support for alternative data formats (such as those used in your image segmentation example) is expected in ktrain ~v0.6.0~ v0.9.0. As of right now, I anticipate that this can be handled with some refactoring and extensions to the Learner hierarchy in a way that is mostly seamless to users or, as you said, in a way that "disappears in the background".
Sounds amazing. As I am under extreme pressure from work at the moment, would it be OK if I contributed some documentation/docstring typo fixes, maybe a tutorial or something for the time being? Aiming to contribute more actively as soon as I have more time (at the moment I am working 7 days a week...).
Help is always welcome - feel free as you have time. But, if you're working 7 days a week right now, I'd recommend waiting until things ease up, as your life sounds pretty stressful right now. :) Good luck!
Hello, I think my question is somewhat related to the topic of this issue. I try to use multiple inputs and multiple outputs with ktrain. I get the ValueError that "data must be tuple of numpy.ndarrays or an instance of Iterator".
Is there currently a way to use ktrain with multiple inputs / outputs (train_data = ([x1,x2],[y1,y2,y3]))?
@rousseau Yes, your question is very much related to this same issue.
Supporting this was originally intended to be released earlier but was delayed. It is now expected to be released as the next major update (i.e., ktrain-v0.9).
These updates should be able to support both the image segmentation example above and your multi-input/multi-output use case. This issue will be updated when it is released (or when a pre-release is made for testing). Thanks.
Thank you very much for your quick answer. Can't wait for version 0.9! :-) In the meantime, could the use of a generator work?
@rousseau In ktrain v0.8.2 (just released), there is a minimalistic patch in the form of an abstract class ktrain.Dataset. If ktrain.Dataset is subclassed and the subclass wraps your dataset and implements the required methods, this may satisfy some use cases related to custom datasets and models in the interim. Here is a toy example for how to use it to support arbitrary multi-array inputs:
```python
# imports
import math

import numpy as np
import ktrain
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model


# sequence wrapper for custom dataset
class MultiArrayDataset(ktrain.Dataset):
    def __init__(self, x, y, batch_size=32):
        if not isinstance(x, np.ndarray) or not isinstance(y, np.ndarray):
            raise ValueError('x and y must be numpy arrays')
        if len(x.shape) != 3:
            raise ValueError('x must have 3 dimensions')
        super().__init__(batch_size=batch_size)
        # x has shape (n_inputs, n_samples, n_features)
        self.x, self.y = x, y
        self.indices = np.arange(self.x[0].shape[0])
        self.n_inputs = x.shape[0]

    def __len__(self):
        # number of batches per epoch
        return math.ceil(self.x[0].shape[0] / self.batch_size)

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = []
        for i in range(self.n_inputs):
            batch_x.append(self.x[i][inds])
        batch_y = self.y[inds]
        return tuple(batch_x), batch_y

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

    def nsamples(self):
        return self.x.shape[1]

    def get_y(self):
        return self.y


# data
BS = 6
x1, x2 = np.random.randn(100, 336), np.random.randn(100, 336)
y = np.random.randn(100, 1)
x = np.array([x1, x2])
print(x[1].shape)
myseq = MultiArrayDataset(x, y, batch_size=BS)

# model
input1 = Input(shape=(336,))
input2 = Input(shape=(336,))
merged = Concatenate()([input1, input2])
hidden = Dense(2)(merged)
output = Dense(1)(hidden)
model = Model(inputs=[input1, input2], outputs=output)
model.compile(
    optimizer='adam',
    loss='mean_squared_error',
    metrics=['accuracy']
)

# train with ktrain
learner = ktrain.get_learner(model, train_data=myseq, val_data=myseq, batch_size=BS)
learner.fit_onecycle(0.001, 1)
learner.view_top_losses()
```
ktrain v0.9.0 was just released and includes an example notebook showing how to use custom dataset formats for custom models in ktrain. The basic idea is to subclass ktrain.Dataset (as illustrated above) to create a Sequence wrapper, which is what some others have done. Eventually, it would be nice to do this in a more seamless way as discussed previously in this thread, but I am holding off on this for now, as it's a larger undertaking.
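For the multi-input/multi-output case asked about earlier (train_data = ([x1,x2],[y1,y2,y3])), the same wrapper idea might be extended roughly as follows. This is only a sketch of the batching logic: the class name `MultiInputOutputBatches` is made up, and it is written as a plain class so that it runs without ktrain installed. In real use it would presumably subclass ktrain.Dataset like the toy example above, and whether ktrain's utilities accept a list returned from `get_y` is an assumption that has not been checked here.

```python
import math

import numpy as np


class MultiInputOutputBatches:
    """Batches lists of input and output arrays that share a sample axis.

    Assumption: a real version would subclass ktrain.Dataset; this plain
    class only demonstrates how the batches would be assembled.
    """

    def __init__(self, xs, ys, batch_size=32):
        n = xs[0].shape[0]
        if any(a.shape[0] != n for a in list(xs) + list(ys)):
            raise ValueError('all arrays must have the same number of samples')
        self.xs, self.ys = list(xs), list(ys)
        self.batch_size = batch_size
        self.indices = np.arange(n)

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.indices) / self.batch_size)

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = tuple(x[inds] for x in self.xs)
        batch_y = tuple(y[inds] for y in self.ys)
        return batch_x, batch_y

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

    def nsamples(self):
        return len(self.indices)

    def get_y(self):
        return self.ys


# toy data: two inputs and three outputs, 100 samples each
x1, x2 = np.random.randn(100, 336), np.random.randn(100, 336)
y1, y2, y3 = (np.random.randn(100, 1) for _ in range(3))
batches = MultiInputOutputBatches([x1, x2], [y1, y2, y3], batch_size=6)
batch_x, batch_y = batches[0]
```

Keeping inputs and outputs as lists of arrays, rather than stacking them into one array, lets each input or output have its own feature width.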
get_learner fails when the training data is a zip of iterators, such as when it is used for image segmentation tasks (while augmenting images and masks together).
EDIT: It works by hacking together a custom Iterator class, but it's not a particularly elegant hack... image_gen and mask_gen below are keras.preprocessing.image.ImageDataGenerator.flow_from_directory() objects.
Any ideas how we could make this more elegant?
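A minimal sketch of the pairing idea follows. It is not the original hack, whose code is not shown here. The class `PairedBatches` and the stand-in generator `constant_batches` are hypothetical, and plain Python generators are used in place of the flow_from_directory() objects so that the sketch runs with NumPy alone. It only shows how image and mask batches can be drawn in lockstep; getting get_learner to accept such an object would presumably still require wrapping it in an Iterator or ktrain.Dataset subclass.

```python
import numpy as np


class PairedBatches:
    """Iterates two batch iterators in lockstep, yielding (images, masks)."""

    def __init__(self, image_gen, mask_gen):
        self.image_gen = image_gen
        self.mask_gen = mask_gen

    def __iter__(self):
        return self

    def __next__(self):
        # Stops as soon as either underlying iterator is exhausted.
        return next(self.image_gen), next(self.mask_gen)


def constant_batches(value, shape, n_batches):
    """Hypothetical stand-in for a Keras directory iterator."""
    for _ in range(n_batches):
        yield np.full(shape, value, dtype=np.float32)


image_gen = constant_batches(0.5, (4, 64, 64, 1), n_batches=3)
mask_gen = constant_batches(1.0, (4, 64, 64, 1), n_batches=3)
paired = PairedBatches(image_gen, mask_gen)
images, masks = next(paired)
```

With real Keras generators, the two flow_from_directory() calls are commonly given the same `seed` so that each image and its mask receive the same random augmentation.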