Clarify Data Iterator behaviors

flukeskywalker commented 8 years ago

Behavior such as nesting depends on the type of data iterator. Currently, we can nest all available iterators as long as the innermost ones are Online, Minibatches or Undivided. This works because all iterators operate on NumPy arrays.

Once we have database iterators, things may change: the data attribute of the iterator may no longer contain named NumPy arrays as it currently does, since the entire dataset cannot be held in memory. For such settings, we may choose to generalize iterators (so that they change behavior based on data type), or implement a separate set of iterators which cannot be mixed. For example, we can have NumpyDataIterators and DatabaseIterators.

The best way forward might become clearer once we start working with larger datasets stored in databases/files.

Qwlouse commented 8 years ago

I've done some work on refactoring the data iterators and clarifying their behaviour. I've replaced the data attribute with:

data_shapes: a dictionary mapping names to shapes
length: how many mini-batches this iterator will produce per epoch

That should be easy to provide for all data iterators and should work for both NumPy and database use cases. The only problem is that it broke some of the validation code that relied on checking whether the data are instances of NumPy arrays. We could do these checks while the iterators are running, but I don't think this is a big issue.
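To make the two attributes concrete, here is a minimal sketch of an iterator that exposes data_shapes and length in place of data. It is not the refactored brainstorm code: the class name MinibatchSketch, its constructor arguments, and the assumption that the first axis indexes samples are all hypothetical and chosen only for illustration.

```python
import numpy as np


class MinibatchSketch:
    """Hypothetical iterator exposing data_shapes and length (not brainstorm's API)."""

    def __init__(self, batch_size, **named_arrays):
        self.batch_size = batch_size
        self._arrays = named_arrays
        # Assumption: the first axis of every array indexes samples.
        self._num_samples = len(next(iter(named_arrays.values())))
        # data_shapes: a dictionary mapping names to shapes.
        self.data_shapes = {name: a.shape for name, a in named_arrays.items()}
        # length: number of mini-batches per epoch (ceiling division).
        self.length = -(-self._num_samples // batch_size)

    def __iter__(self):
        # Yield one dictionary of named slices per mini-batch.
        for start in range(0, self._num_samples, self.batch_size):
            stop = start + self.batch_size
            yield {name: a[start:stop] for name, a in self._arrays.items()}


it = MinibatchSketch(4, inputs=np.zeros((10, 3)), targets=np.zeros((10, 1)))
print(it.data_shapes)
print(it.length)
```

A database-backed iterator could presumably fill in the same two attributes from table metadata without loading any rows, which is why callers that only read data_shapes and length would not need to know where the data lives.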

@flukeskywalker: If you agree I'll merge it to master and we can close this issue.

flukeskywalker commented 8 years ago

Looks good. We can close this for now. Checking at run time that the data type is appropriate might be a graceful option for iterators that only work with NumPy arrays; otherwise, calling them may crash and burn in sometimes confusing ways. If we completely ban other array types later, we can remove the check.
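A run-time check of this kind could look roughly like the sketch below. The helper names check_named_arrays and checked_batches are hypothetical and do not exist in brainstorm; the sketch only assumes that each batch is a dictionary mapping names to arrays.

```python
import numpy as np


def check_named_arrays(data):
    """Raise a clear TypeError if any named entry is not a NumPy array."""
    for name, value in data.items():
        if not isinstance(value, np.ndarray):
            raise TypeError(
                "data '{}' must be a numpy.ndarray, got {}".format(
                    name, type(value).__name__))


def checked_batches(batches):
    """Wrap an iterable of batches and validate each one as it is produced."""
    for batch in batches:
        check_named_arrays(batch)
        yield batch
```

Because the check runs inside the generator, a bad batch fails at the point where it is produced, with the offending name in the message, rather than somewhere deeper in the consumer.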

IDSIA / brainstorm

Clarify Data Iterator behaviors #35