JuliaML / MLUtils.jl

Utilities and abstractions for Machine Learning tasks
MIT License

`eachobs(;batchsize)` vs `BatchView(;batchsize)` vs `DataLoader(;batchsize)` #79

Closed: barucden closed this issue 2 years ago

barucden commented 2 years ago

As the title suggests, I am wondering why there are three ways to iterate over batches:

```julia
using MLUtils
X = rand(4, 100)

it1 = eachobs(X, batchsize=10)
it2 = BatchView(X, batchsize=10)
it3 = DataLoader(X, batchsize=10)

for (x1, x2, x3) in zip(it1, it2, it3)
    @assert size(x1) == size(x2) == size(x3)
    @assert x1 == x2 == x3
end
```

Looking at the implementation, `eachobs` is implemented using `BatchView`, and `DataLoader` uses `eachobs`. So pardon my ignorant question, but why not have just one way of iterating over batches that provides all the features [shuffling, (partial) batching, etc.]?

darsnack commented 2 years ago

By design, the library tries to keep different operations (e.g. shuffling vs. batching) separate but composable. This makes it easy to re-order a pipeline to do exactly what you want it to.
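A minimal sketch of what "separate but composable" could look like in practice, assuming `shuffleobs` and `eachobs` behave as documented in MLUtils. The array `X` and the variable names are made up for illustration:

```julia
using MLUtils

# Illustrative data: 4 features, 10 observations (values 1.0 to 40.0).
X = reshape(collect(1.0:40.0), 4, 10)

# Step 1: shuffle the observations. This is its own operation.
shuffled = shuffleobs(X)

# Step 2: batch whatever comes out of step 1.
batches = [b for b in eachobs(shuffled, batchsize=5)]

# Every value still appears exactly once; only the observation order changed.
seen = sort(vec(reduce(hcat, batches)))
```

Because the two steps are independent, either one can be dropped or swapped out (for example, batching without shuffling) without touching the other.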

`DataLoader` exists only as a convenience for folks coming from other ML frameworks where these would be under one "dataloader" class. Right now, `DataLoader` only shuffles and batches, but eventually it will be a one-stop constructor that combines shuffling, batching, and parallel loading.
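For comparison, a short sketch of the "one-stop" style, assuming the `shuffle` and `batchsize` keywords of `DataLoader`. The data here is random and purely illustrative:

```julia
using MLUtils

X = rand(4, 100)  # illustrative data: 4 features, 100 observations

# Shuffling and batching are requested in a single constructor call.
loader = DataLoader(X, batchsize=10, shuffle=true)

# Collect the size of every batch the loader yields.
sizes = [size(x) for x in loader]
```

This covers roughly the same ground as composing `shuffleobs` with `eachobs`, just behind one familiar entry point.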

`BatchView` is the underlying implementation of batching, but ideally a user should never have to construct one directly. `eachobs` is the user-facing function.
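To illustrate what the lower-level type offers, here is a hedged sketch that constructs `BatchView` directly, assuming its `batchsize` and `partial` keywords. The 95-observation array is chosen only so that the last batch is incomplete:

```julia
using MLUtils

X = rand(4, 95)  # illustrative data: 95 observations do not divide evenly by 10

# partial=true keeps the final, smaller batch.
full = BatchView(X, batchsize=10, partial=true)

# partial=false drops it.
strict = BatchView(X, batchsize=10, partial=false)

# A BatchView can be indexed like a vector of batches.
last_batch = full[length(full)]
```

In everyday code, `eachobs(X, batchsize=10)` should be enough; direct construction is mostly useful when random access to individual batches is needed.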

Right now, the library is a combination of porting code from Flux.Data and MLDataPattern. There is definitely redundancy that needs to be cleaned up.

ToucheSir commented 2 years ago

I would be remiss not to point out that the experimental torchdata library is trying a similar approach wrt composability.

barucden commented 2 years ago

Got it! Thank you for the explanation. I am closing the issue.