JuliaML / MLUtils.jl

Utilities and abstractions for Machine Learning tasks
MIT License
107 stars 20 forks source link

`splitobs` and `DataLoader` make views, but say they require only `getobs` #167

Open mcabbott opened 11 months ago

mcabbott commented 11 months ago

It's surprising that splitobs and DataLoader make views, when they mention only getobs in their docstrings, which does not:

help?> splitobs
  splitobs(data; at, shuffle=false) -> Tuple

  Split the data into multiple subsets proportional to the value(s) of at.

  If shuffle=true, randomly permute the observations before splitting.

  Supports any datatype implementing the numobs and getobs interfaces.
[...]

julia> splitobs(ones(1,5); at=0.2)[1]
1×1 view(::Matrix{Float64}, :, 1:1) with eltype Float64:
 1.0

julia> getobs(ones(1,5), 1:2)  # what it says it does
1×2 Matrix{Float64}:
 1.0  1.0

julia> obsview(ones(1,5), 1:2)  # what it actually uses
1×2 view(::Matrix{Float64}, :, 1:2) with eltype Float64:
 1.0  1.0

help?> DataLoader
search: DataLoader DataType

  DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])

  An object that iterates over mini-batches of data, each mini-batch containing batchsize
  observations (except possibly the last one).

  Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data
  object that implements the numobs and getobs methods.

[...]

This means that they do not preserve OneHotArrays, which is https://github.com/FluxML/OneHotArrays.jl/issues/40 .

But more generally, perhaps copies are just safer?

ToucheSir commented 11 months ago

Looking back through the blame, I'm guessing the idea was to not materialize large datasets. The easiest path would be to change the docs, but that doesn't solve the OneHotArrays issue. Maybe https://github.com/JuliaML/MLUtils.jl/blob/af7ebeacdf5a5e6e94e0db55a5dc24835ef260e0/src/obsview.jl#L224-L227 is too general and obsview should be returning ObsView for more types?

mcabbott commented 11 months ago

Yes I'm sure not materialising was the goal. However, some views are more useful than others. I wonder if the default should be something like splitobs(ones(1,10); at=0.5) makes two contiguous views, which are almost as good as Arrays, but splitobs(ones(1,10); at=0.5, shuffle=true) makes copies?

The OneHotArrays issue could be solved on that side, by changing what view does. Or more narrowly by changing what obsview does.

ToucheSir commented 11 months ago

Regardless of solution, the docstring should 100% be updated to mention some type of view is returned with ObsView being the detault.

The OneHotArrays issue could be solved on that side, by changing what view does

How straightforward do you think this would be? Since https://github.com/FluxML/OneHotArrays.jl/issues/40 is mostly about performance, another idea would be adding a kwarg which controls whether a view or copy is returned.