JuliaML / MLUtils.jl

Utilities and abstractions for Machine Learning tasks
MIT License
109 stars 22 forks source link

circularity in AbstractDataContainer definitions #57

Closed CarloLucibello closed 2 years ago

CarloLucibello commented 2 years ago

We currently have the following definitions for AbstractDataContainer where getindex falls back to getobs

abstract type AbstractDataContainer end

Base.getindex(x::AbstractDataContainer, i) = getobs(x, i)
Base.length(x::AbstractDataContainer) = numobs(x)
Base.size(x::AbstractDataContainer) = (length(x),)

Base.iterate(x::AbstractDataContainer, state = 1) =
    (state > length(x)) ? nothing : (x[state], state + 1)
Base.lastindex(x::AbstractDataContainer) = length(x)

and on the other end we have the generic fallback for getobs

getobs(x, i) = getindex(x, i)
numobs(x) = length(x)

I find this circularity a bit confusing and think it should be avoided. I suggest we change AbstractDataContainer to

abstract type AbstractDataContainer end

Base.iterate(x::AbstractDataContainer, state = 1) =
    (state > numobs(x)) ? nothing : (getobs(x, state), state + 1)
Base.lastindex(x::AbstractDataContainer) = numobs(x)

Then types inheriting from AbstractDataContainer:

As an addendum, let me remark that with the getobs(x, i) = getindex(x, i) fallback we are basically saying that we consider a Dataset any type implementing getindex, which is something that maybe we should document more. Should defining getindex be the recommended way for defining custom dataset types (even if not subtyping AbstractDataContainer)?

@darsnack related to https://github.com/JuliaML/MLDatasets.jl/pull/96

darsnack commented 2 years ago

I agree its very confusing to have these definitions. The way I make sense of it is that we have an equivalence: getobs <==> getindex. For types that don't implement anything in MLUtils, the fallback getobs(data, idx) = data[idx] ensures the equivalence. For types that do implement our interface, we automatically define getindex(data, i) = getobs(data, i) to reduce boilerplate (and ensure the equivalence holds). So, it is circular but a given type should only ever take one half of the circle path.

The main reason that I added these definitions to AbstractDataContainer is because I became very frustrated using MLDataPattern.jl in the REPL to inspect my data. No one wants to type out getobs(data, i), it's natural habit to just do data[i]. It was annoying to me that this randomly failed for certain containers in MLDataPattern.jl.

All this being said, I really like this proposal:

Should defining getindex be the recommended way for defining custom dataset types

This will ensure that any type has indexing and length working when it behaves like a vector. Thanks to the fallbacks, it should work with MLUtils.jl. And any type that has multiple dimensions can additionally specify a custom getobs. Seems like the cleanest solution to me.

For a lot of cases, this also means being able to work with the package without adding an MLUtils.jl dep.

CarloLucibello commented 2 years ago

So we can remove these lines?

Base.getindex(x::AbstractDataContainer, i) = getobs(x, i)
Base.length(x::AbstractDataContainer) = numobs(x)
Base.size(x::AbstractDataContainer) = (length(x),)
darsnack commented 2 years ago

Yeah let's do it.