Closed CarloLucibello closed 2 years ago
I agree it's very confusing to have these definitions. The way I make sense of it is that we have an equivalence: getobs <==> getindex. For types that don't implement anything in MLUtils, the fallback getobs(data, idx) = data[idx] ensures the equivalence. For types that do implement our interface, we automatically define getindex(data, i) = getobs(data, i) to reduce boilerplate (and ensure the equivalence holds). So it is circular, but a given type should only ever take one half of the circle.
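To make the "one half of the circle" point concrete, here is a minimal sketch. It does not use MLUtils.jl itself: the module ToyObs and the type Squares are hypothetical stand-ins, and the definitions only mirror the form of the fallbacks quoted in this thread.

```julia
# ToyObs is a hypothetical stand-in, NOT the real MLUtils.jl module.
module ToyObs

abstract type AbstractDataContainer end

# Generic fallback: anything indexable is treated as a data container.
getobs(data, idx) = data[idx]
numobs(data) = length(data)

# Convenience definitions for subtypes that implement getobs/numobs.
Base.getindex(x::AbstractDataContainer, i) = getobs(x, i)
Base.length(x::AbstractDataContainer) = numobs(x)

end # module

# Hypothetical container that takes the getobs half of the circle.
struct Squares <: ToyObs.AbstractDataContainer
    n::Int
end
ToyObs.getobs(d::Squares, i) = i^2   # more specific than the generic fallback
ToyObs.numobs(d::Squares) = d.n

d = Squares(5)
d[3]                            # getindex -> getobs, gives 9
length(d)                       # length -> numobs, gives 5

# A plain Vector takes the other half: getobs -> getindex.
ToyObs.getobs([10, 20, 30], 2)  # gives 20
```

In this sketch, a subtype that defines neither getobs nor getindex would bounce between the two fallbacks until the stack overflows, which is the confusing part of the circular setup.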
The main reason that I added these definitions to AbstractDataContainer is that I became very frustrated using MLDataPattern.jl in the REPL to inspect my data. No one wants to type out getobs(data, i); it's a natural habit to just do data[i]. It was annoying to me that this randomly failed for certain containers in MLDataPattern.jl.
All this being said, I really like this proposal:
Should defining getindex be the recommended way for defining custom dataset types
This will ensure that any type has indexing and length working when it behaves like a vector. Thanks to the fallbacks, it should work with MLUtils.jl. And any type that has multiple dimensions can additionally specify a custom getobs. Seems like the cleanest solution to me.
For a lot of cases, this also means being able to work with the package without adding an MLUtils.jl dep.
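A rough sketch of what that could look like in practice. The getobs and numobs functions below are local stand-ins that assume the fallback form quoted in this thread (they are not imported from MLUtils.jl), and LabeledPoints and ColumnObs are hypothetical types made up for illustration.

```julia
# Local stand-ins for the generic fallbacks (assumed form, not the real package).
getobs(data, idx) = data[idx]
numobs(data) = length(data)

# Hypothetical vector-like dataset: only Base methods are defined,
# so the type needs no dependency on the data-loading package.
struct LabeledPoints
    xs::Vector{Float64}
    ys::Vector{Int}
end
Base.getindex(d::LabeledPoints, i) = (x = d.xs[i], y = d.ys[i])
Base.length(d::LabeledPoints) = length(d.xs)

# Hypothetical multi-dimensional container: indexing and observation
# access serve different purposes, so getobs/numobs are customized.
struct ColumnObs
    data::Matrix{Float64}
end
Base.getindex(d::ColumnObs, I...) = d.data[I...]  # ordinary array indexing
Base.size(d::ColumnObs) = size(d.data)
getobs(d::ColumnObs, i) = d.data[:, i]            # observation i is column i
numobs(d::ColumnObs) = size(d.data, 2)

pts = LabeledPoints([0.5, 1.5, 2.5], [0, 1, 0])
getobs(pts, 2)   # falls back to pts[2], gives (x = 1.5, y = 1)
numobs(pts)      # falls back to length(pts), gives 3

m = ColumnObs([1.0 2.0 3.0; 4.0 5.0 6.0])
m[2, 3]          # plain indexing, gives 6.0
getobs(m, 3)     # custom method, gives [3.0, 6.0]
```

The vector-like type gets the observation interface for free from the fallbacks, while the matrix-like type opts into custom behavior only where indexing and observations differ.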
So we can remove these lines?
Base.getindex(x::AbstractDataContainer, i) = getobs(x, i)
Base.length(x::AbstractDataContainer) = numobs(x)
Base.size(x::AbstractDataContainer) = (length(x),)
Yeah let's do it.
We currently have definitions for AbstractDataContainer where getindex falls back to getobs, and on the other end we have the generic fallback for getobs. I find this circularity a bit confusing and think it should be avoided. I suggest we change AbstractDataContainer so that types inheriting from it implement:
- getindex, if they want both the "indexing" interface and the "observable" interface.
- getobs, if for some reason they don't want to expose an indexing interface.
- getobs and getindex, if the two interfaces serve different purposes (e.g. as with arrays).
As an addendum, let me remark that with the getobs(x, i) = getindex(x, i) fallback we are basically saying that we consider a Dataset any type implementing getindex, which is something that maybe we should document more. Should defining getindex be the recommended way for defining custom dataset types (even if not subtyping AbstractDataContainer)?
@darsnack related to https://github.com/JuliaML/MLDatasets.jl/pull/96