ablaom closed this issue 2 years ago
@OkonSamuel
https://github.com/JuliaML/MLDatasets.jl/pull/96 adds data containers that are MLUtils-compatible, including `TableDataset`, which wraps any Tables.jl-compatible table. Tables.jl is just an interface, so we need a wrapper type to dispatch correctly. This way even a tuple that's intended to be a table can be treated as such. Perhaps `TableDataset` is more fundamental and should be moved to MLUtils directly.
@darsnack Thanks! That's good to know.

> Perhaps `TableDataset` is more fundamental and should be moved to MLUtils directly.

Makes sense to me.
What are the downsides of using trait-based dispatch to avoid the wrapping, apart from meaning Tables.jl has to be a dependency of BaseLearn.jl (or whatever replaces that)? That is, assuming the tuples issue is resolved another way.
You mean using a traits package that turns `istable` into a dispatch instead of a run-time check?
Mmm, on further reflection, I guess there's no way to do this without breaking the API for current implementations, right?
Depends on what you're thinking. With SimpleTraits.jl that's not the case, though I haven't checked whether `@traitimpl IsTable{X} <- istable(X)` incurs any overhead.
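For concreteness, a minimal sketch of what such a trait-based path could look like with SimpleTraits.jl (the trait and function names here are hypothetical, not MLUtils source):

```julia
# Hypothetical sketch: dispatch on Tables.istable via SimpleTraits.jl
# instead of branching on it at run time.
using SimpleTraits
using Tables

@traitdef IsTable{X}
@traitimpl IsTable{X} <- Tables.istable(X)

# One method for tables, one for everything else:
@traitfn numobs_sketch(x::::IsTable)    = length(Tables.rows(x))
@traitfn numobs_sketch(x::::(!IsTable)) = length(x)

numobs_sketch((a = [1, 2, 3], b = [4, 5, 6]))  # column table: 3 rows
numobs_sketch([10, 20])                        # plain vector: 2 elements
```

Whether the generated dispatch is truly zero-overhead would still need checking with benchmarks, as noted above.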
Breaking changes to the API are not an issue, but I'd like to avoid needing to call `istable` on everything by default (it's also a bit arbitrary to special-case Tables.jl). And I think having tuples/named tuples follow the current behavior is more desirable. A table might be a named tuple, but not all named tuples are tables. For a Base type, I think it makes more sense for the user to specify a special meaning.
I guess the other way of thinking about this is: what is an example where something is Tables.jl-compatible, but wrapping it in something like `TableDataset` is not desirable or possible? For `TableDataset`, we can make all the standard Tables.jl operations pass through to the underlying table, so that the wrapper is nearly invisible beyond construction.
I think these are fair arguments. And, by the way, I do think the `TableDataset` work is a cool and valuable contribution (which should probably not be hidden away in MLDatasets).
However, I believe the run-of-the-mill statistics user will find wrapping tables off-putting, to say the least.
Therefore, in the "tables-must-be-wrapped" proposal, it becomes the responsibility of every model-providing or ML-tooling package to wrap and unwrap tables, yes? But then such a package needs to test whether or not to do so (unless it only deals with tables), essentially duplicating the trait-based dispatch which I am proposing goes into this base ML API interface in the first place, no?
I'm not sure why `shuffleobs` isn't a varargs function (historical reasons?), but the big issue with auto-wrapping Tables.jl-compatible sources is this: in the absence of an efficient random-access API to rely on, our best option is to rely on users implementing `getobs` on their own dataset types. Always wrapping would mean consistently poor performance.
Good point I hadn't thought of. By always wrapping, you hide the original table type, which might support efficient random access that you could otherwise specialise on. Instead you always have to go through the Tables.jl API for random access, which is generally slow.
> Therefore, in the "tables-must-be-wrapped" proposal, it becomes the responsibility of every model providing package or ML tooling package to wrap and unwrap tables, yes? But then such a package needs to test whether or not to do so (unless it only deals with tables) essentially duplicating the trait-based dispatch which I am proposing goes into this base ML API interface in the first place, no?
Yeah, that's a good point. I will see if there's an overhead to a trait-based approach and file a PR if not. Note that tuples would still be excluded from this path.
> I will see if there's an overhead to a trait-based approach and file a PR if not.
Thanks indeed for considering this. Will this address @ToucheSir 's important point? That is, would `getobs(my_table)` always dispatch to the inefficient tables fallback, even if an efficient `getobs(::MyTable)` is implemented? Perhaps we also need an `isdatacontainer` trait (analogous to `istable`) to resolve this issue. Is there something like this?
I still feel the current interpretation of tuples in MLUtils, as a way to dispatch special behaviour (inherited from MLDataPattern), is intrinsically flawed. This isn't just about tables. Why should we exclude implementations of the interface for objects that happen to be tuples (or any other data type)? What about @ToucheSir 's suggestion to use varargs, which "only" breaks the use of `shuffle` and its cousins in the multi-container case, as far as I can tell? Are there other reasons for not using varargs?
> That is, would `getobs(my_table)` always dispatch the inefficient tables fallback, even if an efficient `getobs(::MyTable)` is implemented?
I will double check, but if `getobs(::MyTable)` exists, then it would take precedence over `getobs(::X) where {X, IsTable{X}}`. There is still the question of where these fast implementations live. Hopefully, DataFrames, etc. will be happy to accept a PR adding these definitions.
> Why should we exclude implementation of the interface to objects that happen to be tuples (or any other data type)?
Unfortunately, Julia forces a one-or-none situation here. Let's say MLUtils had no special path for tuples, but we did have a trait-based path for Tables.jl tables. Then tuples will automatically take that path. If some other Foo.jl interface that's not Tables.jl is also compatible with tuples, then there is no way for tuples to selectively participate as a Tables.jl table or a Foo.jl object.
My point is that treating tuples as tables is the same as treating them the way we do now. The only difference is what interpretation is given favoritism. So the alternative to what we have now is that tuples have no meaning attached to them, or we decide which interpretation is the "golden" one.
> as way to dispatch special behaviour
At face value, it looks like the tuple decision is about dispatching the inputs to a function, but it is really a consequence of the outputs. In Julia, there is a single, canonical way to return multiple values, and that is a tuple.
If I do `sample = getobs(A, i)` for some array `A`, then I can apply the whole MLUtils pipeline (`shuffleobs`, etc.) on `sample`. This is a feature of the package: I can just write out the pipeline, interleaving data processing, transformation, and iteration utilities. For a container that returns multiple parts in each sample, `sample` will be a tuple (e.g. HDF5 datasets where various inputs and labels are stored under different HDF5 directories).
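As a concrete sketch of the tuple semantics being described (using MLUtils' exported `numobs`/`getobs`; the variable names are illustrative):

```julia
# MLUtils treats a tuple of containers as one multi-part container:
# parts are indexed in parallel, and a sample comes back as a tuple.
using MLUtils

X = rand(3, 10)   # 10 observations with 3 features each
y = rand(10)      # 10 labels

numobs((X, y))              # 10; all parts must agree on the count
sample = getobs((X, y), 2)  # a tuple: (X[:, 2], y[2])
```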
The alternative for MLUtils is to create some `GroupedData` type, which would be fine to use for data sources, but cumbersome to enforce at intermediate stages in the pipeline (would every type need to wrap every multi-return value as a `GroupedData`?). This would basically be MLUtils creating its own version of `tuple`.
So, the decision that a group of "things" is a tuple is really the interpretation given by Julia's return behavior. Maybe no package should take ownership of `tuple` in its interface, and MLUtils should do some `GroupedData` thing. But I really do think we are adhering quite closely to the interpretation implied by Julia. IMO, it's Tables.jl that is misusing the notion of a tuple by turning it into a table (of course, I lack experience... maybe there is a really good reason that I'm missing).
^long explanation above being said, I don't want to imply that I feel so strongly about this that MLUtils can't budge...the solution here is going to have to be a compromise somewhere and that can certainly be this package.
Another option is to enforce explicit splatting in the pipeline
> So, the decision that a group of "things" is a tuple is really the interpretation given by Julia's return behavior.
Aha. Yes, I get this now. You want tuples for the pipelining.
> Another option is to enforce explicit splatting in the pipeline
For the user, you mean? Sorry, I'm not actually that familiar with the pipelining here - is there special syntax or are you just composing functions?
> ^long explanation above being said, I don't want to imply that I feel so strongly about this that MLUtils can't budge...the solution here is going to have to be a compromise somewhere and that can certainly be this package.
Thanks for being open about this. I'm kind of playing devil's advocate here, and this is more subtle than I realised at first. It may be that wrapping tables that don't have native implementations of `getobs` is the safer path.
Mmm. What if we put logic in the wrapping operation that detects whether `getobs` is implemented, and if it is, we just return the original (unwrapped) object? This addresses @ToucheSir 's concern, right?
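A rough sketch of such a pass-through constructor (the helper name is hypothetical, and this assumes no blanket `getobs(data, idx)` fallback method exists, since `hasmethod` would match such a fallback too):

```julia
# Hypothetical helper, not part of any package: wrap a table in
# TableDataset only when its type has no getobs method of its own.
using MLUtils            # assumed to provide getobs
# TableDataset assumed available, e.g. from the MLDatasets PR above

function maybe_table_dataset(table)
    if hasmethod(MLUtils.getobs, Tuple{typeof(table), Int})
        return table                 # efficient native method: pass through
    else
        return TableDataset(table)   # fall back to the generic wrapper
    end
end
```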
> For the user, you mean? Sorry, I'm not actually that familiar with the pipelining here - is there special syntax or are you just composing functions?
Yeah, we just use `|>`, so this would mean asking the user to turn a tuple into a grouped container to pass to the next stage in the pipeline.
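For illustration, the kind of `|>` pipeline being referred to might look like this (a sketch using MLUtils names; the data and batch size are arbitrary):

```julia
# Interleaving MLUtils stages with |>: a tuple flows through the
# whole pipeline as a single multi-part container.
using MLUtils

data = (rand(3, 100), rand(100))   # (features, labels)

batches = data |> shuffleobs |> d -> BatchView(d, batchsize = 10)
```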
> Mmm. What if we put logic in the wrapping operation that detects whether getobs is implemented, and if it is, we just return the original (unwrapped) object.
You mean that if someone tries to wrap a `DataFrame` as a `TableDataset`, then we just pass through the `DataFrame` instead of wrapping it (similarly for anything that implements `getobs` directly)? We can certainly make that happen, but this would still require the user or downstream package to perform the wrapping operation? Just trying to understand where auto-wrapping would come into play here.
I think it would be good to support Tables.jl tables without any wrappers. For `numobs` we can change the generic fallback to:
```julia
function numobs(data)
    if Tables.istable(data)
        return length(Tables.rows(data))
    else
        return length(data)
    end
end
```
`getobs` is more problematic, since Tables.jl doesn't expose a `getrow(table, i)` interface. We could just implement the following, which works for `DataFrame`s:
```julia
function getobs(data, idx::Union{Integer, AbstractVector{<:Integer}})
    if Tables.istable(data)
        return data[idx, :]
    else
        return data[idx]
    end
end
```
At least this leaves us in a better position than where we currently stand. The current specializations for tuples and named tuples should stay as they are.
@CarloLucibello (cc edited) Thanks for that. I agree it would be good to avoid wrapping. Summarizing discussions at https://github.com/JuliaData/Tables.jl/issues/276 and on slack, I think there will be no change at Tables.jl to rule out tuple-tables, and I suggest MLUtils.jl continues to treat tuples as multi-containers. On further investigation, ruling out tuple tables for ML is probably fine, at least for now.
I did not realize that MLUtils treats named tuples in a special way. Can you say more about this? That could be a deal-breaker, as named tuples of equi-length vectors (called column tables) are a fundamental native Julia table type used all over the place.
As for your `getobs` proposal: I think we can improve on this, and I suggest we wait and see how https://github.com/JuliaData/Tables.jl/pull/278 resolves. I think there is momentum for progress there. Is there some urgency at your end? If so, I could have a bash at suggesting something better.
It's treated the same as a tuple: a multi-container like `(features = ..., targets = ...)` will call `getobs`, etc. on each entry in the named tuple as if it were a container.
Oh no 😭 .
AFAICT MLUtils and Tables.jl are consistent in treating a named tuple of vectors (or tuples) as a column table; can you elaborate on your concern?
Oh, I see. I guess that means column tables can only be indexed as if they were multi-containers of vectors (or tuples), in the MLUtils sense. Nothing that Tables.jl does about column tables (e.g., some change to the implementation of `Tables.rows`) will have any effect on behaviour, no matter what we implement here for generic tables. But I guess that doesn't matter too much?
It will be possible to solve this issue using `Tables.getrows` from https://github.com/JuliaData/Tables.jl/pull/284 and `DataAPI.nrow`.
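If that PR lands, the generic fallbacks might then be sketched like this (hedged: `Tables.getrows` is taken from the open PR and may change name or signature, and `DataAPI.nrow` only works for table types that implement it):

```julia
# Sketch only: fallbacks assuming the proposed Tables.getrows API
# and DataAPI.nrow are available for the table type at hand.
using Tables, DataAPI

table_numobs(data) =
    Tables.istable(data) ? DataAPI.nrow(data) : length(data)

function table_getobs(data, idx)
    if Tables.istable(data)
        return Tables.getrows(data, idx)  # proposed API from the PR
    else
        return data[idx]
    end
end
```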
I think one could greatly increase buy-in for MLUtils.jl if every Tables.jl-compatible table automatically implemented the "data container" API. To get performance, one would still want to implement it for the concrete table types as well, but having it "just work" for all tables would be nice. I guess, since "table" is itself just an interface, rather than an abstract type, this would need to be implemented as part of the data container API, right? As Tables.jl is very lightweight, I don't see that as a big issue (and I could probably find someone to help with the integration).
Even so, there seems to be a problem implementing the interface for certain tables. MLUtils.jl interprets tuples in a very specific way. For example, `shuffleobs((x1, x2))` treats `x1` and `x2` as separate data containers, which are to be shuffled simultaneously, with the same base observation index shuffle. But some tables are tuples; one can even construct a tuple-table whose elements are themselves tables (of a different type). So is such a tuple a pair of data containers or a single data container? The current API cannot distinguish them.
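A small sketch of the tuple interpretation referred to above (variable names illustrative):

```julia
# Under MLUtils' current semantics, a tuple is a pair of parallel
# containers: one index permutation is applied to both parts.
using MLUtils

x1 = [1, 2, 3, 4]
x2 = [10, 20, 30, 40]

shuffled = shuffleobs((x1, x2))
# Corresponding observations stay paired: wherever 2 lands in the
# first part, 20 lands at the same position in the second part.
```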
I wonder:

- Possibly this discussion is related.
- Tables that are tuples are problematic elsewhere.
@oxinabox @rikhuijzer @darsnack