Open ablaom opened 3 years ago
I'm inclined to go with option 2, which is more user-friendly. The other issue ought to be solved on the tables interface side, in my opinion.
What about the opposite--a way to limit a multivariate transform to a subset of columns? This seems more general, since all multivariate transforms can be used as a univariate transform, but not vice-versa (e.g. PCA).
I'm not sure how I'd go about implementing this, though (given only MLJ primitives). Is there a package or interface used by MLJ for messing about with tables?
| Is there a package or interface used by MLJ for messing about with tables?
In MLJ a "table" is anything implementing the Tables.jl interface and satisfying `Tables.istable(X) == true`. Unfortunately, the generality of Tables.jl makes it less than ideal for our purposes, as it aims to include out-of-memory tables and tables with an unknown number of rows (e.g., lazily iterated). The maintainers are very thoughtful, but reluctant to add any complexity. The API has no method to mutate columns in-place. There is now a `Tables.subset` method for random access of rows, but this took a very long time to get. Maybe a new specialised pkg is needed, but no-one has ventured to write one.
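To make the "no in-place mutation" point concrete, here is a minimal sketch using only Tables.jl accessors. It assumes the table is a `NamedTuple` of vectors (one of the simplest Tables.jl-compatible tables); the column names and the unit conversion are made up for illustration. "Changing" one column means assembling a new table from the old columns:

```julia
using Tables

# A NamedTuple of vectors is a valid Tables.jl column table.
X = (height = [1.85, 1.67, 1.50], age = [23, 45, 31])

Tables.istable(X)  # true, so MLJ would treat X as a table

# Generic column access goes through Tables.columns / Tables.getcolumn.
cols = Tables.columns(X)
colnames = Tables.columnnames(cols)

# No in-place mutation in the API: build a new tuple of columns instead.
newcols = map(colnames) do name
    col = Tables.getcolumn(cols, name)
    name == :height ? 100 .* col : col   # metres to centimetres (illustrative)
end

# Reassemble as a new column table; X itself is unchanged.
Xnew = NamedTuple{colnames}(newcols)
```

Note that the result here is always a `NamedTuple`, regardless of the input table type; recovering the original type (where that is possible) would need something like `Tables.materializer`.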
The method `MLJModelInterface.nrows(X)` will get you the number of rows, basically by materialising an entire column if necessary (see also "Aside" below).
MLJModelInterface has methods `selectrows` and `selectcols`, based on Tables.jl primitives, but I'd now recommend `Tables.subset` over `selectrows`. I expect TableTransforms.jl is your best bet for general table manipulations, although it's probably too heavy a dep for MLJModels.jl.
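For reference, a small sketch of row and column selection using Tables.jl alone. `Tables.subset` is the real Tables.jl method mentioned above (it needs a reasonably recent Tables.jl); `pick_columns` is a hypothetical helper written for this example, not the MLJModelInterface `selectcols` implementation, and the data is invented:

```julia
using Tables

X = (height = [1.85, 1.67, 1.50], age = [23, 45, 31], weight = [80.0, 62.5, 55.0])

# Random access to rows. The return type of Tables.subset depends on the
# input table, so materialise it as a column table before inspecting it.
rows13 = Tables.columntable(Tables.subset(X, [1, 3]))

# Hypothetical helper: select columns by name and return a NamedTuple table.
function pick_columns(X, wanted)
    cols = Tables.columns(X)
    selected = Tuple(Tables.getcolumn(cols, name) for name in wanted)
    return NamedTuple{Tuple(wanted)}(selected)
end

sub = pick_columns(X, [:age, :height])
```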
A package called TableOperations.jl provided some useful tools for tables, but is no longer maintained, as far as I can tell.
Aside: Another interface, `DataAPI`, provides the method `DataAPI.nrow` (not `DataAPI.nrows`), which is implemented by DataFrames.jl and, more recently, by some of the table types actually owned by Tables.jl, such as a matrix table wrapper. I'd consider restricting MLJ's definition of table to require implementation of `DataAPI.nrow`, but that would be breaking. One reason for doing so is that tables with this feature also fit into the MLUtils.jl API.
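A rough sketch of how the two routes to a row count could be combined. `row_count` is a hypothetical helper for illustration only, not how `MLJModelInterface.nrows` is actually implemented: it uses `DataAPI.nrow` when the table type provides a method, and otherwise falls back to materialising the columns and measuring the first one.

```julia
using Tables, DataAPI

# Hypothetical helper: prefer DataAPI.nrow, else materialise a column.
function row_count(X)
    # Use the cheap route if the table type implements DataAPI.nrow.
    applicable(DataAPI.nrow, X) && return DataAPI.nrow(X)
    # Fallback: may materialise the whole table for row-oriented sources.
    cols = Tables.columns(X)
    colnames = Tables.columnnames(cols)
    isempty(colnames) && return 0
    return length(Tables.getcolumn(cols, first(colnames)))
end

coltable = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])   # column table
rowtable = [(a = 1, b = 2.0), (a = 3, b = 4.0)]   # row table
mattable = Tables.table([1 2; 3 4; 5 6; 7 8])     # matrix table wrapper
```

The fallback is exactly the cost being discussed: for a lazily iterated table it forces a full pass over the data just to learn its length.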
It has been proposed on Slack that it be possible to have a single table transformer that transforms individual columns according to user-specified univariate transformations. This sounds like a good idea, which would also force some uniformity that's a little bit lacking in the current collection of table transformers.
In the most general case I can imagine implementing, the univariate transformer that applies to a particular column is defined by a function that operates on both the `name` and `scitype` of the column (as encoded in the table `schema`). This has the disadvantage that the user must specify a function with two arguments, or interact through some other complicated interface.

The alternative would be a compositional approach. Each tabular transformer carries out only a single univariate transformation, applying to all specified `names` and `scitypes` (or "not"-names and "not"-scitypes, through an `ignore` Boolean parameter), which would cover all conceivable use-cases (columns not referred to are left alone). However, as we are currently locked into Tables.jl (whose tables are non-mutable in general), we get a lot more copying of data.

Thoughts anyone?
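To make the compositional option easier to discuss, here is a minimal sketch written as a plain function rather than an MLJ model. Everything about it is an assumption for illustration: the name `transform_columns`, its keyword arguments, and the choice to return a `NamedTuple` are hypothetical and not an existing MLJ API. It relies on `ScientificTypes.schema` to read column names and element scitypes.

```julia
using Tables, ScientificTypes

# Hypothetical helper: apply one univariate function `f` to every column
# matched by `names` or `scitypes`; with `ignore=true` the selection is
# inverted. Columns that are not selected are passed through untouched.
function transform_columns(f, X; names=(), scitypes=(), ignore=false)
    sch = ScientificTypes.schema(X)
    cols = Tables.columns(X)
    newcols = map(sch.names, sch.scitypes) do name, S
        col = Tables.getcolumn(cols, name)
        matched = name in names || any(T -> S <: T, scitypes)
        xor(matched, ignore) ? f(col) : col
    end
    # A new table is built at every stage; nothing is mutated in place.
    return NamedTuple{sch.names}(newcols)
end

X = (height = [1.85, 1.67, 1.50], age = [23, 45, 31], score = [0.1, 0.5, 0.9])

# Stage 1: shift all Continuous columns so that their minimum is zero.
Y = transform_columns(v -> v .- minimum(v), X; scitypes=(Continuous,))

# Stage 2: convert the column named :age to floating point.
Z = transform_columns(v -> float.(v), Y; names=(:age,))

# Inverted selection: double everything except :age.
W = transform_columns(v -> 2 .* v, X; names=(:age,), ignore=true)
```

Chaining stages like this covers mixed use-cases, but each stage produces a fresh table, which is the copying cost mentioned above.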