Open ExpandingMan opened 2 years ago
@ExpandingMan Thank you for spending some substantial time with MLJ's learning networks. And your feedback is very much appreciated.
You raise a few interesting issues here, and I don't have any magic bullet to resolve them all. For now, let me focus on the problem of combining the output of different transformers.
Actually, unless I misunderstand, this is not really a problem with the learning networks API per se. If, more generally, I know how to horizontally concatenate two objects (for which ordinary `hcat` fails) then, in principle, I can use `node` to overload that functionality for use in a learning network. Indeed, in your example, you discovered a method `(object1, object2) -> hcat(object1, DataFrame(object2))` which works for your particular case, and wrapped that with `node` to get what you needed. But that solution is not at all generic. It would be helpful if there were a version of `hcat(X...)` that just worked for arbitrary tables meeting the Tables.jl interface, of possibly inhomogeneous type (and maybe even matrices and vectors too). Tables.jl does not provide such functionality, but TableOperations.jl might be persuaded to add it. If this existed, it would be a simple matter to overload that method to work on nodes, so you could call it the same way in a learning network as you ordinarily do. However, there are decisions to be made here; in particular, what should the return type be?
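For column tables represented as named tuples of vectors (the form any Tables.jl source can be converted to via `Tables.columntable`), such a generic `hcat` might look roughly like the sketch below. `tablehcat` and the underscore-renaming rule for clashing column names are illustrative choices here, not an existing API:

```julia
# Hypothetical generic horizontal concatenation for column tables
# (NamedTuples of column vectors). A general Tables.jl source would be
# converted first with Tables.columntable.
function tablehcat(tables::NamedTuple...)
    pairs = Pair{Symbol,Any}[]
    seen = Set{Symbol}()
    for t in tables
        for (name, col) in Base.pairs(t)
            while name in seen          # disambiguate clashing column names
                name = Symbol(name, :_)
            end
            push!(seen, name)
            push!(pairs, name => col)
        end
    end
    return (; pairs...)                 # result is again a valid column table
end

t1 = (x = [1, 2, 3], y = ["a", "b", "c"])
t2 = (x = [0.1, 0.2, 0.3],)
tablehcat(t1, t2)  # (x = [1, 2, 3], y = ["a", "b", "c"], x_ = [0.1, 0.2, 0.3])
```

The return type here is simply a named tuple of columns, which sidesteps but does not answer the "what should the return type be" question above.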
> The most promising idea I've been able to come up with, but which I have not worked out in any detail, would be a more powerful and comprehensive alternative to `hcat`, perhaps involving some wrapper around the output... I am having a bit of a hard time coming up with a good example without promoting `machine` to take multiple input arguments though so... maybe if `machine` had multiple inputs only for surrogate models? Which is confusing. Just thinking out loud here.
Yes, I think I am basically agreeing with you here - this is a promising direction. However, I am not quite sure why machines are relevant here, as we are just asking about an ordinary function that has multiple inputs. If, however, you want this "combining function" to have parameters (e.g., output type) then you can define a `Static` model to do this; when you create a machine from such a model, you specify no training arguments (`fit` is a no-op), but your `transform(model, fitresult, X...)` can have as many inputs `X...` as you like. There is an example here.
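A sketch of what such a `Static` combiner might look like. `Hcatter` is an illustrative name, and the MLJ-specific lines are shown as comments because they are an untested sketch of the pattern, not MLJ's actual example:

```julia
# Pure combining step: merge column tables (NamedTuples of vectors);
# on a name clash, later columns overwrite earlier ones.
hcat_columntables(Xs...) = merge(Xs...)

# Untested sketch of wrapping this as an MLJ Static model:
#   using MLJBase, Tables
#   mutable struct Hcatter <: Static end                   # fit is a no-op
#   MLJBase.transform(::Hcatter, ::Any, Xs...) =
#       hcat_columntables(map(Tables.columntable, Xs)...)
#   mach = machine(Hcatter())           # no training arguments
#   W = transform(mach, node1, node2)   # as many inputs as you like

hcat_columntables((x = [1, 2],), (y = [3, 4],))  # (x = [1, 2], y = [3, 4])
```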
I must concede that MLJ's decision to try to work through the tables interface has some performance drawbacks. As you say, you have to think a lot more to avoid unnecessary copying. But even within that framework there is probably room for improvement, and the built-in transformers provided by MLJModels could do with a review (Tables.jl was not very mature when this code was written). I note that in TableTransforms.jl, AutoMLPipeline, and elsewhere, transformers such as `OneHotEncoder` return only that part of the table that is being transformed (that is, the spawned categorical features, without the non-categorical ones) and leave re-combination to a final "hcat" step at the end (some kind of "+" operator is part of the syntax). Maybe that's a better model. I'm copying @OkonSamuel, who has an interest in these kinds of issues.
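Sketched by hand, that style looks roughly like the following (the `x_a`-style names and the `onehot` helper are illustrative; real `OneHotEncoder`-style transformers differ in detail):

```julia
# One-hot encode a single categorical column, returning ONLY the spawned
# features; recombining with the untouched columns is a separate final step.
onehot(name::Symbol, col::AbstractVector) =
    (; (Symbol(name, "_", v) => (col .== v) for v in sort(unique(col)))...)

tbl = (x = ["a", "b", "a"], z = [1.0, 2.0, 3.0])
spawned = onehot(:x, tbl.x)                  # (x_a = ..., x_b = ...)
recombined = merge((z = tbl.z,), spawned)    # the final "hcat" step
```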
Oh, by the way, a PR to clarify the status quo in the documentation would be very welcome.
@ExpandingMan Although it's not part of the public API, TableTransforms has the `tablehcat` method:
```julia
julia> table1
3×2 DataFrame
 Row │ x        z
     │ Char     Float64
─────┼───────────────────
   1 │ 𘂯       0.673471
   2 │ \U3f846  0.360792
   3 │ \Ud50cb  0.68075

julia> table2
(x = [0.41754294943943493, 0.7713462387833814, 0.9189998773436003], y = ['\U84fa1', '\U5e144', '\U872a4'])

julia> TableTransforms.tablehcat([table1, table2])
3×4 DataFrame
 Row │ x        z         x_        y
     │ Char     Float64   Float64   Char
─────┼──────────────────────────────────────
   1 │ 𘂯       0.673471  0.417543  \U84fa1
   2 │ \U3f846  0.360792  0.771346  \U5e144
   3 │ \Ud50cb  0.68075   0.919     \U872a4
```
Thanks for your responses.
> Actually, unless I misunderstand, this is not really a problem with the learning networks API per se.
Right. The current API does indeed work correctly, as my initial example shows; it's more a matter of awkwardness. It took me a little while to work out exactly what to do here (and, for what it's worth, I have a ton of Julia experience). Again, return types are potentially a major part of this issue: as far as I can tell there is no standard for what type is returned by a particular machine component, and figuring it out requires some trial and error with truncated learning networks.
> I must concede that MLJ's decision to try to work through the tables interface has some performance drawbacks. As you say, you have to think a lot more to avoid unnecessary copying.
It seems this is the fundamental issue at the core of the matter. It seems to me that for machine learning what is needed is an object with *n* "features" and *m* "instances" (or "rows"), but that these objects are more general than those allowed by the Tables.jl interface; in particular, a feature can be an entire array, and it will never be efficient to have to extricate these from the matrix which originally contained them.
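As a toy sketch of that kind of object: nothing in principle stops a "column" from holding whole arrays, though most table-centric tooling assumes scalar cells (the feature names here are purely illustrative):

```julia
# A column table whose "image" feature is a vector of entire matrices:
# 3 instances, where each observation of the feature is a full 28×28 array.
X = (
    image = [rand(28, 28) for _ in 1:3],
    label = [0, 1, 0],
)
length(X.image)    # 3 instances
size(X.image[1])   # (28, 28)
```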
I do agree that better ways of concatenating tables seem like the best medium-term solution, and it seems like that's already close.
Thanks again for taking the time to think about this.
> It seems this is the fundamental issue at the core of the matter. It seems to me that for machine learning what is needed is an object with *n* "features" and *m* "instances" (or "rows"), but that these objects are more general than those allowed by the Tables.jl interface; in particular, a feature can be an entire array, and it will never be efficient to have to extricate these from the matrix which originally contained them.
One possibility I've been thinking more about is the `getobs` interface (aka "data container") which the deep learning people are using. An individual observation can be anything, but you can index over observations (which could be individual image files, for example). Still, a lot of users like tables, and so there is some discussion around bringing these things together:

- https://github.com/JuliaML/MLUtils.jl/issues/61
- https://github.com/JuliaML/MLUtils.jl/issues/67
- https://github.com/JuliaData/Tables.jl/pull/278
- https://discourse.julialang.org/t/random-access-to-rows-of-a-table/77386
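A minimal sketch of the data-container idea; with MLUtils.jl one would extend `MLUtils.numobs` and `MLUtils.getobs` rather than define standalone functions like these:

```julia
# A "data container": observations are indexable, but an individual
# observation can be anything -- here, a file path standing in for an image.
struct FileDataset
    paths::Vector{String}
end

numobs(d::FileDataset) = length(d.paths)
getobs(d::FileDataset, i) = d.paths[i]  # a real container might load the file here

d = FileDataset(["a.png", "b.png", "c.png"])
numobs(d)     # 3
getobs(d, 2)  # "b.png"
```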
> as far as I can tell there is no standard for what type is returned by a particular machine component
So generally, transformers in MLJ that train on a table will transform to a table of the same type (assuming that is a sink type). I think TSVDTransformer is a special case: if you train on a table, then transform returns a matrix-table. (If you train on a matrix, which is allowed, then you transform to a matrix, which could be sparse if the training matrix is.) I think the reason for this choice had to do with sparsity: the result of `Tables.materializer(X)(Xmat)` need not be sparse even if `X` is a sparse matrix wrapped as a table and `Xmat` is sparse.
I know it's been a while since I've commented on this, but I think I have run into another case that exposes the need for some kind of new feature here.
Currently `OneHotEncoder` does not return a sparse array, but it should at least have the option. However, once you have transformed something into a sparse matrix, there is no way for models to know to use the entire matrix rather than views of individual columns, which can come at a huge cost in efficiency. We need the ability to, e.g., do a `OneHotEncoder` into a sparse matrix and then feed said matrix into PCA, which has methods for it.
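For illustration, one-hot encoding directly into a sparse matrix is straightforward with the standard library's SparseArrays; `onehot_sparse` below is a sketch, not an MLJ function:

```julia
using SparseArrays

# Build the one-hot matrix sparsely from the start: one column per level,
# a single 1.0 per row, never materializing a dense intermediate.
function onehot_sparse(col::AbstractVector)
    levels = sort(unique(col))
    index = Dict(v => j for (j, v) in enumerate(levels))
    rows = collect(1:length(col))
    cols = [index[v] for v in col]
    sparse(rows, cols, ones(length(col)), length(col), length(levels))
end

M = onehot_sparse(["a", "c", "a", "b"])  # 4×3 sparse matrix, one 1.0 per row
```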
**Is your feature request related to a problem? Please describe.**
With the current interface it can be extremely awkward to combine features which do not naturally fit together in a table, particularly if they must be fed into separate models. For concreteness, take the following example.

Note the presence of two different `FeatureSelector`'s. In many cases, the existence of a `features` keyword in a model makes this process smoother, not only because it eliminates the need for a separate `FeatureSelector`, but more importantly because its outputs are already combined (i.e. it doesn't eliminate the non-selected features).

I find several features of this example problematic:

- We must combine `A` and the rest of the columns into a single dataframe (or other table), i.e. no matter what we have to pretend there is only a single training input. This might not be so bad, but again, it's a little worrying from a performance perspective since the input is necessarily so explicitly tabular.
- `Ξ`.
- `hcat` only works nicely on dataframes, and machines don't appear to be constrained in the exact form of their output. This means that users are required to take apart their would-be model in order to figure out the exact form of each output that must be combined in some way.

**Describe the solution you'd like**
It's of course possible I'm missing simpler options that already exist, though I did spend a significant portion of the day digging into this, so I don't think that's the case.
After some thought, I don't yet see a fantastic solution to this, because most of the solutions I can think of would involve a significant re-work of `Machine`, which is certainly not ideal. Some ideas:

- `machine` can have multiple inputs, but I could not get it to work consistently.
- A `features` keyword. The above example would be a lot simpler if `TSVD` had this (I deliberately chose `TSVD` because it does not). On the other hand, this seems like a fragile solution to me; for one, if my understanding of model implementations is correct, it would really suck to have to try to ensure that they always have certain keywords, but it also doesn't address what is perhaps a deeper issue of data not always being strictly tabular.
- The most promising idea I've been able to come up with, but which I have not worked out in any detail, would be a more powerful and comprehensive alternative to `hcat`, perhaps involving some wrapper around the output... I am having a bit of a hard time coming up with a good example without promoting `machine` to take multiple input arguments though so... maybe if `machine` had multiple inputs only for surrogate models? Which is confusing. Just thinking out loud here.
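For reference, the workaround discussed earlier in the thread boils down to wrapping a plain combining function with `node`. A sketch; the learning-network lines are shown as comments since they require MLJ and DataFrames.jl, and `out1`/`out2` are illustrative node names:

```julia
# Plain combining function: hcat two matrix-like outputs column-wise.
combine(o1::AbstractMatrix, o2::AbstractMatrix) = hcat(o1, o2)

# In a learning network one would instead write something like:
#   combined = node((o1, o2) -> hcat(o1, DataFrame(o2)), out1, out2)
# where out1/out2 are the nodes whose outputs must be joined.

combine([1 2; 3 4], reshape([5, 6], 2, 1))  # 2×3 matrix [1 2 5; 3 4 6]
```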