Open ExpandingMan opened 2 years ago
@ExpandingMan Thank you for spending some substantial time with MLJ's learning networks. And your feedback is very much appreciated.
You raise a few interesting issues here, and I don't have any magic bullet to resolve them all. For now, let me focus on the problem of combining the output of different transformers.
Actually, unless I misunderstand, this is not really a problem with the learning networks API per se. If, more generally, I know how to horizontally concatenate two objects (for which ordinary `hcat` fails) then, in principle, I can use `node` to overload that functionality for use in a learning network. Indeed, in your example, you discovered a method `(object1, object2) -> hcat(object1, DataFrame(object2))` which works for your particular case, and wrapped that with `node` to get what you needed. But that solution is not at all generic. It would be helpful if there were a version of `hcat(X...)` that just worked for arbitrary tables meeting the Tables.jl interface, of possibly inhomogeneous type (and maybe even matrices and vectors too). Tables.jl does not provide such functionality, but TableOperations.jl might be persuaded to add it. If this existed, it would be a simple matter to overload that method to work on nodes, so you could call it the same way in a learning network as you ordinarily do. However, there are decisions to be made here; in particular, what should the return type be?
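For column tables represented as named tuples of vectors (the form any Tables.jl source can be converted to via `Tables.columntable`), such a generic `hcat` might look roughly like the sketch below. `tablehcat` and the underscore-renaming rule for clashing column names are illustrative choices here, not an existing API:

```julia
# Hypothetical generic horizontal concatenation for column tables
# (NamedTuples of column vectors). A general Tables.jl source would be
# converted first with Tables.columntable.
function tablehcat(tables::NamedTuple...)
    pairs = Pair{Symbol,Any}[]
    seen = Set{Symbol}()
    for t in tables
        for (name, col) in Base.pairs(t)
            while name in seen          # disambiguate clashing column names
                name = Symbol(name, :_)
            end
            push!(seen, name)
            push!(pairs, name => col)
        end
    end
    return (; pairs...)                 # result is again a valid column table
end

t1 = (x = [1, 2, 3], y = ["a", "b", "c"])
t2 = (x = [0.1, 0.2, 0.3],)
tablehcat(t1, t2)  # (x = [1, 2, 3], y = ["a", "b", "c"], x_ = [0.1, 0.2, 0.3])
```

The return type here is simply a named tuple of columns, which sidesteps but does not answer the "what should the return type be" question above.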
> The most promising idea I've been able to come up with, but which I have not worked out in any detail, would be a more powerful and comprehensive alternative to `hcat`, perhaps involving some wrapper around the output... I am having a bit of a hard time coming up with a good example without promoting `machine` to take multiple input arguments though so... maybe if `machine` had multiple inputs only for surrogate models? Which is confusing. Just thinking out loud here.
Yes, I think I am basically agreeing with you here - this is a promising direction. However, I am not quite sure why machines are relevant here, as we are just asking about an ordinary function that has multiple inputs. If, however, you want this "combining function" to have parameters (e.g., output type) then you can define a `Static` model to do this; when you create a machine from such a model, you specify no training arguments (`fit` is a no-op), but your `transform(model, fitresult, X...)` can have as many inputs `X...` as you like. There is an example here.
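A sketch of what such a `Static` combiner might look like. `Hcatter` is an illustrative name, and the MLJ-specific lines are shown as comments because they are an untested sketch of the pattern, not MLJ's actual example:

```julia
# Pure combining step: merge column tables (NamedTuples of vectors);
# on a name clash, later columns overwrite earlier ones.
hcat_columntables(Xs...) = merge(Xs...)

# Untested sketch of wrapping this as an MLJ Static model:
#   using MLJBase, Tables
#   mutable struct Hcatter <: Static end                   # fit is a no-op
#   MLJBase.transform(::Hcatter, ::Any, Xs...) =
#       hcat_columntables(map(Tables.columntable, Xs)...)
#   mach = machine(Hcatter())           # no training arguments
#   W = transform(mach, node1, node2)   # as many inputs as you like

hcat_columntables((x = [1, 2],), (y = [3, 4],))  # (x = [1, 2], y = [3, 4])
```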
I must concede that MLJ's decision to try to work through the tables interface has some performance drawbacks. As you say, you have to think a lot more to avoid unnecessary copying. But even within that framework there is probably room for improvement, and the built-in transformers provided by MLJModels could do with a review (Tables.jl was not very mature when this code was written). I note that in TableTransforms.jl, AutoMLPipeline, and elsewhere, transformers such as `OneHotEncoder` return only that part of the table that is being transformed (that is, the spawned categorical features, without the non-categorical ones) and leave re-combination to a final "hcat" step at the end (some kind of "+" operator is part of the syntax). Maybe that's a better model. I'm copying @OkonSamuel, who has an interest in these kinds of issues.
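Sketched by hand, that style looks roughly like the following (the `x_a`-style names and the `onehot` helper are illustrative; real `OneHotEncoder`-style transformers differ in detail):

```julia
# One-hot encode a single categorical column, returning ONLY the spawned
# features; recombining with the untouched columns is a separate final step.
onehot(name::Symbol, col::AbstractVector) =
    (; (Symbol(name, "_", v) => (col .== v) for v in sort(unique(col)))...)

tbl = (x = ["a", "b", "a"], z = [1.0, 2.0, 3.0])
spawned = onehot(:x, tbl.x)                  # (x_a = ..., x_b = ...)
recombined = merge((z = tbl.z,), spawned)    # the final "hcat" step
```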
Oh, by the way, a PR to clarify the status quo in the documentation would be very welcome.
@ExpandingMan Although it's not part of the public API, TableTransforms has the `tablehcat` method:
```julia
julia> table1
3×2 DataFrame
 Row │ x        z
     │ Char     Float64
─────┼───────────────────
   1 │ 𘂯       0.673471
   2 │ \U3f846  0.360792
   3 │ \Ud50cb  0.68075

julia> table2
(x = [0.41754294943943493, 0.7713462387833814, 0.9189998773436003], y = ['\U84fa1', '\U5e144', '\U872a4'])

julia> TableTransforms.tablehcat([table1, table2])
3×4 DataFrame
 Row │ x        z         x_        y
     │ Char     Float64   Float64   Char
─────┼──────────────────────────────────────
   1 │ 𘂯       0.673471  0.417543  \U84fa1
   2 │ \U3f846  0.360792  0.771346  \U5e144
   3 │ \Ud50cb  0.68075   0.919     \U872a4
```
Thanks for your responses.
> Actually, unless I misunderstand, this is not really a problem with the learning networks API per se.
Right. The current API does indeed work correctly, as my initial example shows; it's more a matter of awkwardness. It took me a little while to work out exactly what to do here (and, for what it's worth, I have a ton of Julia experience). Again, return types are potentially a major part of this issue: as far as I can tell there is no standard for what type is returned by a particular machine component, and figuring it out requires some trial and error with truncated learning networks.
> I must concede that MLJ's decision to try to work through the tables interface has some performance drawbacks. As you say, you have to think a lot more to avoid unnecessary copying.
It seems this is the fundamental issue at the core of the matter. It seems to me that for machine learning what is needed is an object with *n* "features" and *m* "instances" (or "rows"), but that these objects are more general than those allowed by the Tables.jl interface; in particular, a feature can be an entire array, and it will never be efficient to have to extricate these from the matrix which originally contained them.
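As a toy sketch of that kind of object: nothing in principle stops a "column" from holding whole arrays, though most table-centric tooling assumes scalar cells (the feature names here are purely illustrative):

```julia
# A column table whose "image" feature is a vector of entire matrices:
# 3 instances, where each observation of the feature is a full 28×28 array.
X = (
    image = [rand(28, 28) for _ in 1:3],
    label = [0, 1, 0],
)
length(X.image)    # 3 instances
size(X.image[1])   # (28, 28)
```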
I do agree that better ways of concatenating tables seem like the best medium-term solution, and it seems like that's already close.
Thanks again for taking the time to think about this.
> It seems this is the fundamental issue at the core of the matter. It seems to me that for machine learning what is needed is an object with *n* "features" and *m* "instances" (or "rows"), but that these objects are more general than those allowed by the Tables.jl interface; in particular, a feature can be an entire array, and it will never be efficient to have to extricate these from the matrix which originally contained them.
One possibility I've been thinking more about is the `getobs` interface (aka "data container") which the deep learning people are using. An individual observation can be anything, but you can index over observations (which could be individual image files, for example). Still, a lot of users like tables, and so there is some discussion around bringing these things together:

- https://github.com/JuliaML/MLUtils.jl/issues/61
- https://github.com/JuliaML/MLUtils.jl/issues/67
- https://github.com/JuliaData/Tables.jl/pull/278
- https://discourse.julialang.org/t/random-access-to-rows-of-a-table/77386
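A minimal sketch of the data-container idea; with MLUtils.jl one would extend `MLUtils.numobs` and `MLUtils.getobs` rather than define standalone functions like these:

```julia
# A "data container": observations are indexable, but an individual
# observation can be anything -- here, a file path standing in for an image.
struct FileDataset
    paths::Vector{String}
end

numobs(d::FileDataset) = length(d.paths)
getobs(d::FileDataset, i) = d.paths[i]  # a real container might load the file here

d = FileDataset(["a.png", "b.png", "c.png"])
numobs(d)     # 3
getobs(d, 2)  # "b.png"
```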
> as far as I can tell there is no standard for what type is returned by a particular machine component
So generally, transformers in MLJ that train on a table will transform to a table of the same type (assuming that is a sink type). I think TSVDTransformer is a special case: if you train on a table, then transform returns a matrix-table. (If you train on a matrix, which is allowed, then you transform to a matrix, which could be sparse if the training matrix is.) I think the reason for this choice had to do with sparsity: the result of `Tables.materializer(X)(Xmat)` need not be sparse even if `X` is a sparse matrix wrapped as a table and `Xmat` is sparse.
I know it's been a while since I've commented on this, but I think I have run into another case that exposes the need for some kind of new feature here.
Currently `OneHotEncoder` does not return a sparse array, but it should at least have the option. However, once you have transformed something into a sparse matrix, there is no way for models to know to use the entire matrix rather than views of individual columns, which can come at a huge cost in efficiency. We need the ability to, e.g., do a `OneHotEncoder` into a sparse matrix and then feed said matrix into PCA, which has methods for it.
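For illustration, one-hot encoding directly into a sparse matrix is straightforward with the standard library's SparseArrays; `onehot_sparse` below is a sketch, not an MLJ function:

```julia
using SparseArrays

# Build the one-hot matrix sparsely from the start: one column per level,
# a single 1.0 per row, never materializing a dense intermediate.
function onehot_sparse(col::AbstractVector)
    levels = sort(unique(col))
    index = Dict(v => j for (j, v) in enumerate(levels))
    rows = collect(1:length(col))
    cols = [index[v] for v in col]
    sparse(rows, cols, ones(length(col)), length(col), length(levels))
end

M = onehot_sparse(["a", "c", "a", "b"])  # 4×3 sparse matrix, one 1.0 per row
```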
**Is your feature request related to a problem? Please describe.**
With the current interface it can be extremely awkward to combine features which do not naturally fit together in a table, particularly if they must be fed into separate models. For concreteness, take the following example.

Note the presence of two different `FeatureSelector`'s. In many cases, the existence of a `features` keyword in a model makes this process smoother, not only because it eliminates the need for a separate `FeatureSelector`, but more importantly because its outputs are already combined (i.e. it doesn't eliminate the non-selected features).

I find several features of this example problematic:

- We must combine `A` and the rest of the columns into a single dataframe (or other table), i.e. no matter what we have to pretend there is only a single training input. This might not be so bad, but again, it's a little worrying from a performance perspective since the input is necessarily so explicitly tabular.
- `Ξ`.
- `hcat` only works nicely on dataframes, and machines don't appear to be constrained in the exact form of their output. This means that users are required to take apart their would-be model in order to figure out the exact form of each output that must be combined in some way.

**Describe the solution you'd like**
It's of course possible I'm missing simpler options that already exist, though I did spend a significant portion of the day digging into this, so I don't think that's the case.
After some thought, I don't yet see a fantastic solution to this, because most of the solutions I can think of would involve a significant re-work of `Machine`, which is certainly not ideal. Some ideas:

- `machine` can have multiple inputs, but I could not get it to work consistently.
- A `features` keyword. The above example would be a lot simpler if `TSVD` had this (I deliberately chose `TSVD` because it does not). On the other hand, this seems like a fragile solution to me; for one, if my understanding of model implementations is correct, it would really suck to have to try to ensure that they always have certain keywords, but it also doesn't address what is perhaps a deeper issue of data not always being strictly tabular.
- The most promising idea I've been able to come up with, but which I have not worked out in any detail, would be a more powerful and comprehensive alternative to `hcat`, perhaps involving some wrapper around the output... I am having a bit of a hard time coming up with a good example without promoting `machine` to take multiple input arguments though so... maybe if `machine` had multiple inputs only for surrogate models? Which is confusing. Just thinking out loud here.
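For reference, the workaround discussed earlier in the thread boils down to wrapping a plain combining function with `node`. A sketch; the learning-network lines are shown as comments since they require MLJ and DataFrames.jl, and `out1`/`out2` are illustrative node names:

```julia
# Plain combining function: hcat two matrix-like outputs column-wise.
combine(o1::AbstractMatrix, o2::AbstractMatrix) = hcat(o1, o2)

# In a learning network one would instead write something like:
#   combined = node((o1, o2) -> hcat(o1, DataFrame(o2)), out1, out2)
# where out1/out2 are the nodes whose outputs must be joined.

combine([1 2; 3 4], reshape([5, 6], 2, 1))  # 2×3 matrix [1 2 5; 3 4 6]
```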