Closed CameronBieganek closed 10 months ago
Disagree. The flexibility we added with column selectors is a key feature to make more powerful pipelines in a few lines of code. We won't drop features just because other frameworks don't support these features.
> We won't drop features just because other frameworks don't support these features.
That's not at all what I'm saying. I know you like simple and elegant interfaces, so I was proposing a hypothetical change that would make the interface even more simple---without reducing the power of the interface in any way.
To be clear, I was not suggesting removing the vector-of-symbols and regex column selectors. I was only proposing that column selections be removed from ZScore, etc., so that this,
ZScore(:a, :b, r"hello")
is replaced by this:
Select(:a, :b, r"hello") → ZScore()
The functionality is the same, but it massively reduces the API surface of your package. Just think how much simpler the docstrings would be for all your transforms.
There are always design tradeoffs, of course. Many pipelines would be more verbose to express, because you would need to use the parallel branch operator a lot more.
> There are always design tradeoffs, of course. Many pipelines would be more verbose to express, because you would need to use the parallel branch operator a lot more.
Exactly. The proposal reduces expressivity, and we thought about it a long time ago when we decided to add column selectors to each transform where it makes sense. Practical usage suggests that column selectors are very welcome in transforms as well.
Also, notice that the two lines you shared are not equivalent:
ZScore(:a, :b, r"hello")
This line will apply the transform to the selected columns, preserving all other columns of the input table.
Select(:a, :b, r"hello") → ZScore()
This line will narrow the analysis to the selected columns, discarding all other columns from the pipeline.
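The difference can be checked on a toy table. This is a minimal sketch, assuming a recent TableTransforms.jl in which a transform can be applied with |> and a named tuple of vectors counts as a table. The table, its column names, and the colnames helper are made up for illustration.

```julia
using TableTransforms
import Tables

# Hypothetical toy table: a named tuple of column vectors
table = (a = [1.0, 2.0, 3.0], b = [4.0, 6.0, 8.0], c = [10.0, 20.0, 30.0])

# Column selector inside the transform: :a and :b are standardized,
# :c passes through untouched
kept = table |> ZScore(:a, :b)

# Explicit Select first: only :a and :b reach ZScore, :c is discarded
narrowed = table |> (Select(:a, :b) → ZScore())

# Hypothetical helper that lists the column names of any Tables.jl table
colnames(t) = collect(Tables.columnnames(Tables.columns(t)))

println(colnames(kept))
println(colnames(narrowed))
```

The first result should still contain column c, while the second should contain only a and b.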
Fair enough. Here's a better example. This,
ZScore(:a, :b) → OneHot(:c, :d)
would be rewritten as this:
(Select(:a, :b) → ZScore()) ⊔ (Select(:c, :d) → OneHot())
It's obviously more verbose, but it has a certain conceptual clarity and simplicity that I kind of like. But I could go either way on this particular design decision. I just wanted to float the idea for discussion.
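For comparison, here is a rough sketch of both spellings side by side, under the same assumptions about TableTransforms.jl (|> application, named tuple as table). MinMax stands in for OneHot only to keep the toy table numeric. That swap, the table, and the colnames helper are assumptions for illustration, not part of the proposal.

```julia
using TableTransforms
import Tables

# Hypothetical toy table with four numeric columns
table = (a = [1.0, 2.0, 3.0], b = [4.0, 6.0, 8.0],
         c = [10.0, 20.0, 30.0], d = [0.5, 1.5, 2.5])

# Current style: each transform carries its own column selection
sequential = ZScore(:a, :b) → MinMax(:c, :d)

# Proposed style: explicit Select in each branch, branches joined with ⊔
parallel = (Select(:a, :b) → ZScore()) ⊔ (Select(:c, :d) → MinMax())

out_seq = table |> sequential
out_par = table |> parallel

# Hypothetical helper that lists the column names of any Tables.jl table
colnames(t) = collect(Tables.columnnames(Tables.columns(t)))

println(colnames(out_seq))
println(colnames(out_par))
```

Note that the two forms only agree when every column is claimed by some branch. A column that no Select picks up would survive the sequential form but be dropped by the parallel one.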
MLJ provides only linear pipelines (if one prefers to avoid learning networks), with most transformers having a features keyword argument that specifies which columns the transformations are applied to. TableTransforms.jl, on the other hand, provides parallel branches in pipelines with the ⊔ operator. It might be nice to carry this approach to its logical conclusion, which is to make transforms as atomic and compositional as possible. To make transforms more atomic, you could remove column selection from the various transforms and require an explicit Select or Reject.
An example of this approach in action:
Obviously this is possible already, but making it mandatory would simplify the constructor signatures for many of the transforms.