JuliaML / TableTransforms.jl

Transforms and pipelines with tabular data in Julia
https://juliaml.github.io/TableTransforms.jl/stable
MIT License

Make API more compositional by removing column selection from the various transforms #222

Closed. CameronBieganek closed this issue 10 months ago.

CameronBieganek commented 10 months ago

MLJ provides only linear pipelines (if one prefers to avoid learning networks), and most of its transformers take a features keyword argument that specifies which columns the transformation applies to. TableTransforms.jl, on the other hand, provides parallel branches in pipelines with the ⊔ operator. It might be nice to carry this approach to its logical conclusion, which is to make transforms as atomic and compositional as possible. To make transforms more atomic, you could remove column selection from the various transforms and require an explicit Select or Reject.

An example of this approach in action:

pipe = (Select(:a, :b) → ZScore()) ⊔ (Reject(:a, :b) → Quantile()) 

Obviously this is possible already, but making it mandatory would simplify the constructor signatures for many of the transforms.
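
For readers who want to try the pipeline above, here is a minimal sketch of applying it to a table. The table and its column names are made up for illustration, and the sketch assumes that a plain NamedTuple of vectors is accepted as a table and that Tables.jl is available in the environment.

```julia
using TableTransforms
import Tables

# Hypothetical four-column table; all columns are continuous (Float64).
table = (
    a = [1.0, 2.0, 3.0, 4.0, 5.0],
    b = [2.0, 4.0, 6.0, 8.0, 10.0],
    c = [0.1, 0.7, 0.3, 0.9, 0.5],
    d = [10.0, 30.0, 20.0, 50.0, 40.0],
)

# Two parallel branches: z-score columns a and b, quantile-transform the rest.
pipe = (Select(:a, :b) → ZScore()) ⊔ (Reject(:a, :b) → Quantile())

result = table |> pipe

# Every input column belongs to exactly one branch, so all four should appear in the result.
names = Tables.columnnames(Tables.columns(result))
```

Because the two branches partition the columns, no column is lost here. That would not hold if the branches selected only some of the columns.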

juliohm commented 10 months ago

Disagree. The flexibility we added with column selectors is a key feature to make more powerful pipelines in a few lines of code. We won't drop features just because other frameworks don't support these features.

CameronBieganek commented 10 months ago

> We won't drop features just because other frameworks don't support these features.

That's not at all what I'm saying. I know you like simple and elegant interfaces, so I was proposing a hypothetical change that would make the interface even simpler, without reducing the power of the interface in any way.

To be clear, I was not suggesting removing the vector-of-symbols and regex column selectors. I was only proposing that column selection be removed from ZScore and the other transforms, so that this,

ZScore(:a, :b, r"hello")

is replaced by this:

Select(:a, :b, r"hello") → ZScore()

The functionality is the same, but it massively reduces the API surface of your package. Just think how much simpler the docstrings would be for all your transforms.

There are always design tradeoffs, of course. Many pipelines would be more verbose to express, because you would need to use the parallel branch operator a lot more.

juliohm commented 10 months ago

> There are always design tradeoffs, of course. Many pipelines would be more verbose to express, because you would need to use the parallel branch operator a lot more.

Exactly. The proposal reduces expressivity, and we thought about it a long time ago when we decided to add column selectors to each transform where it makes sense. Practical usage suggests that column selectors are very welcome in transforms as well.

juliohm commented 10 months ago

Also, notice that the two lines you shared are not equivalent:

ZScore(:a, :b, r"hello")

This line will apply the transform to the selected columns, preserving all other columns of the input table.

Select(:a, :b, r"hello") → ZScore()

This line will narrow the analysis to the selected columns, discarding all other columns from the pipeline.
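
The difference between the two lines can be checked on a small table. This is a minimal sketch, assuming a made-up three-column table (a NamedTuple of vectors) and using Tables.jl only to read the column names. It drops the regex selector for brevity.

```julia
using TableTransforms
import Tables

# Hypothetical table: a and b get standardized, c is an extra column.
table = (
    a = [1.0, 2.0, 3.0, 4.0],
    b = [10.0, 20.0, 30.0, 40.0],
    c = [5.0, 6.0, 7.0, 8.0],
)

# Column selection inside the transform: per the comment above, c is preserved.
kept = table |> ZScore(:a, :b)

# Explicit Select first: per the comment above, c is discarded from the pipeline.
narrowed = table |> (Select(:a, :b) → ZScore())

kept_names = Tables.columnnames(Tables.columns(kept))
narrowed_names = Tables.columnnames(Tables.columns(narrowed))
```

Comparing kept_names with narrowed_names shows whether column c made it through each form.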

CameronBieganek commented 10 months ago

Fair enough. Here's a better example. This,

ZScore(:a, :b) → OneHot(:c, :d)

would be rewritten as this:

(Select(:a, :b) → ZScore()) ⊔ (Select(:c, :d) → OneHot())

It's obviously more verbose, but it has a certain conceptual clarity and simplicity that I kind of like. But I could go either way on this particular design decision. I just wanted to float the idea for discussion.