JuliaML / TableTransforms.jl

Transforms and pipelines with tabular data in Julia
https://juliaml.github.io/TableTransforms.jl/stable
MIT License

Can TableTransforms do transforms on one row at a time? #145

Closed JockLawrie closed 1 year ago

JockLawrie commented 1 year ago

Hi there,

I have a project with this pipeline:

  1. Prepare data in SQL
  2. For each of 100s of statistical models: select a table, apply transforms, convert to a matrix, and use the result for model training
  3. Use the trained models in an agent-based simulation.

The transforms used in Step 2 are also used in Step 3. The difference is that Step 2 applies the transforms to a table, whereas Step 3 applies them to a Dict (representing an agent's state). This state object can be thought of as a row/observation of a table.

With that in mind I've built a transforms package with the following API:

transformtable!(table, t)  # Used in model training
transform!(obs, t)         # Used in simulation
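For concreteness, here is a minimal base-Julia sketch of how a single transform definition could serve both entry points. The Ramp type and the function bodies are illustrative only, not code from my package or from TableTransforms:

```julia
# Hypothetical sketch: one transform definition serving both entry points.

struct Ramp
    column::Symbol
    threshold::Float64
    newcolumn::Symbol
end

# Observation-level: mutate a Dict representing one row / agent state.
function transform!(obs::AbstractDict, t::Ramp)
    obs[t.newcolumn] = max(obs[t.column] - t.threshold, 0.0)
    obs
end

# Table-level: here a "table" is simply a Vector of Dicts; a real package
# would dispatch on the Tables.jl interface instead.
function transformtable!(table::Vector{<:AbstractDict}, ts...)
    for t in ts, obs in table
        transform!(obs, t)
    end
    table
end
```

The point of the sketch is that the table-level function is just the observation-level function mapped over rows, so each transform is written once.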

This gives me everything I need. I'd like to open source this functionality, but I'd rather not clutter the ecosystem if this functionality already exists or is planned. Does it exist in TableTransforms? Or is it planned?

Cheers, Jock

juliohm commented 1 year ago

Thank you Jock for opening the issue before moving forward with the creation of yet another package for table transforms 💯 We are happy to discuss and join efforts whenever possible.

Can you provide a simple concrete example of the functionality you have in mind? Are your transforms similar to the ones we already have implemented here (pre-processing transforms basically) or are you interested in learning models that consume a single observation at a time?

JockLawrie commented 1 year ago

Hi Julio,

All the transforms I'm using are pre-processing, similar to those in TableTransforms. A simple example:

using DataFrames
using RDatasets

iris = dataset("datasets", "iris")

t1 = CreateIntercept(:intercept)  # New variable is called :intercept
t2 = DummyEncode(:Species, '_')   # New binary variables are :Species$(sep)$(level), with sep='_'
# Or OneHotEncode(:Species, '_'; baselevel=:Setosa)
t3 = Ramp(:PetalLength, 3, :PetalLength_3plus)  # ramp(x, a) = max(x - a, 0). New variable is :PetalLength_3plus

transformtable!(iris, t1, t2, t3)  # Applied in sequence

Looks like TableTransforms has the equivalent of transformtable!. Can the equivalent of transform! be derived from the existing machinery in TableTransforms?

On a separate note, I am toying with the idea of post-processing transforms too. They're the same as the pre-processing transforms, just applied at a different point in the pipeline. For example, I'm using a Poisson regression to model counts that start at 1. A pre-processing transform is applied, Translate(:count, -1, :count_minus1), and the model is fitted. A post-processing transform could then be applied to get back to the original scale, Translate(:count_minus1, 1, :count).

If predict(model, x) is a point prediction then the post-processing transform applies. If it is a distribution, then it applies to the output of rand(d) instead, where d = predict(model, x).
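The round trip described above can be sketched in a few lines of base Julia. Translate here is an illustrative type, not TableTransforms.jl API:

```julia
# Hypothetical sketch of the pre/post-processing round trip.

struct Translate
    column::Symbol
    offset::Int
    newcolumn::Symbol
end

function transform!(obs::AbstractDict, t::Translate)
    obs[t.newcolumn] = obs[t.column] + t.offset
    obs
end

state = Dict(:count => 3)
transform!(state, Translate(:count, -1, :count_minus1))  # pre-processing
# ... fit / predict on :count_minus1 ...
transform!(state, Translate(:count_minus1, 1, :count))   # post-processing restores scale
```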

juliohm commented 1 year ago

Can the equivalent of transform! be derived from the existing machinery in TableTransforms?

In your snippet of code I can't find an example with transform!; the example only contains transformtable!. Can you please double check? In any case, I think all we need here is a new Observation type that implements the Tables.jl API: a lazy type that simply wraps a tuple or vector of features and pretends to be a table. Everything should work out of the box.
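As a rough illustration of the "table with a single row" idea, note that a vector containing one NamedTuple is already treated as a row table by the Tables.jl interface, so an explicit wrapper type may not even be needed:

```julia
# A single observation as a one-row table: a Vector containing one
# NamedTuple already satisfies the Tables.jl row-table interface.
obs = (SepalLength = 5.1, PetalLength = 1.4, Species = "setosa")

table = [obs]       # behaves as a table with a single row
row   = table[1]
row.SepalLength     # fields accessed like table columns
```

Whether this is fast enough (i.e. whether wrapping each observation allocates) is a separate question.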

On a separate note, I am toying with the idea of post-processing transforms too. They're the same as the pre-processing transforms, just applied at a different point in the pipeline. For example, I'm using a Poisson regression to model counts that start at 1. A pre-processing transform is applied, Translate(:count, -1, :count_minus1), and the model is fitted. A post-processing transform could then be applied to get back to the original scale, Translate(:count_minus1, 1, :count).

I thought about it in the past. Some sort of transform that represents a learning model. My concern is that pre-processing and post-processing is already a big deal. Adding learning models in the pipeline like MLJ.jl does for example is another layer of complexity that will soon require hyper-parameter tuning, etc. If you are willing to become a long-term contributor and maintainer, then it makes sense to consider this effort as a larger team. What do you think?

JockLawrie commented 1 year ago

transform! is used to transform an agent's state in an agent-based model, as in the example below.

Here the model is a struct that includes a trained underlying model and the list of pre-processing transforms. The fit and predict functions for the model are unrelated to the transforms, and are handled by a separate package. I'm keen to limit the scope of the transforms to simple pre/post processing - totally agree that this is a sufficiently large piece of work.

function simulate_value(model, state)
    for t in model.transforms
        transform!(state, t)
    end
    d = predict(model, state)  # d is a distribution
    state[:newvar] = rand(d)
end

In my project, state is a Dict, but there's no reason transform! can't be applied to any object representing an observation, including observations in tables.

For post-processing in the example above, something like the following could be appended to the function:

for t in model.post_transforms
    transform!(state, t)
end

The key point is that with the transform!/transformtable! API, the transforms used to pre-process training data can also be used in the agent-based model - there's no need to have 2 versions of each transform.

Combined with TableIO.jl for input/output, a modelling package (similar to Models.jl), and some serialisation code, I have a modelling pipeline that can be specified completely in TOML/YAML/JSON. This is handy for my colleagues who aren't Julia programmers, and it lets the agent-based model re-use the transforms specified for model training.

juliohm commented 1 year ago

Can you comment on the proposed solution with a new Observation type that simply wraps a tuple of features and implements the Tables.jl API? Wouldn't it solve the problem? We could preserve the current minimal API for tables and think of an observation as a table with a single row. There shouldn't be runtime overhead.

JockLawrie commented 1 year ago

This should work if the columns of an Observation can be created, modified and deleted. Is this the case? Can you point me to the source for Observation? I can't find it.

juliohm commented 1 year ago

@JockLawrie what I mean is that you can create your own custom Observation type, or we could add it here as a PR if none is available yet. Defining a new wrapper type that implements the Tables.jl API is super straightforward. BTW, it may be sufficient to just use the Tables.table wrapper:

julia> using Tables

julia> Tables.table([1 2 3])
Tables.MatrixTable{Matrix{Int64}} with 1 rows, 3 columns, and schema:
 :Column1  Int64
 :Column2  Int64
 :Column3  Int64

It takes built-in Julia representations and wraps them in a MatrixTable type that behaves like a table with default column names.

juliohm commented 1 year ago

@JockLawrie I am closing this assuming that the issue is solved with a wrapper observation type.

JockLawrie commented 1 year ago

Hi @juliohm,

Agree that an observation type would give correct results, but it wouldn't be fast, because wrapping each observation in a table allocates. The 3 options seem to be:

  1. Accept this speed penalty
  2. Write a method of each transform that dispatches on the observation type
  3. Seek a solution outside this package

Option 1 isn't viable for my purpose. Option 2 would clutter this project. I'll go with Option 3.

Thanks for the discussion, Jock