JuliaML / TableTransforms.jl

Transforms and pipelines with tabular data in Julia
https://juliaml.github.io/TableTransforms.jl/stable
MIT License

Pipes with parallel branches can return invalid tables #223

Closed CameronBieganek closed 9 months ago

CameronBieganek commented 10 months ago
using TableTransforms

t = (a=1:4, b=5:8)

left = Select(:a) → Filter(row -> row.a < 4)
right = Select(:b) → Filter(row -> row.b < 7)
transform = left ⊔ right;
julia> transform(t)
(a = [1, 2, 3], b = [5, 6])

This should probably throw an error. But even if it did throw an error, it would illustrate a degree of freedom in the API design that allows the user to shoot themselves in the foot.

(The API design solution in my mind is to remove table queries from TableTransforms.jl and stick to ML and statistics transforms that preserve the number of rows of the input table.)

CameronBieganek commented 10 months ago

At least you get an error if you use a table type that actually enforces table-ness:

julia> df = DataFrame(a=1:4, b=5:8);

julia> transform(df)
ERROR: DimensionMismatch: column :a has length 3 and column :b has length 2

So the lack of an error for named tuples has more to do with the fuzziness of the NamedTuple table implementation in Tables.jl.

But the API design issue that I mentioned still stands.

juliohm commented 10 months ago

I think that targeting corner cases in a dynamic language that doesn't support compile time checks is a bad goal. We definitely want to educate users on how to use these transforms well and safely, but we don't want to remove features just to make things that are currently possible, impossible.

ParadaCarleton commented 9 months ago

I think that targeting corner cases in a dynamic language that doesn't support compile time checks is a bad goal. We definitely want to educate users on how to use these transforms well and safely, but we don't want to remove features just to make things that are currently possible, impossible.

I think you might be misunderstanding the MWE above. This isn't a feature, it's a bug; there's no situation where you want to pass an invalid table as an input to TableTransforms.jl. This bug also isn't related to a lack of compile-time checks; checking output sizes after a Filter has to happen at runtime.

Right now, TableTransforms.jl doesn't throw an error when it produces an impossible/invalid output, or when it's given an input that isn't a table (i.e. when the assumptions of TableTransforms.jl are violated). If that output gets passed through a pipeline, it's going to cause lots of problems downstream and quite likely correctness issues.

juliohm commented 9 months ago

Sure, we can throw an error after the parallel transform has finished its job. Feel free to submit a PR with the proposed error message.

CameronBieganek commented 9 months ago

This particular MWE could be solved upstream by having the named-tuple-of-vectors materializer check that input column lengths are equal, which it currently does not:

julia> nt = (a=1:3, b=1:2)
(a = 1:3, b = 1:2)

julia> Tables.materializer(nt)(nt)
(a = 1:3, b = 1:2)

Then again, the correct fix probably belongs not in Tables.materializer itself but in the Tables.columns method for named tuples of vectors. There are two possible fixes for Tables.columns.

Currently Tables.jl is very lax with its handling of vectors of named tuples and named tuples of vectors.
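For illustration, a length check of the kind being discussed might look like this (a hypothetical helper, not part of Tables.jl):

# Hypothetical helper illustrating the missing check for named tuples of vectors.
function check_column_lengths(nt::NamedTuple)
    lens = map(length, values(nt))
    allequal(lens) || throw(DimensionMismatch(
        "columns have different lengths: $(NamedTuple{keys(nt)}(lens))"))
    return nt
end

julia> check_column_lengths((a=1:3, b=1:2))
ERROR: DimensionMismatch: columns have different lengths: (a = 3, b = 2)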

CameronBieganek commented 9 months ago

Let me clarify the underlying design issue. Running filters in parallel branches of a pipeline and then horizontally concatenating the results almost never makes sense. Can anyone think of a use case where this would actually make sense? Normally the only meaningful way to horizontally combine tables is with a JOIN.

juliohm commented 9 months ago

You're proposing that we should add an assertion to our parallel transform to inhibit parallel branches that contain Filter?

CameronBieganek commented 9 months ago

I am going to decline to make a specific proposal. :)

But if you did something like that, it would also have to handle DropMissing and DropExtrema.

juliohm commented 9 months ago

The only sensible thing we can do here is to add a check after the parallel branches are completed, to make sure that all tables have the same number of rows. We will add it to our roadmap.
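A minimal sketch of such a check, assuming the branch results are Tables.jl-compatible column tables (this is not the actual implementation):

using Tables

# Verify that all branch outputs have the same number of rows before hcat.
function check_branch_rows(branches...)
    nrows = map(branches) do t
        cols = Tables.columns(t)
        names = Tables.columnnames(cols)
        isempty(names) ? 0 : length(Tables.getcolumn(cols, first(names)))
    end
    allequal(nrows) || throw(DimensionMismatch(
        "parallel branches returned tables with $nrows rows"))
    return branches
end

julia> check_branch_rows((a=[1, 2, 3],), (b=[5, 6],))
ERROR: DimensionMismatch: parallel branches returned tables with (3, 2) rows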

CameronBieganek commented 9 months ago

There are lots of ways to corrupt your data with parallel pipelines. Here are a couple more:

1)

t = (a=[1, 2, 3], b=[3, 2, 1])

left = Select(:a)
right = Select(:b) → Sort(:b)
transform = left ⊔ right
julia> transform(t)
(a = [1, 2, 3], b = [1, 2, 3])

2)

t = (a=1:4, b=1:4)

left = Select(:a) → Sample(3)
right = Select(:b) → Sample(3)
transform = left ⊔ right
julia> transform(t)
(a = [2, 1, 3], b = [3, 2, 1])

Discussion

It's virtually impossible to use Filter, DropMissing, DropExtrema, Sort, and Sample correctly in parallel branches of a pipeline, so it would make sense to at least emit a warning, if not an error, when those transforms appear there.
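A hypothetical sketch of such a guard (the list of row-changing transforms and the way branch transforms are collected are assumptions here, not the package's API):

using TableTransforms

# Transforms that may change the number or order of rows (illustrative list).
const ROW_CHANGING = (Filter, DropMissing, DropExtrema, Sort, Sample)

# `branch_transforms` is assumed to be a plain collection of transforms;
# extracting them from a pipeline is not shown here.
function warn_row_changing(branch_transforms)
    for t in branch_transforms
        if any(T -> t isa T, ROW_CHANGING)
            @warn "$(typeof(t)) inside a parallel branch may change the row count or order"
        end
    end
end

warn_row_changing([Select(:a), Filter(row -> row.a < 4)])  # warns about the Filter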

The alternative, as I mentioned in the original post, is to leave table queries to Query.jl and DataFrames.jl.

(Technically you can make the same mistakes with DataFrames because they have hcat, but you shouldn't normally be using hcat.)

juliohm commented 9 months ago

Both examples are valid. Maybe your mental model of parallel pipelines doesn't match what we have here. Parallel pipelines simply combine the results of the branches with hcat, and that is clear from the docstring. There is no guarantee that rows of the original table will be processed together. If you ask to sort column b and then hcat it with column a, that is the expected result.
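For instance, example 1) above, spelled out by hand with plain named tuples (a sketch of the hcat semantics, not of the package internals):

a_branch = (a = [1, 2, 3],)        # what Select(:a) keeps
b_branch = (b = sort([3, 2, 1]),)  # what Select(:b) → Sort(:b) produces
merge(a_branch, b_branch)          # (a = [1, 2, 3], b = [1, 2, 3])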

Regarding all transforms that change the number of rows, that is also expected. There are many other transforms that change the number of rows: Potrace, Rasterize.

I don't think that Query.jl or DataFrames.jl address the same problem. The feature we have here is quite unique and powerful for the applications we are developing.

CameronBieganek commented 9 months ago

Whether you have a row-oriented table or a column-oriented table, individual data records are contained in rows. The examples above destroy the integrity of the data records.

1)

The initial data records:

(1, 3)
(2, 2)
(3, 1)

The final data records:

(1, 1)
(2, 2)
(3, 3)

2)

The initial data records:

(1, 1)
(2, 2)
(3, 3)
(4, 4)

The final data records:

(2, 3)
(1, 2)
(3, 1)

It's hard to imagine a scenario where this behavior is useful. But if you're ok with data corruption, then... 🤷‍♂️

juliohm commented 9 months ago

For you this is invalid, but for many other people it may not be. It doesn't matter whether the table is row-major or column-major; the transformation can do anything with the table and produce another table. It is the user who builds sensible pipelines for statistical analysis.

You are coming from the perspective that we should only allow transforms that make some sort of statistical sense. Again, think more broadly and understand that TableTransforms.jl is about transformations of tables, not just stats or ML.

CameronBieganek commented 9 months ago

You are coming from the perspective that we should only allow transforms that make some sort of statistical sense.

No, this has nothing to do with statistics. It has everything to do with data contained in tables. The definition of a table is that data records are contained in rows. This is a broader concept than just statistical data. A table is fundamentally just a data structure for holding rows/records/tuples (whatever you want to call them). Each row is atomic, and each row contains the same number of elements. Mathematically, a table is a relation: a subset of a Cartesian product.

If that's not how you treat tables, then maybe you should name this package something else, like ArrayTransforms.jl.

juliohm commented 9 months ago

Tables are also a representation that associates column vectors to names. No matter the definition, we can operate on these tables and produce new tables. Invalidation as you describe can be a very delicate concept. Would renaming the columns of the table be an invalidation of records? Would replacing missing by 0 be an invalidation of records?

The issue here is still open because we can throw a better error message for the case where the parallel transform returns columns with different lengths. Other than that, the package name is completely fine.

CameronBieganek commented 9 months ago

Tables are also a representation that associates column vectors to names.

If tables were just a collection of named vectors, then the vectors could have different lengths. The requirement that tables have equal length columns follows from the definition of a table as a set of equal-sized rows.

Would renaming the columns of the table be an invalidation of records? Would replacing missing by 0 be an invalidation of records?

Obviously some table transformations make sense. But splicing together half of one row with half of a different row from the same table is a big no-no.

Anyhow, that's all I have to say on this subject.

juliohm commented 9 months ago

Obviously some table transformations make sense. But splicing together half of one row with half of a different row from the same table is a big no-no.

One may ask why, and in order to answer that question you will likely need to rely on specific assumptions about the kinds of pipelines you have in mind. For me it is completely valid and useful to splice half rows and columns together if the application asks for it.

Thanks for the input.

ParadaCarleton commented 9 months ago

Both examples are valid. Maybe your mental model of parallel pipelines is not matching what we have here. Parallel pipelines simply combine the result of the branches with hcat, and that is clear from the docstring. There is no guarantee that rows of the original table will be processed together. If you ask to sort the column b and then hcat with column a, that is the expected result.

Right, so it's not wrong per se for TableTransforms.jl to do what the code literally says. However, it's a design that will likely lead to correctness bugs in users' code. The problem is that it's easy for users to make mistakes or typos. When we see a suspicious-looking user input, our first question should be "Did the user actually mean to do this?"

When we filter or sort a table, we typically want to filter or sort either the columns or the rows of the table. Filter and Sort commands aren't intended to mix-and-match variables, e.g. matching one country's oil reserves with a different country's growth rate.

However, absent-minded programmers often write programs like this:

left = Select(:a) 
right = Select(:b) → Sort(:b)
transform = left ⊔ right

When what they really meant was this:

left = Select(:a) 
right = Select(:b)
transform = left ⊔ right → Sort(:b)

(Where the typo is in copy-pasting Sort one line higher than they meant.)

I think in the rare cases where a user does want to combine variables from different observations, it's better for them to write something more explicit, like permute!(data.oil, sortperm(data.growth)).
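For example (illustrative data, with hypothetical variable names):

data = (oil = [10.0, 20.0, 30.0], growth = [0.3, 0.1, 0.2])

# Explicitly re-pair the variables: put oil in the order given by the ranking of growth.
permute!(data.oil, sortperm(data.growth))

julia> data
(oil = [20.0, 30.0, 10.0], growth = [0.3, 0.1, 0.2])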

ParadaCarleton commented 9 months ago

@juliohm thank you so much for this! :smile:

juliohm commented 9 months ago

Thanks to @eliascarv who implemented the fix 👍🏽

We are working on #221 next.