JuliaData / TypedTables.jl

Simple, fast, column-based storage for data analysis in Julia
Other
147 stars 25 forks source link

Unexpected behaviour with iterator constructors #83

Closed adigitoleo closed 3 years ago

adigitoleo commented 3 years ago

Follow up from my last comment from today in #82. It seems there are a few similar issues so I'll close this if you want, but this isn't about implementing a new interface so much as stabilising what we have.

Constructing either a Table or a FlexTable using an iterator seems to have unpredictable consequences. Specifically, subsequent push! operations are duplicated (by the number of columns?):

julia> cols = (:this, :that)
(:this, :that)

julia> ftb = FlexTable(; zip(cols, fill([0], length(cols)))...)
FlexTable with 2 columns and 1 row:
     this  that
   ┌───────────
 1 │ 0     0

julia> new_row = (; zip(cols, fill(1, length(cols)))...)
(this = 1, that = 1)

julia> push!(ftb, new_row)
FlexTable with 2 columns and 3 rows:
     this  that
   ┌───────────
 1 │ 0     0
 2 │ 1     1
 3 │ 1     1

For examples using Table see the linked issue. I think it's better to block iterator constructors completely rather than have this behaviour. Note that I'm just following what the example in ?NamedTuple suggests:

  julia> (; zip(keys, values)...)
  (a = 1, b = 2, c = 3)
adigitoleo commented 3 years ago

Just leaving this here for reference. Disallowing iterators in the call might not be so limiting if something like #33 is implemented. Currently it is already possible to use columnnames to construct the rows. This results in a bit of repetition for the table construction but at least allows to add rows in a shorthand way:


julia> tb = Table(; A=Int[], B=Int[], C=Int[])
Table with 3 columns and 0 rows:
     A  B  C
   ┌────────

julia> colnames = columnnames(tb)
(:A, :B, :C)

julia> push!(tb, (; zip(colnames, fill(1, length(colnames)))...))
Table with 3 columns and 1 row:
     A  B  C
   ┌────────
 1 │ 1  1  1

julia> push!(tb, (; zip(colnames, fill(2, length(colnames)))...))
Table with 3 columns and 2 rows:
     A  B  C
   ┌────────
 1 │ 1  1  1
 2 │ 2  2  2
andyferris commented 3 years ago

This is actually a well-known problem with fill, not with the iterators, in which every object === all the others. If you construct the columns as their own vectors you get the correct behavior:

julia> cols = (:this, :that)
(:this, :that)

julia> data = (; (col => Int[] for col in cols)...)
(this = Int64[], that = Int64[])

julia> data.this === data.that
false

julia> t = Table(data)
Table with 2 columns and 0 rows:
     this  that
   ┌───────────

julia> push!(t, (; (col => 1 for col in cols)...))
Table with 2 columns and 1 row:
     this  that
   ┌───────────
 1 │ 1     1
andyferris commented 3 years ago

Here's possibly a clearer example of what's going on @adigitoleo:

julia> x = fill([0], 10)
10-element Vector{Vector{Int64}}:
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]

julia> push!(x[1], 1)
2-element Vector{Int64}:
 0
 1

julia> x
10-element Vector{Vector{Int64}}:
 [0, 1]
 [0, 1]
 [0, 1]
 [0, 1]
 [0, 1]
 [0, 1]
 [0, 1]
 [0, 1]
 [0, 1]
 [0, 1]

Does that make sense? With the Table it is push!ing to each column so you get multiple insertions to the same Vector.

andyferris commented 3 years ago

We could potentially check if any two columns are === but (a) that's actually fine for read-only operations and (b) it doesn't account for all possibly types of aliasing. There are actually loads of ways of corrupting a Table, which is unfortunate, but you get the same issues with Base.SubArray and other really common types.

I'm not sure we can do more without changing the language (e.g. add an ownership model like Rust) - so the status-quo is advising to take care when mutating things.

adigitoleo commented 3 years ago

@andyferris Thanks for clearing that up. I'm pretty new to Julia so I wasn't aware of this. I agree with your other points, and the way you showed for how to generate the columns should work fine.

Thanks for the package!