JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

DataFrames replacement that has no type-uncertainty #744

Closed tshort closed 5 years ago

tshort commented 9 years ago

@johnmyleswhite started this interesting email thread:

https://groups.google.com/forum/#!topic/julia-dev/hS1DAUciv3M

Discussions included:

The following issues in Base may help with type certainty:

johnmyleswhite commented 9 years ago

Thanks for writing this up, Tom.

One point I'd make: we should try to decouple interface and implementation. Whether DataFrames are row-oriented or column-oriented shouldn't matter that much if we can offer tolerable performance for both iterating over rows and extracting whole columns. (Of course, defining tolerable performance is a tricky matter.) In particular, I'd like to see a DataFrames implementation that directly wraps SQLite3. Defining the interface at that level means that you can switch between SQLite3's in-memory database and something custom written for Julia readily, depending on your particular application's performance characteristics.

vchuravy commented 9 years ago

I am thinking about adding a PostgreSQL client for Julia and I would like to expose an interface via DataFrames, so I am very much in favour of decoupling interface and implementation.

I would like to see an interface that takes into account datasets that are larger than the available memory on the client and require some sort of streaming.

tonyhffong commented 9 years ago

A bit of an against-the-grain question: if a goal of DataFrames is to be an in-memory front end to a potentially much larger database (or a powerful engine), is the performance of this thin layer really that critical? It almost feels as though flexibility and expressive power trump raw performance in that case.

simonbyrne commented 9 years ago

@tonyhffong I think the intention is that it can be both: there will be a pure julia DataFrame for general use, but you can also swap this for a different backend without changing code.

One other topic perhaps worth considering is indexing: in particular, what interface to use, and do we want pandas-style hierarchical indexing?

stevengj commented 9 years ago

If dot overloading (JuliaLang/julia#1974) is implemented well, could df[1, :a] be replaced by df[1].a while still using @simonbyrne's staged-function tricks?

MikeInnes commented 9 years ago

Since that would expand to getfield(df[1], Field{:a}()) you could certainly use the staged function trick in principle. But that depends heavily on how efficient df[1] is as well. Data frame slices might be needed.

simonster commented 9 years ago

If we had "rerun type inference after inlining in some cases" from #3440, that would probably also be sufficient to make df[1, :a] work. Or some hack to provide type information for arbitrary functions given the Exprs they were passed, similar to what inference.jl does for tupleref.

MikeInnes commented 9 years ago

I mentioned this on the mailing list, but some kind of @pure notation that gives the compiler freedom to partially evaluate the function when it has compile-time-known arguments would also be sufficient – certainly for this and other indexing purposes, and perhaps more besides.

stevengj commented 9 years ago

@one-more-minute, see JuliaLang/julia#414

johnmyleswhite commented 9 years ago

This is just a framing point, but one way that I'd like to talk about this issue is in terms of "leaving money on the table", rather than in terms of performance optimization. A standard database engine gives you lots of type information that we're just throwing out. We're not exploiting knowledge about whether a column is of type T and we're also not exploiting knowledge about whether a column contains nulls or not. Many other design issues (e.g. row-oriented vs. column-oriented) depend on assumptions about how data will be accessed, but type information is a strict improvement (up to compilation costs and the risks of over-using memory from compiling too many variants of a function).

johnmyleswhite commented 9 years ago

I tried to write down two very simple pieces of code that exemplify the very deep conceptual problems we need to solve if we'd like to unify the DataFrames and SQL table models of computation: https://gist.github.com/johnmyleswhite/584cd12bb51c27a19725

teucer commented 9 years ago

To reiterate @tonyhffong's point, I wonder, maybe naively, why one cannot use an SQLite in-memory database and an interface a la dplyr to carry out all the analyses. I have the impression that database engines have solved and optimised a lot of issues that we are trying to address here. Besides one of the main frustrations with R (at least mine) is the fact that large data sets cannot be handled directly. This would also remediate that issue.

I can foresee some limitations with this approach

johnmyleswhite commented 9 years ago

Using SQLite3 as a backend is something that would be worth exploring. There are also lots of good ideas in dplyr.

That said, I don't really think that using SQLite3 resolves the biggest unsolved problem, which is how to express to Julia that we have very strong type information about dataframes/databases, but which is only available at run-time. To borrow an idea from @simonster, the big issue is how to do something like:

function sum_column_1(path::String)
    df = readtable(path)

    s = 0.0

    for row in eachrow(df)
        s += row.column_1
    end

    return s
end

The best case scenario I can see for this function is to defer compilation of everything after the call to readtable and then compile the rest of the body of the function after readtable has produced a concrete type for df. There are, of course, other ways to achieve this effect (including calling a second function from inside of this function), but it's a shame that naive code like the above should suffer from so much type uncertainty that could, in principle, be avoided by deferring some of the type-specialization process.
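
For concreteness, a sketch of the "second function" workaround mentioned above (a function barrier); it only pays off if readtable returns a data frame whose concrete type carries the column types:

function sum_column_1(path::String)
    df = readtable(path)        # return type unknown until run time
    return _sum_column_1(df)    # one dynamic dispatch here, then specialized code
end

function _sum_column_1(df)      # compiled against the concrete type of df
    s = 0.0
    for row in eachrow(df)
        s += row.column_1
    end
    return s
end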

teucer commented 9 years ago

Your example is what I meant by writing efficient custom functions. A possibility that I can see is to "somehow" compile Julia functions into SQLite user-defined functions. But this is probably cumbersome.

datnamer commented 9 years ago

@johnmyleswhite Here is a Python data interop protocol for working with external databases etc. through Blaze (the NumPy/Pandas 2 ecosystem): http://datashape.pydata.org/overview.html

It is currently being used only to lower/JIT expressions on NumPy arrays, but it facilitates interop and discovery with other backends: http://matthewrocklin.com/blog/work/2014/11/19/Blaze-Datasets/

Not sure if there are ideas here that can help in any way, but thought I would drop it in regardless.

johnmyleswhite commented 9 years ago

These things are definitely useful. I think we need to think about how they interact with Julia's existing static-analysis JIT.

datnamer commented 9 years ago

Glad it is helpful. Here is the coordinating library that connects these projects: https://github.com/ContinuumIO/blaze. It has some good ideas for chunking, streaming, etc.

Here is a scheduler for doing out of core ops: http://matthewrocklin.com/blog/work/2015/01/16/Towards-OOC-SpillToDisk/

The graphs are optimized to remove unnecessary computation: https://github.com/ContinuumIO/dask/pull/20

Maybe after some introspection, DataFrames could use blocks.jl to stream databases into memory transparently. Does Julia have facilities to build and optimize parallel scheduling expression graphs?

jrevels commented 9 years ago

@johnmyleswhite I ended up coding something that could be useful during your talk about this earlier today, and then I found this issue so I figured the most convenient option might be to discuss it here.

I came up with a rough sketch of a type-stable and type-specific implementation of DataFrames; a gist can be found here. @simonbyrne's prototype ended up heavily informing how I structured the code, so it should look somewhat similar.
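
In rough outline, a simplified, hypothetical sketch of the idea (the linked gist is the real implementation, and it also carries the column names as a field):

abstract Field{f}

immutable DataFrame{C<:Tuple, F}
    columns::C   # e.g. Tuple{Vector{Int},Vector{Char}}; F is e.g. Tuple{:numbers,:letters}
end

@generated function Base.getindex{C,F,f}(df::DataFrame{C,F}, ::Type{Field{f}})
    i = findfirst(collect(F.parameters), f)   # column position resolved at compile time
    return :(df.columns[$i])
end

Base.getindex{C,F,f}(df::DataFrame{C,F}, i::Int, ::Type{Field{f}}) = df[Field{f}][i]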

Pros for this implementation:

Cons:

jrevels commented 9 years ago

Note that, given the above implementation, it's also pretty easy to add type-stable/specific methods for getindex that have the other indexing behaviors currently defined on DataFrames (e.g. df[i, Field{:s}], df[i]).

Edit: Just added this to the gist for good measure. Seeing it in action:

julia> df = @dframe(:numbers = collect(1:10), :letters = 'a':'j')
DataFrame{Tuple{Array{Int64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([1,2,3,4,5,6,7,8,9,10],'a':1:'j'),Tuple{:numbers,:letters})

julia> df[1]
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> df[2, Field{:numbers}]
2

This implementation also easily supports row slicing:

julia> df[1:3, Field{:numbers}]
3-element Array{Int64,1}:
 1
 2
 3

(...though I'm not sure whether the lack of support for row slicing is by design or not in the current DataFrames implementation)

tshort commented 9 years ago

Nice @jrevels! The main drawback I see is that type-stable indexing is still cumbersome. You need df[Field{:numbers}] rather than df.numbers.

jrevels commented 9 years ago

@tshort True. I'm not sure that this implementation could ever support a syntax that clean, but I can naively think of a few alternatives that could at least make it a little easier to deal with:

  1. Have an access macro: @field df.numbers that expands to df[Field{:numbers}]. This is still kind of annoying to write, but at least reads like the nice syntax you propose. Coded properly, you could write something like @field df.numbers[1] + df.numbers[2] and the macro could expand each use of df.* to df[Field{:*}] (a rough sketch of such a macro follows after this list).
  2. Shorten the name of the Field type, e.g. abstract fld{f} so that access looks like df[fld{:numbers}]. This makes it a bit easier to type, but IMO makes it even harder to read, and is probably uglier than is acceptable.
  3. Use constants assigned to the appropriate Field{f} type. For example:
julia> const numbers = Field{:numbers}
Field{:numbers}

julia> df[numbers]
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

This results in a nice syntax, but of course requires reserving a name for the constant. This could be done automatically as part of the @dframe macro, but I think unexpectedly introducing new constants into the user's environment might be too intrusive.
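
For reference, a rough, hypothetical sketch of what the @field macro in option 1 might look like (this is not code from the gist); it recursively rewrites df.name into df[Field{:name}]:

const FIELD_HEAD = :(a.b).head   # the Expr head used for field access

macro field(ex)
    return esc(field_rewrite(ex))
end

field_rewrite(ex) = ex
function field_rewrite(ex::Expr)
    if ex.head == FIELD_HEAD
        # df.name  ==>  df[Field{:name}]
        return :($(field_rewrite(ex.args[1]))[Field{$(ex.args[2])}])
    end
    # recurse so composite expressions like df.numbers[1] + df.numbers[2] work
    return Expr(ex.head, map(field_rewrite, ex.args)...)
end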

simonbyrne commented 9 years ago

Unless we specialise DataFrames into the language somehow (e.g., by making them act like Modules), I suspect that this sort of idea is likely to be the most useful.

The key downside from a performance perspective is going to be the JIT overhead every time you apply a function to a new DataFrame signature, though I'm not sure how important this is likely to be in practice.

johnmyleswhite commented 9 years ago

This is great. I'm totally onboard with this.

My one source of hesitation is that I'm not really sure which performance problems we want to solve. What I like about this is that it puts us in a position to use staged functions to make iterating over the rows of a DataFrame fast. That's a big win. But I think we still need radical changes to introduce indexing into DataFrames and even more work to provide support for SQL-style queries.

In an ideal world, we could revamp DataFrames with this trick while also exploring a SQLite-backed approach. If we're worried about labor resources, I wonder if we need to write out some benchmarks that establish what kinds of functionality we want to provide and what we want to make fast.

jrevels commented 9 years ago

If you guys decide that it's worth it, I'd love to help refactor DataFrames to use the proposed implementation. I'd just have to get up to speed with where the package is at, given that such a refactor might also entail breaking API changes. If you think that DataFrames isn't ready, or that a large refactor is untenable given the SQL-ish direction that you want to eventually trend towards, that's also cool. Once decisions have been made as to what should be done, feel free to let me know how I can help.

The key downside from a performance perspective is going to be the JIT overhead every time you apply a function to a new DataFrame signature, though I'm not sure how important this is likely to be in practice.

I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption).

johnmyleswhite commented 9 years ago

I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption).

I tend to agree. I'd be surprised if the number of distinct DataFrames exceeded 100 in most programs, although I'm sure somebody will attempt to build a million column DataFrame by incremental calls to hcat.

teucer commented 9 years ago

I still have the feeling that this is reinventing the wheel if the end goal is to have an SQL-like/style syntax!

I recently stumbled upon monetdb (https://www.monetdb.org/Home).

They have embedded R in the database (https://www.monetdb.org/content/embedded-r-monetdb) and developed a package to access it from R via dplyr.

I know this is not "pure" Julia, but I could very well imagine working with such a technology, which would also enable user-defined Julia functions to be "transformed" into database-level functions. The next step would be something like this: http://hannes.muehleisen.org/ssdbm2014-r-embedded-monetdb-cr.pdf

Would an idea like this be worth exploring?

nalimilan commented 9 years ago

@teucer The problem is that for Julia to generate efficient code, it needs to know the types of the input variables when compiling the functions. dplyr and its connections to databases are very interesting, but they don't solve the type-stability issue if we want fast Julia code operating on data frames.

datnamer commented 9 years ago

@nalimilan The monetdb example is more interesting than dplyr's SQL generation because it utilizes R to produce native database UDFs, if I understand correctly.

teucer commented 9 years ago

@nalimilan @datnamer yes that's exactly the point! All the heavy work will be done by the database itself.

There will still be some latency between Julia and the database, though. In my opinion, the "zero copy integration" (last link) nicely addresses this point.

jrevels commented 9 years ago

So it appears there's some debate as to whether or not folks want DataFrames to be a database interface, a purely Julian implementation, or some hybrid approach. I don't have enough knowledge to advocate for any of these cases - I was just linking the proposal I made because it fixes some type uncertainty issues that exist in the current DataFrames implementation.

@johnmyleswhite I did a little bit more work on the gist to come up with an indexing structure that makes sense for the implementation, and enables type-stable indexing of entire rows rather than just columns/elements. Seeing it in action:

julia> df = @dframe(:numbers = collect(0.0:0.1:0.4), :letters = 'a':'e')
DataFrame{Tuple{Array{Float64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([0.0,0.1,0.2,0.3,0.4],'a':1:'e'),Tuple{:numbers,:letters})

julia> df[Field{:numbers}]
5-element Array{Float64,1}:
 0.0
 0.1
 0.2
 0.3
 0.4

julia> df[Col{1}]
5-element Array{Float64,1}:
 0.0
 0.1
 0.2
 0.3
 0.4

julia> df[:, Field{:numbers}] == df[Field{:numbers}]
true

julia> df[:, Col{1}] == df[Col{1}]
true

julia> df[3, Col{1}]
0.2

julia> df[3, Field{:numbers}]
0.2

julia> df[3:5, Col{1}]
3-element Array{Float64,1}:
 0.2
 0.3
 0.4

julia> df[3:5, Field{:numbers}]
3-element Array{Float64,1}:
 0.2
 0.3
 0.4

julia> df[3, :] # 3rd row
DataFrame{Tuple{Float64,Char},Tuple{:numbers,:letters}}((0.2,'c'),Tuple{:numbers,:letters})

julia> df[2:4, :] # 2:4 rows; has type typeof(df)
DataFrame{Tuple{Array{Float64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([0.1,0.2,0.3],'b':1:'d'),Tuple{:numbers,:letters})

julia> df[:,:] # copy of df through slicing
DataFrame{Tuple{Array{Float64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([0.0,0.1,0.2,0.3,0.4],'a':1:'e'),Tuple{:numbers,:letters})

EDIT: fixed incorrect typing comment

jrevels commented 9 years ago

I just realized that the above also supports indexing rows by name if you use Dicts as your columns:

julia> df = @dframe(:first = Dict('a' => 1, 'b' => 2), :second => Dict('a' => 3.0, 'b' => 4.5))
DataFrame{Tuple{Dict{Char,Int64},Dict{Char,Float64}},Tuple{:first,:second}}((Dict('b'=>2,'a'=>1),Dict('b'=>4.5,'a'=>3.0)),Tuple{:first,:second})

julia> df['b', :]
DataFrame{Tuple{Int64,Float64},Tuple{:first,:second}}((2,4.5),Tuple{:first,:second})

julia> df['b', Field{:first}]
2

Nifty!

jrevels commented 9 years ago

On the above: A more serious implementation of name-based row indexing would obviously want to avoid storing duplicate keys:

type NamedRowDataFrame{K,C<:Tuple,F}
    df::DataFrame{C,F}
    rows::Dict{K,Int} # provides row name --> row index map
end

simonbyrne commented 9 years ago

A few quick thoughts:

johnmyleswhite commented 9 years ago

I certainly think the lazy accumulation of transformations is the way to go. That coupled with some of the work that Jacob Quinn showed today on SQLite.jl + Tables.jl would get us into a vastly better state than we're in now.

teucer commented 9 years ago

@simonbyrne @johnmyleswhite The first point seems indeed to be the trend.

While it is true that dplyr is providing a unified syntax to deal with the datasources, this is, in my opinion, not the big novelty here. The bigger change that I see is that now R is more and more embedded in the databases and has become a first class citizen at the database level:

What I like about monetdb is that it is a column store and potentially offers big performance gains for aggregation and filtering (at least in my use cases I don't often have situations where I have to add new rows to data.frames): http://stackoverflow.com/a/980941/279497

johnmyleswhite commented 9 years ago

SQLite already has "embedding" of Julia in the DB.

davidagold commented 9 years ago

I've been experimenting with another scheme for type-certain indexing. It introduces a new ColName{T} immutable, which is used instead of the DataFrame object to communicate the return type of a call to getindex. Essentially you have the following:

type DataFrame
    columns::Vector{Vector}
end

immutable ColName{T}
    idx::Int
end

@inline function Base.getindex{T}(df::DataFrame, i::Int, col::ColName{T})
    return df.columns[col.idx][i]::T
end

The above is of course just one way to organize the columns field -- one could use a Dict or even a Matrix{Any}, which would still yield type-certain indexing. The latter option may help with memory layout, though it may use more memory on the whole because of the Any.

This scheme has the potential to be very user friendly, at a cost. If one is willing to use eval(current_module(), ...) to assign globally the names of the columns to ColName{T} objects, then one can achieve very natural, type-certain indexing. The cost, aside from using eval and assigning names in the user's namespace, is that one has to be careful when working with multiple DataFrames not to use the same column name in two tables unless the columns have the same position in both tables and are of the same eltype.
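
For concreteness, a hypothetical sketch (the helper name is invented here) of the eval-based binding just described:

function bind_colnames!(colnames::Vector{Symbol}, coltypes::Vector{DataType})
    for i in 1:length(colnames)
        name, T = colnames[i], coltypes[i]
        # bind e.g. `const last_name = ColName{UTF8String}(3)` in the current module
        eval(current_module(), :(const $name = ColName{$T}($i)))
    end
end

# e.g. bind_colnames!([:id, :last_name], [Int, UTF8String]) lets you write
# tab[2, last_name] with the result inferred as UTF8String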

I've been playing with these ideas in (warning! rough code ahead) this branch of a repo hosting an experimental Table type that can be materialized from an (equally experimental) abstract interface to a SQLite3 backend. I've essentially lifted the SQLite3 interface from @quinnj 's SQLite.jl. Here's what it looks like in action so far:

julia> using Nierika

julia> db = DB("db/test.db")
Nierika.DB("db/test.db",Ptr{Void} @0x00007faa378b4a10,0)

julia> tab = select(db, "person") do
           where("id > 1")
       end  |> Table
2x4 Table with columns
(id, first_name, last_name, age):
 2  "Santa"   "Clause"  400
 3  "Barack"  "Obama"    53

julia> tab[2, last_name]
"Obama"

julia> @code_warntype tab[2, last_name]
Variables:
  tab::Nierika.Table
  i::Int64
  col::Nierika.ColName{UTF8String}

Body:
  begin 
      $(Expr(:meta, :inline)) # /Users/David/.julia/v0.4/Nierika/src/table.jl, line 35:
      return (top(typeassert))((Nierika.getindex)((top(getfield))(tab::Nierika.Table,:columns)::Any,i::Int64,(top(getfield))(col::Nierika.ColName{UTF8String},:idx)::Int64)::Any,T)::UTF8String
  end::UTF8String

julia> last_name
Nierika.ColName{UTF8String}(3,"last_name")

Another potential benefit of using ColNames is that by overloading comparison operators for ColName and other arguments, one could eventually drop the need for the string arguments in the select and where methods illustrated above and just have

select(db, person) do
    where(id > 1)
end |> Table

which allows for a delayed evaluation-like effect up front and the lazy accumulation of SQL commands behind the scenes -- all without requiring the user to code with strings or use macros (One would of course need an analogous TableName type).
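
A rough, hypothetical sketch of that operator-overloading idea (not Nierika's actual code; it assumes ColName also stores its column name, as the printed ColName above suggests):

immutable SQLPredicate
    sql::AbstractString
end

import Base: >, <

>(c::ColName, x::Number) = SQLPredicate("$(c.name) > $x")   # builds SQL instead of evaluating
<(c::ColName, x::Number) = SQLPredicate("$(c.name) < $x")

where(p::SQLPredicate) = p.sql   # accumulated into the lazily-built SELECT ... WHERE clause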

I don't know if the pros to this approach can outweigh the cons, but I find it interesting and intend to follow it for the time being. Any and all thoughts are appreciated!

jrevels commented 9 years ago

This scheme has the potential to be very user friendly, at a cost. If one is willing to use eval(current_module(), ...) to assign globally the names of the columns to ColName{T} objects, then one can achieve very natural, type-certain indexing.

My and @simonbyrne's implementations also support this idea, but I believe your implementation is way cooler than the ones posited so far for a few reasons that haven't yet been explicitly listed:

I have a question about this definition, though:

type DataFrame
    columns::Vector{Vector}
end

Type-certain indexing is granted by the new getindex definition, and avoiding full exposure of type information solves the JIT overhead issue, as mentioned. However, could the ambiguity of Vector{Vector} (or Matrix{Any}, Dict, etc.) cause any issues with other functions, or perhaps memory boxing issues?

Another potential benefit of using ColNames is that by overloading comparison operators for ColName and other arguments, one could eventually drop the need for the string arguments in the select and where methods illustrated above and just have

select(db, person) do
    where(id > 1)
end |> Table

which allows for a delayed evaluation-like effect up front and the lazy accumulation of SQL commands behind the scenes -- all without requiring the user to code with strings or use macros (One would of course need an analogous TableName type).

This is awesome!

davidagold commented 9 years ago

@jrevels Thank you for your enthusiastic feedback =) You raise a very good point about the possible boxing issues, and I think this does indeed give some grief to the current design. Comparing the results of @time for the following two functions shows as much ("ab" is just a table with two columns, A and B, that each contain 100 random Float64 values):

db = DB("db/AB.db")
ab = query(db, "select * from ab") |> Table
X = rand(100)

function f(tab::Table)
    x = 0.0
    for i in 1:nrows(tab)
       x += tab[i, A]
    end
    x
end

function h(X::Array)
    x = 0.0
    for i in eachindex(X)
       x += X[i]
    end
    x
end

f(ab)
h(X)

julia> @time f(ab)
  0.000023 seconds (105 allocations: 1.734 KB)
44.5651422074791

julia> @time h(X)
  0.000004 seconds (5 allocations: 176 bytes)
52.8487047215383

The code_typed doesn't throw any flags w/r/t type inference, so I suspect the discrepancy is due to what you suggest (I can post the code_typed from each, if folks are interested).

One solution that's rather more radical than the above is to have ColName{T} actually store the column of values itself, thus becoming more of a Column{T}. This does seem to allay the memory issues -- a rough implementation yields, after warmup:

julia> @time f(ab)
  0.000008 seconds (5 allocations: 176 bytes)
44.5651422074791

So, now the DataFrame is really just a wrapper for a number of Column{T} objects that have names in the global namespace. It would be used mostly to coordinate wholesale interactions with the columns, e.g. any row-wise interaction with the DataFrame. For safety reasons, trying to setindex! into a Column would throw an error.
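
A minimal, hypothetical sketch of that Column{T} variant:

immutable Column{T}
    name::Symbol
    data::Vector{T}
end

Base.getindex(c::Column, i::Int) = c.data[i]   # inferred as T, no assertion needed
Base.setindex!(c::Column, v, i::Int) = error("Columns are read-only; modify the parent table instead")

type DataFrame
    columns::Vector{Column}   # only coordinates wholesale / row-wise operations
end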

Also! I've got a rough implementation of (near) stringless delayed evaluation/lazy accumulation of SQL commands via do block syntax going:

julia> using Nierika

julia> db = DB("db/AB.db")
Nierika.DB("db/AB.db",Ptr{Void} @0x00007f98656adfb0,0)

julia> tab = select(db, "ab") do
           where(A > .5)
       end |> Table
45x3 Nierika.Table:
  2  0.951741  0.872029 
  5  0.593651  0.983585 
  7  0.552128  0.367959 
  8  0.597448  0.874415 
  9  0.629397  0.0549385
 10  0.556022  0.676691 
# ... rest of entries suppressed

The implementation actually doesn't use ColNames at all, which is nice since it means we needn't introduce them into the namespace until the last minute, when one wants to materialize a Julia DataFrame. Rather, it uses the anonymous function returned by the do block as a vehicle for unevaluated code, snags that code using Base.uncompressed_ast (thanks @timholy !), picks out the arguments of the Expr in which where is called, and then stringifies them into SQL code.

On the one hand, I think this is really neat, and it's nice to be able to use Julia's built-in organization of Expr objects for parsing purposes. On the other hand, I am wary of introducing a syntax that lets users interact with columns in a database as though they were actual Julia objects. It's also kind of an abuse of the do syntax, but I feel less bad about that.

davidagold commented 9 years ago

@jrevels I've actually grown a little disenchanted with the global column names approach, since there's no way to control for the situation when a column name shadows a common imported name from Base (or any other module) such as count. I think your/Simon's idea is probably the way to go. It occurs to me now that the indexing needn't be as cumbersome as originally thought. Once Tuple{A} becomes expressible as {A}, one ought to be able to make your original scheme work with indexing looking like df[1, {:numbers}]. A little more cumbersome than the status quo, but arguably worth it. And the braces amortize when indexing over multiple columns: df[1, {:numbers, :letters, :emojis}]. Conceptually, it's fairly clear: all column references go inside the braces.

jrevels commented 9 years ago

I've actually grown a little disenchanted with the global column names approach, since there's no way to control for the situation when a column name shadows a common imported name from Base (or any other module) such as count.

Good point. The main benefit of your approach, for me, was that using type assertions allowed for a great reduction in JIT overhead, though I suppose this:

I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption).

I tend to agree. I'd be surprised if the number of distinct DataFrames exceeded 100 in most programs, although I'm sure somebody will attempt to build a million column DataFrame by incremental calls to hcat.

...will still hold true in the face of the changes currently underway.

Once Tuple{A} becomes expressible as {A}, one ought to be able to make your original scheme work with indexing looking like df[1, {:numbers}].

I've been waiting for JuliaLang/julia#8470 to be implemented for a while now, +1 to this.

If JuliaLang/julia#1974 gets resolved in favor of overloadable field access syntax, you might even be able to get away with something similar to this:

df.{:numbers}[1]

which I like but may be a bit "out there" in terms of fitting with other Julia code stylistically.

davidagold commented 9 years ago

The thing about the df.{:numbers}[1] syntax is that there doesn't appear to be any analogous translation of df[1, {:numbers, :letters}], so we'd need to support the latter syntax anyway. FWIW I don't mind the style -- my main question would be whether or not it would be giving too many ways of doing the same thing, especially if we find a way to make DataFrame/Table reification convenient and performant.

EDIT: Oh wait, of course there is. Sometimes I forget just how powerful things like field and call overloading can be.

simonbyrne commented 9 years ago

At the moment, I think the @generated approach is going to be too punishing on the type system: trying it out on a DataFrame with a few hundred columns causes a notable JIT delay (10s of seconds) for every operation. Until we have generated-generated functions which can partially pre-compile, this is likely to be too slow (and get us disapproving looks from Jeff and Jameson).

I think the ColName{T} approach could be made to work, and to play nice with things like SQLite.jl. I actually quite like the idea that variables would be "global" (i.e. lastname is always a UTF8String). To avoid conflicting with variable names, we could define a macro string literal, i.e. so that columns are referred to by d"lastname"?

davidagold commented 9 years ago

@simonbyrne are you suggesting that d"lastname" expand to some internal name for the Column{T} object?

Maybe this would be excessive, but perhaps one could generate a new string macro for each DataFrame object and use that for column access. So instantiating df = DataFrame( ... ) would create a string macro df_str and then df"first_name, last_name" could expand to df[_first_name, _last_name], where _first_name, _last_name are internal names for the Column{T} objects.

Actually, in the latter approach you may as well just splice in the names of the relevant columns and do column access at compile time.
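
For concreteness, a hypothetical sketch of the single string macro variant (assuming columns are bound to underscore-prefixed internal constants like _last_name, as suggested above):

macro d_str(name)
    # d"last_name" expands at parse time to the internal constant _last_name
    return esc(symbol(string("_", name)))
end

# usage: tab[2, d"last_name"]  ==>  tab[2, _last_name]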

alyst commented 9 years ago

How would the ColName{T} approach work if there are several data frames in the session with some overlapping column names, but not the same types?

I wonder what the bottleneck in the @generated approach is. Is it because e.g. join() has to be recompiled for each data frame pair? Would it help if join(a::AbstractDataFrame, b::AbstractDataFrame) were just a wrapper around join_impl(a::ANY, b::ANY)? Or is it because the generation of a new data frame type takes a lot of time?

davidagold commented 9 years ago

@alyst I think the bottleneck is just the sheer number of methods that need to be compiled -- at least one for each column name (depending on how accessing multiple columns is parsed). EDIT: Nevermind this first part of the comment -- I didn't understand that @alyst 's question was about the combinatoric issues of method proliferation.

I don't think there is a way, under the Column{T} scheme, to have distinct DataFrames share a column name but differ on the type of the values stored in the column. One might be tempted to try something like Column{Tuple{:last_name, T}}, but then you need to re-assign last_name every time a new DataFrame with that field is initialized, or every time you add such a field to an extant DataFrame. Aside from being a logistical mess, that would also preclude all benefits to be gained by declaring the relevant Column{T} object to be constant.

davidagold commented 9 years ago

I'll add that I don't think the above is too large a drawback, especially if we give users tools to modify column names as they are importing data.

It's also worth noting that giving each DataFrame a custom string macro avoids the issue. However, it would require that metaprogramming be used to initialize a DataFrame, which I'm not super into.

simonster commented 9 years ago

The advantage to something based on @generated is that you can do the name lookup at compile time, so that something like:

for i = 1:size(df, 1)
    x += df[i, Field{:x}]
end

is fast. With the ColName{T} approach, that's harder. It may or may not be possible to write the name lookup in a way that LLVM can hoist it out of the loop, but it wouldn't be easy.

As @simonbyrne notes, it is quite problematic that the naive @generated approach would result in us compiling lots of code for each particular DataFrame type, which is really bad. But I think that using wrappers as @alyst proposed might actually turn out okay. In cases where the DataFrame type is known to type inference and the wrappers are small enough to be inlined, Julia won't actually need to compile them (although it will need to run type inference).
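
A hypothetical sketch of that wrapper pattern (helper names invented here):

# the exported method is a tiny shim, cheap to specialize and easily inlined
nvalues(df::AbstractDataFrame) = nvalues_impl(df)

# the ::ANY annotation keeps Julia from specializing on the concrete DataFrame
# type, so this method is compiled only once
function nvalues_impl(df::ANY)
    n = 0
    for name in names(df)
        n += length(df[name])
    end
    return n
end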

davidagold commented 9 years ago

@simonster I noticed that -- loops such as the above are slightly faster in the @generated approach.

Could the following be used to make something like that work for the ColName{T} approach? One would need (i) to make the DataFrame type parametric, where the parameter is a Symbol ID, (ii) to make the ColName type include a similar ID parameter (could just be the column name as a Symbol), and (iii) to store all the columns of all DataFrames in a separate object, say col_lib::Dict{Symbol, Dict{Symbol, NullableVector}}. Then one could have

@generated function getindex{ID, T, C}(df::DataFrame{ID}, c::ColName{T, C})
    dfcols = col_lib[ID]
    col = dfcols[C]
    return :( $col )
end

davidagold commented 9 years ago

Oh wait, that totally incurs the same method proliferation problem as the other approach. Hmm.
