Thanks for writing this up, Tom.
One point I'd make: we should try to decouple interface and implementation. Whether DataFrames are row-oriented or column-oriented shouldn't matter that much if we can offer tolerable performance for both iterating over rows and extracting whole columns. (Of course, defining tolerable performance is a tricky matter.) In particular, I'd like to see a DataFrames implementation that directly wraps SQLite3. Defining the interface at that level means that you can switch between SQLite3's in-memory database and something custom written for Julia readily, depending on your particular application's performance characteristics.
I am thinking about adding a PostgreSQL client for Julia and I would like to expose an interface via DataFrames, so I am very much in favour of decoupling interface and implementation.
I would like to see an interface that takes into account datasets that are larger than the available memory on the client and require some sort of streaming.
A somewhat against-the-grain question: if a goal of DataFrames is to be an in-memory front end to a potentially much larger database (or a powerful engine), is the performance of this thin layer really that critical? It almost feels like flexibility and expressive power trump raw performance in that case.
@tonyhffong I think the intention is that it can be both: there will be a pure julia DataFrame for general use, but you can also swap this for a different backend without changing code.
One other topic perhaps worth considering is indexing: in particular, what interface to use, and do we want pandas-style hierarchical indexing?
If dot overloading (JuliaLang/julia#1974) is implemented well, could df[1, :a] be replaced by df[1].a while still using @simonbyrne's staged-function tricks?
Since that would expand to getfield(df[1], Field{:a}()), you could certainly use the staged function trick in principle. But that depends heavily on how efficient df[1] is as well. Data frame slices might be needed.
If we had "rerun type inference after inlining in some cases" from #3440, that would probably also be sufficient to make df[1, :a] work. Or some hack to provide type information for arbitrary functions given the Exprs they were passed, similar to what inference.jl does for tupleref.
I mentioned this on the mailing list, but some kind of @pure notation that gives the compiler freedom to partially evaluate a function when it has compile-time-known arguments would also be sufficient – certainly for this and other indexing purposes, and perhaps for other uses as well.
@one-more-minute, see JuliaLang/julia#414
This is just a framing point, but one way that I'd like to talk about this issue is in terms of "leaving money on the table", rather than in terms of performance optimization. A standard database engine gives you lots of type information that we're just throwing out. We're not exploiting knowledge about whether a column is of type T and we're also not exploiting knowledge about whether a column contains nulls or not. Many other design issues (e.g. row-oriented vs. column-oriented) depend on assumptions about how data will be accessed, but type information is a strict improvement (up to compilation costs and the risks of over-using memory from compiling too many variants of a function).
I tried to write down two very simple pieces of code that exemplify the very deep conceptual problems we need to solve if we'd like to unify the DataFrames and SQL table models of computation: https://gist.github.com/johnmyleswhite/584cd12bb51c27a19725
To reiterate @tonyhffong's point, I wonder, maybe naively, why one cannot use an SQLite in-memory database and an interface a la dplyr to carry out all the analyses. I have the impression that database engines have solved and optimised a lot of the issues that we are trying to address here. Besides, one of the main frustrations with R (at least mine) is the fact that large data sets cannot be handled directly. This would also remedy that issue.
I can foresee some limitations with this approach
Using SQLite3 as a backend is something that would be worth exploring. There are also lots of good ideas in dplyr.
That said, I don't really think that using SQLite3 resolves the biggest unsolved problem, which is how to express to Julia that we have very strong type information about dataframes/databases, but which is only available at run-time. To borrow an idea from @simonster, the big issue is how to do something like:
function sum_column_1(path::String)
    df = readtable(path)
    s = 0.0
    for row in eachrow(df)
        s += row.column_1
    end
    return s
end
The best case scenario I can see for this function is to defer compilation of everything after the call to readtable and then compile the rest of the body of the function after readtable has produced a concrete type for df. There are, of course, other ways to achieve this effect (including calling a second function from inside of this function), but it's a shame that naive code like this should suffer from so much type uncertainty that could, in principle, be avoided by deferring some of the type-specialization process.
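For illustration, here is a minimal sketch of the "call a second function" workaround mentioned above (often called a function barrier); the kernel function name is made up, and symbol-based row indexing is used in place of the aspirational row.column_1 syntax:
using DataFrames

function sum_column_1(path::AbstractString)
    df = readtable(path)           # the type of df is opaque to the compiler here
    return sum_column_1_kernel(df) # one dynamic dispatch, then specialized code
end

function sum_column_1_kernel(df)   # compiled for the concrete type of df
    s = 0.0
    for row in eachrow(df)
        s += row[:column_1]
    end
    return s
end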
Your example is what I meant by writing efficient custom functions. A possibility that I can see is to "somehow" compile Julia functions into SQLite user-defined functions. But this is probably cumbersome.
@johnmyleswhite Here is a Python data interop protocol for working with external databases etc. through blaze (the numpy/pandas 2 ecosystem): http://datashape.pydata.org/overview.html
It is currently being used only to lower/JIT expressions on numpy arrays, but it facilitates interop and discovery with other backends: http://matthewrocklin.com/blog/work/2014/11/19/Blaze-Datasets/
Not sure if there are ideas here that can help in any way, but thought I would drop it in regardless.
These things are definitely useful. I think we need to think about how they interact with Julia's existing static-analysis JIT.
Glad it is helpful. Here is the coordinating library that connects these projects: https://github.com/ContinuumIO/blaze It has some good ideas of its own for chunking, streaming, etc.
Here is a scheduler for doing out-of-core ops: http://matthewrocklin.com/blog/work/2015/01/16/Towards-OOC-SpillToDisk/
The graphs are optimized to remove unnecessary computation: https://github.com/ContinuumIO/dask/pull/20
Maybe after some introspection, DataFrames can use blocks.jl to stream databases into memory transparently. Does Julia have facilities to build and optimize parallel scheduling expression graphs?
@johnmyleswhite I ended up coding something that could be useful during your talk about this earlier today, and then I found this issue so I figured the most convenient option might be to discuss it here.
I came up with a rough sketch of a type-stable and type-specific implementation of DataFrames; a gist can be found here. @simonbyrne's prototype ended up heavily informing how I structured the code, so it should look somewhat similar.
Pros for this implementation:
Cons:
- Compat could handle it?
- The @dframe constructor macro, while type-stable, does change the syntax a little bit compared to the current DataFrames(;kwargs...) constructor...mainly, the fact that it's a macro and not a normal type constructor. Secondarily, when writing the kwargs pairs, one must actually add the colon to the key symbol: DataFrame(a=collect(1:10)) vs. @dframe(:a=collect(1:10)).
Note that, given the above implementation, it's also pretty easy to add type-stable/specific methods for getindex that have the other indexing behaviors currently defined on DataFrames (e.g. df[i, Field{:s}], df[i]).
Edit: Just added this to the gist for good measure. Seeing it in action:
julia> df = @dframe(:numbers = collect(1:10), :letters = 'a':'j')
DataFrame{Tuple{Array{Int64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([1,2,3,4,5,6,7,8,9,10],'a':1:'j'),Tuple{:numbers,:letters})
julia> df[1]
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
julia> df[2, Field{:numbers}]
2
This implementation also easily supports row slicing:
julia> df[1:3, Field{:numbers}]
3-element Array{Int64,1}:
1
2
3
(...though I'm not sure whether the lack of support for row slicing is by design or not in the current DataFrames implementation)
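For readers who don't want to click through to the gist, here is a stripped-down sketch of the general approach, assuming Julia 0.4-era syntax; the names (TypedDataFrame in particular) and details are illustrative guesses rather than the gist's actual code:
immutable Field{name} end               # column reference carried in the type domain

immutable TypedDataFrame{C<:Tuple, F}   # C = tuple of column types, F = Tuple{...} of column names
    columns::C
end

# The name-to-position lookup happens while the method body is being generated,
# so the emitted code indexes the column tuple with a literal integer and the
# return type is concrete as far as the compiler is concerned.
@generated function Base.getindex{C, F, name}(df::TypedDataFrame{C, F}, ::Type{Field{name}})
    i = findfirst(F.parameters, name)
    i == 0 && return :(error("no column named ", $(string(name))))
    return :(df.columns[$i])
end

df = TypedDataFrame{Tuple{Vector{Int}, Vector{Char}}, Tuple{:numbers, :letters}}(
         (collect(1:10), collect('a':'j')))
df[Field{:numbers}]                     # inferred as Vector{Int}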
Nice @jrevels! The main drawback I see is that type-stable indexing is still cumbersome. You need df[Field{:numbers}] rather than df.numbers.
@tshort True. I'm not sure that this implementation could ever support a syntax that clean, but I can naively think of a few alternatives that could at least make it a little easier to deal with:
1. A macro @field df.numbers that expands to df[Field{:numbers}]. This is still kind of annoying to write, but at least reads like the nice syntax you propose. Coded properly, you could write something like @field df.numbers[1] + df.numbers[2] and the macro could expand each use of df.* to df[Field{:*}]. (A rough sketch of such a macro appears below.)
2. A shorter alias for the Field type, e.g. abstract fld{f}, so that access looks like df[fld{:numbers}]. This makes it a bit easier to type, but IMO makes it even harder to read, and is probably uglier than is acceptable.
3. Reserving constant names bound to the Field{f} type. For example:
julia> const numbers = Field{:numbers}
Field{:numbers}
julia> df[numbers]
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
This results in a nice syntax, but of course requires reserving a name for the constant. This could be done automatically as part of the @dframe macro, but I think unexpectedly introducing new constants into the user's environment might be too intrusive.
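A minimal sketch of what the @field macro from alternative 1 might look like, handling only the simple df.column form (the Field definition here is a stand-in for the type used in the gist):
immutable Field{name} end   # stand-in for the gist's Field type

macro field(ex)
    # expects an expression like df.column, i.e. Expr(:., df, QuoteNode(:column))
    obj = ex.args[1]
    q = ex.args[2]
    name = isa(q, QuoteNode) ? q.value : q.args[1]   # the field name as a Symbol
    return esc(:($obj[Field{$(QuoteNode(name))}]))
end

# @field df.numbers  expands to  df[Field{:numbers}]
# Walking a whole expression and rewriting every df.* use is a straightforward extension.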
Unless we specialise DataFrames into the language somehow (e.g., by making them act like Modules), I suspect that this sort of idea is likely to be the most useful.
The key downside from a performance perspective is going to be the JIT overhead every time you apply a function to a new DataFrame signature, though I'm not sure how important this is likely to be in practice.
This is great. I'm totally onboard with this.
My one source of hesitation is that I'm not really sure which performance problems we want to solve. What I like about this is that it puts us in a position to use staged functions to make iterating over the rows of a DataFrame fast. That's a big win. But I think we still need radical changes to introduce indexing into DataFrames and even more work to provide support for SQL-style queries.
In an ideal world, we could revamp DataFrames with this trick while also exploring a SQLite-backed approach. If we're worried about labor resources, I wonder if we need to write out some benchmarks that establish what kinds of functionality we want to provide and what we want to make fast.
If you guys decide that it's worth it, I'd love to help refactor DataFrames to use the proposed implementation. I'd just have to get up to speed with where the package is at, given that such a refactor might also entail breaking API changes. If you think that DataFrames isn't ready, or that a large refactor is untenable given the SQL-ish direction that you want to eventually trend towards, that's also cool. Once decisions have been made as to what should be done, feel free to let me know how I can help.
The key downside from a performance perspective is going to be the JIT overhead every time you apply a function to a new DataFrame signature, though I'm not sure how important this is likely to be in practice.
I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption).
I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption).
I tend to agree. I'd be surprised if the number of distinct DataFrames exceeded 100 in most programs, although I'm sure somebody will attempt to build a million-column DataFrame by incremental calls to hcat.
I still have the feeling that this is reinventing the wheel if the end goal is to have an SQL-like syntax!
I recently stumbled upon monetdb (https://www.monetdb.org/Home).
They have embedded R in the database (https://www.monetdb.org/content/embedded-r-monetdb) and developed a package to access it from R via dplyr.
I know this is not "pure" Julia, but I could very well imagine working with such a technology, which would also make it possible to have user-defined Julia functions "transformed" into database-level functions. The next step would be to have something like this: http://hannes.muehleisen.org/ssdbm2014-r-embedded-monetdb-cr.pdf
Would an idea like this be worth exploring?
@teucer The problem is that for Julia to generate efficient code, it needs to know the types of the input variables when compiling the functions. dplyr and its connections to databases are very interesting, but they don't solve the type-stability issue if we want fast Julia code operating on data frames.
@nalimilan The monetdb example is more interesting than dplyr's SQL generation because it utilizes R to produce native database UDFs, if I understand correctly.
@nalimilan @datnamer yes that's exactly the point! All the heavy work will be done by the database itself.
There will still be some latency between Julia and the database, though. In my opinion, the "zero copy integration" (last link) nicely addresses this point.
So it appears there's some debate as to whether or not folks want DataFrames to be a database interface, a purely Julian implementation, or some hybrid approach. I don't have enough knowledge to advocate for any of these cases - I was just linking the proposal I made because it fixes some type uncertainty issues that exist in the current DataFrames implementation.
@johnmyleswhite I did a little bit more work on the gist to come up with an indexing structure that makes sense for the implementation, and enables type-stable indexing of entire rows rather than just columns/elements. Seeing it in action:
julia> df = @dframe(:numbers = collect(0.0:0.1:0.4), :letters = 'a':'e')
DataFrame{Tuple{Array{Float64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([0.0,0.1,0.2,0.3,0.4],'a':1:'e'),Tuple{:numbers,:letters})
julia> df[Field{:numbers}]
5-element Array{Float64,1}:
0.0
0.1
0.2
0.3
0.4
julia> df[Col{1}]
5-element Array{Float64,1}:
0.0
0.1
0.2
0.3
0.4
julia> df[:, Field{:numbers}] == df[Field{:numbers}]
true
julia> df[:, Col{1}] == df[Col{1}]
true
julia> df[3, Col{1}]
0.2
julia> df[3, Field{:numbers}]
0.2
julia> df[3:5, Col{1}]
3-element Array{Float64,1}:
0.2
0.3
0.4
julia> df[3:5, Field{:numbers}]
3-element Array{Float64,1}:
0.2
0.3
0.4
julia> df[3, :] # 3rd row
DataFrame{Tuple{Float64,Char},Tuple{:numbers,:letters}}((0.2,'c'),Tuple{:numbers,:letters})
julia> df[2:4, :] # 2:4 rows; has type typeof(df)
DataFrame{Tuple{Array{Float64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([0.1,0.2,0.3],'b':1:'d'),Tuple{:numbers,:letters})
julia> df[:,:] # copy of df through slicing
DataFrame{Tuple{Array{Float64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([0.0,0.1,0.2,0.3,0.4],'a':1:'e'),Tuple{:numbers,:letters})
EDIT: fixed incorrect typing comment
I just realized that the above also supports indexing rows by name if you use Dicts as your columns:
julia> df = @dframe(:first = Dict('a' => 1, 'b' => 2), :second => Dict('a' => 3.0, 'b' => 4.5))
DataFrame{Tuple{Dict{Char,Int64},Dict{Char,Float64}},Tuple{:first,:second}}((Dict('b'=>2,'a'=>1),Dict('b'=>4.5,'a'=>3.0)),Tuple{:first,:second})
julia> df['b', :]
DataFrame{Tuple{Int64,Float64},Tuple{:first,:second}}((2,4.5),Tuple{:first,:second})
julia> df['b', Field{:first}]
2
Nifty!
On the above: A more serious implementation of name-based row indexing would obviously want to avoid storing duplicate keys:
type NamedRowDataFrame{K,C<:Tuple,F}
    df::DataFrame{C,F}
    rows::Dict{K,Int} # provides row name --> row index map
end
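A small sketch of how the lookup might then be wired up, assuming the getindex methods from the gist are defined on the wrapped df:
# Translate a row name to its integer index, then defer to the wrapped DataFrame.
Base.getindex(ndf::NamedRowDataFrame, rowname, col) = ndf.df[ndf.rows[rowname], col]

# e.g. ndf['b', Field{:first}] looks up ndf.rows['b'] and indexes the wrapped frame with it.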
A few quick thoughts:
- Something like DataFrameSubset and DataFrameRow for subsetting and selecting rows (though this should not be needed for columns).
I certainly think the lazy accumulation of transformations is the way to go. That, coupled with some of the work that Jacob Quinn showed today on SQLite.jl + Tables.jl, would get us into a vastly better state than we're in now.
@simonbyrne @johnmyleswhite The first point seems indeed to be the trend.
While it is true that dplyr is providing a unified syntax to deal with the data sources, this is, in my opinion, not the big novelty here. The bigger change that I see is that R is more and more embedded in databases and has become a first-class citizen at the database level:
What I like about monetdb is that it is a column store and potentially offers big performance gains for aggregation and filtering (at least in my use cases I don't often have situations where I have to add new rows to data.frames): http://stackoverflow.com/a/980941/279497
SQLite already has "embedding" of Julia in the DB.
I've been experimenting with another scheme for type-certain indexing. It introduces a new ColName{T} immutable, which is used instead of the DataFrame object to communicate the return type of a call to getindex. Essentially you have the following:
type DataFrame
    columns::Vector{Vector}
end

immutable ColName{T}
    idx::Int
end

@inline function Base.getindex{T}(df::DataFrame, i::Int, col::ColName{T})
    return df.columns[col.idx][i]::T
end
The above is of course just one way to organize the columns field -- one could use a Dict or even a Matrix{Any}, which would still yield type-certain indexing. The latter option may help with memory layout, though it may use more memory on the whole because of the Any.
This scheme has the potential to be very user friendly, at a cost. If one is willing to use eval(current_module(), ...) to assign globally the names of the columns to ColName{T} objects, then one can achieve very natural, type-certain indexing. The cost, aside from using eval and assigning names in the user's namespace, is that one has to be careful when working with multiple DataFrames not to use the same column name in two tables unless the columns have the same position in both tables and are of the same eltype.
I've been playing with these ideas in (warning! rough code ahead) this branch of a repo hosting an experimental Table type that can be materialized from an (equally experimental) abstract interface to a SQLite3 backend. I've essentially lifted the SQLite3 interface from @quinnj's SQLite.jl. Here's what it looks like in action so far:
julia> using Nierika
julia> db = DB("db/test.db")
Nierika.DB("db/test.db",Ptr{Void} @0x00007faa378b4a10,0)
julia> tab = select(db, "person") do
           where("id > 1")
       end |> Table
2x4 Table with columns
(id, first_name, last_name, age):
2 "Santa" "Clause" 400
3 "Barack" "Obama" 53
julia> tab[2, last_name]
"Obama"
julia> @code_warntype tab[2, last_name]
Variables:
tab::Nierika.Table
i::Int64
col::Nierika.ColName{UTF8String}
Body:
begin
$(Expr(:meta, :inline)) # /Users/David/.julia/v0.4/Nierika/src/table.jl, line 35:
return (top(typeassert))((Nierika.getindex)((top(getfield))(tab::Nierika.Table,:columns)::Any,i::Int64,(top(getfield))(col::Nierika.ColName{UTF8String},:idx)::Int64)::Any,T)::UTF8String
end::UTF8String
julia> last_name
Nierika.ColName{UTF8String}(3,"last_name")
Another potential benefit of using ColNames is that by overloading comparison operators for ColName and other arguments, one could eventually drop the need for the string arguments in the select and where methods illustrated above and just have
select(db, person) do
    where(id > 1)
end |> Table
which allows for a delayed evaluation-like effect up front and the lazy accumulation of SQL commands behind the scenes -- all without requiring the user to code with strings or use macros (one would of course need an analogous TableName type).
I don't know if the pros to this approach can outweigh the cons, but I find it interesting and intend to follow it for the time being. Any and all thoughts are appreciated!
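A hypothetical, self-contained sketch of that operator-overloading idea (the type names here are made up; a real version would presumably hang the methods off ColName itself):
import Base: >, <

immutable SQLColumn          # stand-in for a ColName that also knows its SQL name
    name::Symbol
end

immutable SQLPredicate       # a comparison produces SQL text instead of a Bool
    sql::AbstractString
end

>(col::SQLColumn, x::Real) = SQLPredicate(string(col.name, " > ", x))
<(col::SQLColumn, x::Real) = SQLPredicate(string(col.name, " < ", x))

id = SQLColumn(:id)
(id > 1).sql                 # "id > 1", ready to be spliced into the accumulated WHERE clause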
This scheme has the potential to be very user friendly, at a cost. If one is willing to use eval(current_module(), ...) to assign globally the names of the columns to ColName{T} objects, then one can achieve very natural, type-certain indexing.
My and @simonbyrne's implementations also support this idea, but I believe your implementation is way cooler than the ones posited so far, for a few reasons that haven't yet been explicitly listed -- for one, the type information lives in a single place (ColName) rather than being split among different types.
I have a question about this definition, though:
type DataFrame
    columns::Vector{Vector}
end
Type-certain indexing is granted by the new getindex definition, and avoiding full exposure of type information solves the JIT overhead issue, as mentioned. However, could the ambiguity of Vector{Vector} (or Matrix{Any}, Dict, etc.) cause any issues with other functions, or perhaps memory boxing issues?
Another potential benefit of using ColNames is that by overloading comparison operators for ColName and other arguments, one could eventually drop the need for the string arguments in the select and where methods illustrated above and just have
select(db, person) do
    where(id > 1)
end |> Table
which allows for a delayed evaluation-like effect up front and the lazy accumulation of SQL commands behind the scenes -- all without requiring the user to code with strings or use macros (One would of course need an analogous TableName type).
This is awesome!
@jrevels Thank you for your enthusiastic feedback =) You raise a very good point about the possible boxing issues, and I think this does indeed give some grief to the current design. Comparing the results of @time for the following two functions shows as much ("ab" is just a table with two columns, A and B, that each contain 100 random Float64 values):
db = DB("db/AB.db")
ab = query(db, "select * from ab") |> Table
X = rand(100)

function f(tab::Table)
    x = 0.0
    for i in 1:nrows(tab)
        x += tab[i, A]
    end
    x
end

function g(X::Array)
    x = 0.0
    for i in eachindex(X)
        x += X[i]
    end
    x
end

f(ab)
g(X)
julia> @time f(ab)
0.000023 seconds (105 allocations: 1.734 KB)
44.5651422074791
julia> @time g(X)
0.000004 seconds (5 allocations: 176 bytes)
52.8487047215383
The code_typed doesn't throw any flags w/r/t type inference, so I suspect the discrepancy is due to what you suggest (I can post the code_typed from each, if folks are interested).
One solution that's a fair bit more radical than the above is to have ColName{T} actually store the column of values itself, thus becoming more of a Column{T}. This does seem to allay the memory issues -- a rough implementation yields, after warmup:
julia> @time f(ab)
0.000008 seconds (5 allocations: 176 bytes)
44.5651422074791
So, now the DataFrame is really just a wrapper for a number of Column{T} objects that have names in the global namespace. It would be used mostly to coordinate wholesale interactions with the columns, e.g. any row-wise interaction with the DataFrame. For safety reasons, trying to setindex! into a Column would throw an error.
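A hypothetical sketch of that more radical variant (names made up), where the column object itself owns the data and the wrapper exists only for table-level operations:
immutable Column{T}
    name::Symbol
    data::Vector{T}
end

Base.getindex(col::Column, i::Int) = col.data[i]       # type-certain: returns a T
Base.setindex!(col::Column, v, i::Int) =
    error("Columns are read-only; modify the parent table instead")

type ColumnTable             # coordinates wholesale / row-wise interactions
    columns::Vector{Column}
end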
Also! I've got a rough implementation of (near) stringless delayed evaluation/lazy accumulation of SQL commands via do block syntax going:
julia> using Nierika
julia> db = DB("db/AB.db")
Nierika.DB("db/AB.db",Ptr{Void} @0x00007f98656adfb0,0)
julia> tab = select(db, "ab") do
           where(A > .5)
       end |> Table
45x3 Nierika.Table:
2 0.951741 0.872029
5 0.593651 0.983585
7 0.552128 0.367959
8 0.597448 0.874415
9 0.629397 0.0549385
10 0.556022 0.676691
# ... rest of entries suppressed
The implementation actually doesn't use ColNames at all, which is nice since it means that we needn't introduce them into the namespace until the last minute, when one wants to materialize a Julia DataFrame. Rather, it uses the anonymous function returned by the do block as a vehicle for unevaluated code, snags that code using Base.uncompressed_ast (thanks @timholy!), picks out the arguments of the Expr in which where is called, and then stringifies them into SQL code.
On the one hand, I think this is really neat, and it's nice to be able to use Julia's built-in organization of Expr objects for parsing purposes. On the other hand, I am wary of introducing a syntax that lets users interact with columns in a database as though they were actual Julia objects. It's also kind of an abuse of the do syntax, but I feel less bad about that.
@jrevels I've actually grown a little disenchanted with the global column names approach, since there's no way to control for the situation when a column name shadows a common imported name from Base (or any other module), such as count. I think your/Simon's idea is probably the way to go. It occurs to me now that the indexing needn't be as cumbersome as originally thought. Once Tuple{A} becomes expressible as {A}, one ought to be able to make your original scheme work with indexing looking like df[1, {:numbers}]. A little more cumbersome than the status quo, but arguably worth it. And the braces amortize when indexing over multiple columns: df[1, {:numbers, :letters, :emojis}]. Conceptually, it's fairly clear: all column references go inside the braces.
I've actually grown a little disenchanted with the global column names approach, since there's no way to control for the situation when a column name shadows a common imported name from Base (or any other module) such as count.
Good point. The main benefit of your approach, for me, was that using type assertions allowed for a great reduction in JIT overhead, though I suppose this:
I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption).
I tend to agree. I'd be surprised if the number of distinct DataFrames exceeded 100 in most programs, although I'm sure somebody will attempt to build a million column DataFrame by incremental calls to hcat.
...will still hold true in the face of the changes currently underway.
Once Tuple{A} becomes expressible as {A}, one ought to be able to make your original scheme work with indexing looking like df[1, {:numbers}].
I've been waiting for JuliaLang/julia#8470 to be implemented for a while now, +1 to this.
If JuliaLang/julia#1974 gets resolved in favor of overloadable field access syntax, you might even be able to get away with something similar to df.{:numbers}[1], which I like but may be a bit "out there" in terms of fitting with other Julia code stylistically.
FWIW I don't mind the style -- my main question would be whether or not it would be giving too many ways of doing the same thing, especially if we find a way to make DataFrame/Table reification convenient and performant.
The thing about the df.{:numbers}[1] syntax is that there doesn't appear to be any analogous translation of df[1, {:numbers, :letters}], so we'd need to support the latter syntax anyway.
EDIT: Oh wait, of course there is. Sometimes I forget just how powerful things like field and call overloading can be.
At the moment, I think the @generated approach is going to be too punishing on the type system: trying it out on a DataFrame with a few hundred columns causes a notable JIT delay (10s of seconds) for every operation. Until we have generated-generated functions which can partially pre-compile, this is likely to be too slow (and get us disapproving looks from Jeff and Jameson).
I think the ColName{T} approach could be made to work, and to play nice with things like SQLite.jl. I actually quite like the idea that variables would be "global" (i.e. lastname is always a UTF8String). To avoid conflicting with variable names, we could define a macro string literal, i.e. so that columns are referred to by d"lastname"?
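One hypothetical way such a d"..." literal could be spelled (the DataColumns module and its contents are made up for illustration):
module DataColumns           # assumed home for the generated column constants
    const lastname = "stand-in for a ColName{UTF8String} bound when the table is loaded"
end

macro d_str(name)
    # d"lastname" expands to DataColumns.lastname, so column names can never
    # shadow exports from Base such as count.
    return esc(:(DataColumns.$(symbol(name))))
end

d"lastname"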
@simonbyrne are you suggesting that d"lastname" expand to some internal name for the Column{T} object?
Maybe this would be excessive, but perhaps one could generate a new string macro for each DataFrame object and use that for column access. So instantiating df = DataFrame( ... ) would create a string macro df_str, and then df"first_name, last_name" could expand to df[_first_name, _last_name], where _first_name, _last_name are internal names for the Column{T} objects.
Actually, in the latter approach you may as well just splice in the names of the relevant columns and do column access at compile time.
How would the ColName{T} approach work if there are several data frames in the session with some overlapping column names, but not the types?
I wonder what the bottleneck is in the @generated approach. Is it because e.g. join() has to be recompiled for each data frame pair? Would it help if join(a::AbstractDataFrame, b::AbstractDataFrame) were just a wrapper around join_impl(a::ANY, b::ANY)? Or is it because the generation of a new data frame type takes a lot of time?
@alyst I think the bottleneck is just the sheer number of methods that need to be compiled -- at least one for each column name (depending on how accessing multiple columns is parsed). EDIT: Nevermind this first part of the comment -- I didn't understand that @alyst's question was about the combinatoric issues of method proliferation.
I don't think there is a way, under the Column{T} scheme, to have distinct DataFrames share a column name but differ on the type of the values stored in the column. One might be tempted to try something like Column{Tuple{:last_name, T}}, but then you need to re-assign last_name every time a new DataFrame with that field is initialized, or every time you add such a field to an extant DataFrame. Aside from being a logistical mess, that would also preclude all benefits to be gained by declaring the relevant Column{T} object to be constant.
I'll add that I don't think the above is too large a drawback, especially if we give users tools to modify column names as they are importing data.
It's also worth noting that giving each DataFrame a custom string macro avoids the issue. However, it would require that metaprogramming be used to initialize a DataFrame, which I'm not super into.
The advantage to something based on @generated is that you can do the name lookup at compile time, so that something like:
for i = 1:size(df, 1)
    x += df[i, Field{:x}]
end
is fast. With the ColName{T} approach, that's harder. It may or may not be possible to write the name lookup in a way that LLVM can hoist it out of the loop, but it wouldn't be easy.
As @simonbyrne notes, it is quite problematic that the naive @generated approach would result in us compiling lots of code for each particular DataFrame type, which is really bad. But I think that using wrappers as @alyst proposed might actually turn out okay. In cases where the DataFrame type is known to type inference and the wrappers are small enough to be inlined, Julia won't actually need to compile them (although it will need to run type inference).
@simonster I noticed that -- loops such as the above are slightly faster in the @generated approach.
Could the following be used to make something like that work for the ColName{T} approach? One would need (i) to make the DataFrame type parametric, where the parameter is a Symbol ID, (ii) to make the ColName type include a similar ID parameter (could just be the column name as a Symbol), and (iii) to store all the columns of all DataFrames in a separate object, say col_lib::Dict{Dict{Symbol, NullableVector}}. Then one could have
@generated function getindex{ID, T, C}(df::DataFrame{ID}, c::ColName{T, C})
    dfcols = col_lib[ID]
    col = dfcols[C]
    return :( $col )
end
Oh wait, that totally incurs the same method proliferation problem as the other approach. Hmm.
@johnmyleswhite started this interesting email thread:
https://groups.google.com/forum/#!topic/julia-dev/hS1DAUciv3M
Discussions included:
- df[1, Field{:a}()] instead of df[1, :a].
- field"a", which could make the column indexing look a bit better for the example above.
- df.a but not df[:a] (same issue as Simon's approach).
The following issues in Base may help with type certainty:
- a.b