tshort closed this issue 5 years ago
So, I hit on a very similar problem to this yesterday and came up with something you lot might be interested in – although I'm not yet sure if it's actually usable or just a horrible hack. Gist here, use at your own risk, may set Julia on fire etc.
As in Simon/Jarrett's implementations, we want to hold as much type info as possible so that we can help the compiler, but the trick is to actively hide that information from the type system until we know it will be useful. Take:
```julia
d = TypedDict(f"foo"=>1, f"bar"=>"foo")
d[f"foo"]
```
(You can of course use symbols as with regular Dicts, but using the field string macro makes both the constructor and the indexing type-inferrable.) The type of the dict is `TypedDict{I}` where `I` is a symbol – the type information is associated with the symbol in a way that the staged functions can retrieve, but since it's opaque to the compiler we're no longer trying to do inference on Big Data type tags.
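To make the trick concrete, here is a loose, self-contained sketch of the idea (not the actual gist; `TYPE_INFO`, the `f""` macro and the `TypedDict` internals are invented for illustration): the symbol parameter is opaque to inference, but a generated `getindex` can consult a side table at compile time and assert a concrete return type.

```julia
# Singleton type carrying a field name; the f"..." string macro builds one.
struct Field{name} end
macro f_str(s)
    return :(Field{$(QuoteNode(Symbol(s)))}())
end

# The dict itself only exposes an opaque symbol parameter I to the compiler.
struct TypedDict{I}
    values::Dict{Symbol,Any}
end

# Hypothetical side table mapping the symbol parameter to per-key types.
# (In the real design this would be populated by the constructor.)
const TYPE_INFO = Dict{Symbol,Dict{Symbol,DataType}}(
    :ti1 => Dict(:foo => Int, :bar => String))

# Generated getindex: the type lookup happens at code-generation time, and the
# emitted expression carries a concrete type assertion, so callers infer it.
@generated function Base.getindex(d::TypedDict{I}, ::Field{name}) where {I,name}
    T = TYPE_INFO[I][name]
    return :(d.values[$(QuoteNode(name))]::$T)
end

d = TypedDict{:ti1}(Dict{Symbol,Any}(:foo => 1, :bar => "x"))
d[f"foo"]  # inferred (and asserted) as Int
```

The key point is that `I` stays a plain symbol as far as subtyping and specialization are concerned; only the generated function peeks behind it.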
The neat part of this is that the same symbol can share type info for many different objects. Operations on `TypedDict`s which add or remove fields can generally return a `TypedDict` of the exact same type (i.e. the same symbol parameter), which means no crazy proliferation of generated code. The naive user who tries to `hcat` a million-column dataframe will be fine.
Potential caveats:
Any thoughts? I'll probably play around with this a bit more and give an update if there's any particular issues or successes.
This is an interesting idea, but I found a case that might thwart this approach for our purposes. For example, bouncing off of your code:
```julia
julia> function Base.hcat(a::TypedDict, b::TypedDict) # naive hcat
           return TypedDict(merge(a.values, b.values)...)
       end
hcat (generic function with 14 methods)

julia> a = TypedDict(f"foo"=>1, f"bar"=>"foo")
TypedDict{symbol("##ti#7555")}(Dict{Symbol,Any}(:bar=>"foo",:foo=>1))

julia> b = TypedDict(f"baz"=> 23.21)
TypedDict{symbol("##ti#7555")}(Dict{Symbol,Any}(:baz=>23.21))

julia> function f(a, b)
           d = hcat(a, b)
           return d[f"baz"]
       end
f (generic function with 1 method)

julia> @code_warntype f(a, b)
Variables:
  a::TypedDict{symbol("##ti#7555")}
  b::TypedDict{symbol("##ti#7555")}
  d::Any

Body:
  begin # none, line 2:
      d = (top(_apply))((top(getfield))(Main,:call)::F,Main.TypedDict,(Main.merge)((top(getfield))(a::TypedDict{symbol("##ti#7555")},:values)::Dict{Symbol,Any},(top(getfield))(b::TypedDict{symbol("##ti#7555")},:values)::Dict{Symbol,Any})::Dict{Symbol,Any})::Any # none, line 3:
      return (Main.getindex)(d,$(Expr(:new, Field{:baz})))::Any
  end::Any
```
So, ignoring the issue of ordering (we could just make the DataFrames version of `TypedDict` hold an `OrderedDict` of `Vector`s), I can bork inference on the indexing operation simply by preceding it with an `hcat`. Now, I wrote that `hcat` really naively, so maybe there's a better way to write it that would help?
My fear is that, under this approach, any function `f(::DataFrame) --> DataFrame` could throw a wrench into type inference, unless the caller properly manipulates the internals (e.g. writes generated functions to handle it). I might be being overly pessimistic, though.
> any function `f(::DataFrame) --> DataFrame` could throw a wrench into type inferencing
Only a relatively narrow class of functions, i.e. those that manipulate the internals of the type rather than going through an external interface. So your `hcat`/`merge` would have to do the right type gymnastics, but once `hcat` is inferable its caller doesn't have to do anything special. (I think that's the same as the other prototypes here, right?)

Happy to prototype stuff like `hcat` if it's useful.
I'm traveling atm, so I can't test this out, but here's the general issue I was thinking about:
```julia
function f(df::DataFrame)
    df2 = g(df)   # returns DataFrame
    return h(df2) # returns whatever
end
```
Even if g and h are type certain in a global context by way of using generated functions, the fact that the type information is "boxed" could potentially prevent the return type of f from being properly inferred. Like I said, I could be wrong, I'd have to see it in action to be sure.
The other "hide type info from the compiler" approaches may also have this problem (if it indeed exists and I'm not blatantly wrong).
Never mind, just got a chance to test this out, and I think proved myself wrong. I believe I understand why I was wrong too. Apologies for derailing with non-issues.
You can pull off such crazy stuff with generated functions, it throws me for a loop sometimes...
Glad to have an extra set of eyes on this either way :) I think your basic concern about generated functions proliferating everywhere is a very valid one, but it's hard to know how that will play out without experimenting a bit more, I think.
So would the idea be that the type parameter symbol is hashed from the fields & types?
Sort of. The symbol stores an arbitrary set of type information as `Symbol=>Type` pairs, and any `TypedDict` whose key/value set matches a subset of those types can use that symbol for its type info. That means that adding new keys to the dict, for example, will usually reuse the same symbol, which minimises the number of types that Julia has to specialise for.
I've been following this thread with great interest for a while now, but just recently started to run into this problem in a very real way in my own packages. I'm still not totally decided on a solid interface, but I feel like I've learned a couple of things along the way.
- `NA`: NullableArrays is clearly the path forward and I'd encourage a swift switch to make DataFrames depend on it by default.
- `DataTable` is limited to strong typing for `Int`, `Float64`, strings, `Date`, and `DateTime`; all other columns get lumped in an `Any` bucket.
- "Type kernel" methods look like `kernelmethod{T}(col::NullableVector{T})`, or one method for each supported type (`Int`, `Float64`, etc.).

On the need for "type kernel" methods, I'm starting to feel quite certain there's no amount of "hacking" that will get around the need for them. I think it will take an extra compiler optimization as mentioned in https://github.com/JuliaLang/julia/issues/5654, or some other kind of generated types approach. The key missing optimization is a kind of intelligence during type inference: when you pull a parameterized element out of another object, it's necessary to have the ability to further type-infer code on the extracted parameterized element. The simplest case I have is a field in an object like `Vector{Vector{T}}`. Each individual `Vector{T}` will have a different `T` and would end up with slightly different code patterns depending on it. Currently, however, type inference can only see `Vector{T}` as the inferred type and no further specialization is done, despite the fact that there will be some kind of specialized `T` at run-time.
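The usual workaround for this is a function barrier: pull the element out of the heterogeneous container and immediately pass it to a method that specializes on the concrete `T`. A hedged sketch (the names are invented for illustration):

```julia
# Kernel: compiled once per concrete element type T, so the loop is fully typed.
function column_sum(col::Vector{T}) where {T<:Number}
    s = zero(T)
    for x in col
        s += x
    end
    return s
end

# Driver over a heterogeneous Vector{Vector}: each `col` is abstractly typed
# here, but the call into column_sum dispatches once and then runs specialized.
function total(cols::Vector{Vector})
    t = 0.0
    for col in cols
        t += column_sum(col)  # dynamic dispatch once per column, not per element
    end
    return t
end

cols = Vector[[1, 2, 3], [1.5, 2.5]]
total(cols)  # 10.0
```

This pays one dynamic dispatch per column rather than per element, which is usually cheap enough in practice.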
I'd appreciate any feedback on `DataTable`; it's by no means my "replacement" for DataFrames. On the contrary, I'd much rather see progress in DataFrames and type inference. In the meantime, if anyone wants to help me hack on it, I'd appreciate the help.
What does the restriction (in terms of `Int`, `Float64`, `Date`, etc.) allow in terms of type stability?
It seems like there are two issues with the current proposals: the syntax for type stable indexing is cumbersome, and each new dataframe generates a new type.
I'm wondering if the following works: keep the current `DataFrame` type by default and — only when needed — create a fast iterator over dataframe rows with `eachrow(df)`. `eachrow` would return something similar to the specialized dataframe type proposed above. The syntax would be something like:
```julia
function f(eachrow::EachRow)
    out = 0
    for (v1, v2) in eachrow
        out += v2 > 0 ? v1 : 0
    end
    return out
end

df = DataFrame(x = 1:2, y = [0.1, 3.0], z = 2:3)
f(eachrow(df[[:x, :y]]))
```
I don't think that would work, unfortunately. The type of the iterator can't be inferred either, since it depends on the `DataFrame` argument, whose type doesn't provide information on the types of the columns.
Yeah, sorry, I updated my post. I think this one works, right?
I don't understand your new post: does it need a data frame specialized on the column types or not? You seem to imply both that and the contrary.
For a quick summary, see @johnmyleswhite's post: http://www.johnmyleswhite.com/notebook/2015/11/28/why-julias-dataframes-are-still-slow/
> Gist here, use at your own risk, may set Julia on fire etc.
@MikeInnes - this link is now broken, but I'm rather interested in seeing the direction you were taking. Do you still have the code around somewhere?
I'd also like to check out the gist.
Sure, you can actually see the whole code at Data.jl and I put up a self-contained gist here. Data.jl is basically just a fleshed-out version where the full DataFrame object is built on top of the TypedDict.
Honestly though, when I was playing around with this I'm not sure the type inferrability of dataframe indexing made a whole lot of difference. You end up having to factor operations out into acting on columns anyway (see e.g. DecisionTrees.jl).
For me the biggest wins came from (1) replacing `DataArray` with `Vector{Nullable}`, (2) specialising columns where appropriate (removing `Nullable`, pooled arrays, etc.), and (3) writing fast column-level kernels for high-level operations. That gets you performance on par with scikit-learn without having to mess around with matrices and strange data encodings.
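Win (3), column-level kernels, can be sketched like this (a hedged illustration with invented names, written in post-0.7 syntax where the `Nullable` of this era has been superseded by `missing` in Base; the idea is identical):

```julia
# A kernel that operates on one concretely typed nullable column. Because the
# element type is a small, known Union, the loop avoids boxing and dynamic
# dispatch on every element.
function count_positive(col::Vector{Union{Float64,Missing}})
    n = 0
    for x in col
        if x !== missing && x > 0
            n += 1
        end
    end
    return n
end

count_positive(Union{Float64,Missing}[1.0, missing, -2.0, 3.5])  # 2
```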
We should probably document the ways in which non-concrete types harm the performance of code written for DataFrames until we have a better design that we can all agree on. Each of the problems that comes up is very different from the others.
```julia
df = get_data_frame()
for r in eachrow(df)
    do_something_per_row(r)
end
do_something_to_whole_table(df)
```

```julia
df = get_data_frame()
do_something_to_whole_table(df)
```

Scalar values typed as `Union{T, NAtype}` get boxed, and computations on them go through dynamic dispatch, so the computations are much slower than you'd get if you worked with a vector whose elements were all a primitive type like `Float64`.

```julia
df = get_data_frame()
do_something_to_columns(df[:a], df[:b], df[:c])

function do_something_to_columns(a, b, c)
    s = 0.0
    for i in 1:length(a)
        s += a[i] + b[i] + c[i]
    end
    return s
end
```
@johnmyleswhite What is a "kernel function" in this context?
It's possible I don't understand what other people mean, but I'm using "kernel function" to refer to a function `f(args...)` all of whose arguments are the columns of a DataFrame, after extracting them and resolving any type-uncertainty. This function is where the core computational work happens; as long as it's fast, the rest of the system has acceptable performance.
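For concreteness, a kernel in this sense might look like the following sketch (the names are invented, and a `Dict` stands in for the DataFrame so the example is self-contained):

```julia
# Kernel: every argument is a concretely typed column, so the hot loop
# compiles to tight machine code.
function sum3_kernel(a::Vector{Float64}, b::Vector{Float64}, c::Vector{Float64})
    s = 0.0
    for i in 1:length(a)
        s += a[i] + b[i] + c[i]
    end
    return s
end

# Wrapper: resolves the type-uncertainty once, at the boundary, via
# assertions; the inner call is then fully inferable.
function sum3_wrapper(df)
    a = df[:a]::Vector{Float64}
    b = df[:b]::Vector{Float64}
    c = df[:c]::Vector{Float64}
    return sum3_kernel(a, b, c)
end

df = Dict{Symbol,Any}(:a => [1.0, 2.0], :b => [3.0, 4.0], :c => [5.0, 6.0])
sum3_wrapper(df)  # 21.0
```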
Hi everyone,
At work we need to manage data with multiple types and I've also been working on how to do typed data containers in Julia and have created a reasonably functional package called Tables.jl.
My approach is a little more complex (and heavy on metaprogramming) than the example given by @MikeInnes. But it also includes a first-class `Row` tuple-wrapper which actually makes dealing with data in row chunks fast (@johnmyleswhite - in Julia we can have our cake and eat it). Or you can extract the columns / raw data / etc. as necessary.
I have seen recent additions to Julia 0.5 that fix a few speed problems with tuples, so this should become a reasonably fast approach (it should already be better than DataFrames with no care taken with regards to type stability, but I haven't done any in-depth benchmarking beyond inspecting the generated llvm/native code). Furthermore, using the new `@pure` meta and factoring out some of the metaprogramming (into a separate typed-dict package or something) should simplify the code significantly and allow for static compilation in the future.
I would appreciate any feedback. I was also thinking of making a PR to METADATA to make it public, but I definitely didn't want to annoy or upset the JuliaStats community either, so I am also seeking comments regarding this (i.e. if there was soon-to-be-released something similar from you guys, or if there was vehement opposition to having multiple packages with some overlap in functionality, etc).
Great, @andyferris! Please feel free to register Tables. Looks like lots of good stuff in there. The API for Tables is wordy and noisy; that may be the price for type stability. To simplify, could `@pure` allow `tbl[:A]` to be treated like `tbl[Val{:A}]`?
Are you planning conversion methods to DataFrames?
@tshort Thanks, and yes, that is quite possible. At the very least, I just tried
```julia
julia> function f(x::Symbol)
           Base.@_pure_meta
           Val{x}
       end
f (generic function with 1 method)

julia> g() = f(:a)
g (generic function with 1 method)

julia> code_warntype(g,())
Variables:
  #self#::#g

Body:
  begin # none, line 1:
      return $(QuoteNode(Val{:a}))
  end::Type{Val{:a}}
```
in nightly. This seems promising to me. At the very least, this doesn't seem possible in 0.4:
```julia
julia> @inline f2(x::Symbol) = Val{x}
f2 (generic function with 1 method)

julia> g2() = f2(:a)
g2 (generic function with 1 method)

julia> code_warntype(g2,())
Variables:
  #self#::#g2

Body:
  begin # none, line 1:
      return (top(apply_type))(Main.Val,:a)::Type{_<:Val{T}}
  end::Type{_<:Val{T}}
```
This change will make the interface for Tables.jl quite a bit nicer to use. Would be interesting if the constructors could be cleaned up too.
And yes methods to eat and spit out DataFrames is planned... at the very least that will provide I/O in the short-term.
Great news, @andyferris. I'd shoot for v0.5 then. That would make the "everyday" API much nicer.
Really interesting! I hadn't anticipated that `@pure` would allow `tbl[:A]` to be type-stable.
I have a few questions, which may well reflect my misunderstanding:
1) Why do `@table`, `@field` and `@cell` need to be macros?

2) Likely related: why are the type annotations in `@table(A::Int64=[1,2,3], B::Float64=[2.0,4.0,6.0])` or `@cell(A::Int64=1)` required? I guess they can be useful if you want to force an abstract type to be used when passing a single value to `@cell`, but shouldn't inference be enough in other cases?

3) I don't understand the significance of, or need for, `FieldIndex`. Shouldn't this information be stored as a `Tuple{}` of field types in the table or row objects, instead of being exposed to the user?

(Finally, the name bikeshedding session: I think `Tables.jl` is too general. I also had a package called that, and I renamed it to `FreqTables.jl` before registering it, as tables can mean very different things in different fields. How about `DataTables.jl`, which is the name you use at the top of README.md? Anyway, probably better to discuss this in the `METADATA.jl` registration PR to keep the present thread focused.)
Thanks @nalimilan. Yes, the `@pure` thing is pretty cool... I'm not certain yet whether it is a "compiler hint" or follows a clear set of rules. Currently Julia's type inference system gives up and goes on holiday when, e.g., passing around large tuples and so on. Fortunately, generated functions seem to cause it to not be lazy (for obvious reasons).
Let me address your questions
1) The macros are only for convenience. The inner constructor for the table is something like:

```julia
Table{FieldIndex{(Field{:A,Int64}(), Field{:B,Float64}())}(), Tuple{Int64,Float64}, Tuple{Array{Int64,1},Array{Float64,1}}}(([1,2,3], [2.0,4.0,6.0]))
```
Yuk, right? Anyway, I realize it is dreadful and some information can be computed (the element types from the array types, perhaps the fields could just be names, I don't know).
2) That was just my original design. A `Field` combines a name and a type, just like the fields of `type` and `immutable` objects. The programmer designs a table with specific data in mind, and data can be converted upon input, or whatever, transparently. The type contract is never broken.
As for the macros, I wanted to allow both field names and `Field` objects to be used, so the non-type-annotated version assumes a field already exists, like:

```julia
field_A = Field{:A,Int64}()
cell_1 = @cell(field_A = 42)
cell_2 = @cell(A::Int64 = 42) # == cell_1
```

I wasn't sure how to make both inputs work. Different macros? Use a symbol, like `@cell(:A = 42)`? I think the API could definitely be refined.
3) Great question. This was a design decision I made for a couple of reasons. First, I liked that I could write an object where some invariants are checked in the constructor (e.g. it won't allow two fields with the same name, or any called `:Row`). Similarly, being a distinct type means I can dispatch many things on the type, such as `call` with a tuple, which creates a `Row` or `Table` depending on the types.
I believe this could be simplified, but some things might have to be checked later in the process. Also, working with tuple types in 0.4 is a bit painful. I'm guessing there will be enough metaprogramming power in `@pure` that some of these things can be done differently (and without generated functions).
Finally, regarding the package name, I shortened the module name myself when I got frustrated with typing a longer name during development (frequent `reload` commands mean I use `import`, not `using`). Tables.jl, DataTables.jl and TypedTables.jl are all acceptable to me, but I plan to bring this up in the PR.
Sorry for the late reply. I wonder why you wouldn't be able to simplify the type to this: `Table{Tuple{:A, :B}, Tuple{Array{Int64,1}, Array{Float64,1}}}`, i.e. a tuple type of column names and a tuple type of column types. This wouldn't prevent checking for invariants at all. Then `Table(A=[1,2,3], B=[2.0,4.0,6.0])` could easily infer the type parameters from its arguments.
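A hedged sketch of that kind of parameterization (invented names, written with the names as a value-tuple parameter; this is essentially how Base's later `NamedTuple{names, T<:Tuple}` came to be parameterized):

```julia
# Column names as a value parameter, columns as a tuple of vectors.
struct Table{names, C<:Tuple}
    columns::C
end

# Keyword constructor infers both parameters from its arguments.
function Table(; kwargs...)
    nt = (; kwargs...)  # materialize the keyword arguments as a NamedTuple
    return Table{keys(nt), typeof(values(nt))}(values(nt))
end

# A generated lookup keeps column indexing type-stable: the position of the
# name is computed at code-generation time.
@generated function getcolumn(t::Table{names}, ::Val{name}) where {names, name}
    i = findfirst(==(name), collect(names))
    return :(t.columns[$i])
end

t = Table(A = [1, 2, 3], B = [2.0, 4.0, 6.0])
getcolumn(t, Val(:A))  # the Vector{Int} column, with a concrete inferred type
```

Invariants (no duplicate names, etc.) could still be checked in the keyword constructor before the parameters are spliced in.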
Finally, I don't think you need to store both the type of the columns and their element type: `eltype` should always provide this information if you call it on the column type. This should help reduce the number of parameters.
Yes, Milan, good idea; I agree, and have thought about this lately, but I haven't got around to doing a type-design iteration yet. And yes, the invariants can be checked; it's just a matter of when. I had included both the element types and storage types in the header because I was trying to see how far one could go without generated functions, since `eltypes` (plural, not singular) fails to generate good type information on Julia 0.4 if it is not `@generated`. One could also use Julia 0.5 pure functions to fix that. But as it stands, most of the type-stability (and zero run-time overhead) relies heavily on generated functions.
The other place the element type is useful is getting the correct function signature, for instance to push a `Row` onto a `Table`. But it is superfluous: one just moves the checks into the function (a generated/pure helper function should lead to zero run-time overhead).
Do you think `Tuple{:A,:B}` is OK? Interestingly, it's a type that can't be instantiated, and Stefan and/or Jeff once saw I had done that before and thought it was a mistake that Julia even allows it (sorry, I can't seem to find the reference to that discussion right now).
The other two possibilities are `Table{Tuple{Val{:A}, Val{:B}}, Tuple{Array{Int64,1}, Array{Float64,1}}}` or `Table{(:A, :B), Tuple{Array{Int64,1}, Array{Float64,1}}}`.
I was considering the latter. A field name might be carried in a `Val` of a symbol, and a collection of fields (currently `FieldIndex`) in a `Val` of a tuple of symbols. Or, I could make new types like `Name` and `Names` for these, which would have many methods defined (currently I can take the union of two `FieldIndex`es; I would feel bad defining a bunch of methods on `Val`). Or, fields and indices can stay as they are, with full type information.
Your opinion?
I would go with the latter (`(:A, :B)`), as indeed the column names are not types, they are just parameters. And I don't think `Val` is supposed to be used that way; it's just a way to dispatch on a specific value when calling a function.
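For reference, the dispatch-on-a-value pattern being described is just this (illustrative names):

```julia
# One method per value: the symbol is lifted into the type domain via Val,
# so ordinary multiple dispatch selects the method.
process(::Val{:fast}, x) = 2x
process(::Val{:careful}, x) = 2x + 1

process(Val(:fast), 10)     # 20
process(Val(:careful), 10)  # 21
```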
Indeed, I agree with you @nalimilan. I expect I'll go that way when I find the time. :)
As a general comment, I don't think there is much top-down imposition nor much community consensus on how `Val` should be used. It is defined in `Base` as `immutable Val{T}; end` without comment or any methods defined for manipulating `Val`s; furthermore, there are AFAIK just a few comments online discussing its use (e.g. `Val{T}()` vs `Val{T}`). Certainly, `Tuple{Val{x}}` can at least be instantiated, as `(Val{x}(),)`, so it is better than `Val{:A,:B}`. Also compare using `Val` types vs other singletons in `Base` like `LinearFast()`, which are passed as values, not types.
Anyway, that is just an aside. I have some deeper questions for actual R data.frames/data.table/dplyr and python-pandas users. I've come up with a macro `@select` which easily combines the notation of dplyr's `select` and `mutate` together (my `Table`s are immutable, so mutate doesn't even make sense in Julia). It is used like:

```julia
@select(table, col1, newname = col2, newcol::newtype = col1 -> f(col1))
```

where it has three possible behaviours, respectively: selecting a column, renaming a column (and including it), and creating a new column via a row-by-row calculation. The syntax is quite constrained by the semantics of Julia, but any comments or suggested improvements would be welcome!
I tried to emulate `dplyr`-like syntax since it seems to have the preferable syntax (compared to R's data.table, which people seem to prefer for speed, not syntax). As such, I was hoping to implement `@filter`, `@arrange`, etc. Suggestions and greedy desires from other data manipulators would be useful at this early stage!
@andyferris have you seen this? https://github.com/JuliaStats/DataFramesMeta.jl
Sounds like a good strategy. Like @datnamer, I think the state of our reflections with regard to this is in DataFramesMeta (see also the issues in that project). It would make sense to follow SQL/LINQ/dplyr as much as possible, unless there's a good reason not to.
There has also been some discussion regarding APIs here: https://github.com/JuliaStats/DataFrames.jl/issues/369.
This may be premature, but are there any thoughts on SQL code generation?
Edit: There is also this https://github.com/MikeInnes/Flow.jl. I wonder if that can help (query planning, delayed eval etc)
@datnamer: quick answer regarding SQL is "no". But thank you to both of you for interesting reads (I definitely had seen DataFramesMeta quite some time ago but hadn't thought about it lately).
Regarding #369, is the point just syntax where you write code in devectorized form? I think my new `@select` macro should be fine for this, since it uses a hidden comprehension, which allows for pretty generic code inside. The user can declare the element type but not the container type (so it's hard to use NullableArrays or pooled data structures; I need to think more about this).
> Regarding #369, the point is just syntax where you write code in devectorized form? I think my new @select macro should be fine for this since it uses a hidden comprehension, which allow for pretty generic code inside. The user can declare the element type but not the container type (so its hard to use NullableArrays or pooled data structures - I need to think more about this).
Well, mostly, but it would also allow you to refer to all existing columns as variables, as well as to create new ones like you would create variables. So it would take a complete code block instead of short expressions passed as separate arguments. See the example I provided there. It's particularly useful when combining several variables using `if` to create a new one. Then there's also the question of propagating nulls automatically, which is a bit harder.
Closing, as we're not going to make the `DataFrame` type encode column type information in 1.0.
@johnmyleswhite started this interesting email thread: https://groups.google.com/forum/#!topic/julia-dev/hS1DAUciv3M

Discussions included:

- `df[1, Field{:a}()]` instead of `df[1, :a]`.
- A string macro, `field"a"`, could make the column indexing look a bit better for the example above.
- Columns accessible as `df.a` but not as `df[:a]` (same issue as Simon's approach).

The following issues in Base may help with type certainty:

- Overloading `a.b`