JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

DataFrames replacement that has no type-uncertainty #744

Closed tshort closed 5 years ago

tshort commented 9 years ago

@johnmyleswhite started this interesting email thread:

https://groups.google.com/forum/#!topic/julia-dev/hS1DAUciv3M

Discussions included:

The following issues in Base may help with type certainty:

MikeInnes commented 9 years ago

So, I hit on a very similar problem to this yesterday and came up with something you lot might be interested in – although I'm not yet sure if it's actually usable or just a horrible hack. Gist here, use at your own risk, may set Julia on fire etc.

As in Simon/Jarrett's implementations, we want to hold as much type info as possible so that we can help the compiler, but the trick is to actively hide that information from the type system until we know it will be useful. Take:

d = TypedDict(f"foo"=>1, f"bar"=>"foo")
d[f"foo"]

(You can of course use symbols as with regular Dicts, but using the field string macro makes both the constructor and the indexing type-inferrable.) The type of the dict is TypedDict{I} where I is a symbol – the type information is associated with the symbol in a way that the staged functions can retrieve, but since it's opaque to the compiler we're no longer trying to do inference on Big Data type tags.
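The mechanism might be sketched like this in current Julia (hypothetical names such as TYPE_INFO and typed_get; Mike's gist differs in detail, and a mutable global registry read from a generated function is itself a hack):

```julia
# Rough sketch of the trick, not Mike's actual code: type info lives in a
# global registry keyed by a gensym'd symbol, so the compiler only ever sees
# TypedDict{I}, while generated functions can look the concrete types up.
const TYPE_INFO = Dict{Symbol,Dict{Symbol,Type}}()

struct TypedDict{I}
    values::Dict{Symbol,Any}
end

function TypedDict(pairs::Pair{Symbol}...)
    I = gensym("ti")
    TYPE_INFO[I] = Dict{Symbol,Type}(k => typeof(v) for (k, v) in pairs)
    return TypedDict{I}(Dict{Symbol,Any}(pairs...))
end

# Runs at compile time with access to I, so it can hard-code each field's type.
@generated function typed_get(d::TypedDict{I}, ::Val{k}) where {I,k}
    T = TYPE_INFO[I][k]
    return :(d.values[$(QuoteNode(k))]::$T)
end

d = TypedDict(:foo => 1, :bar => "hi")
typed_get(d, Val(:foo))  # the type assertion makes this inferrable as Int
```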

The neat part of this is that the same symbol can share type info for many different objects. Operations on TypedDicts which add or remove fields can generally return a TypedDict of the exact same type (i.e. the same symbol parameter), which means no crazy proliferation of generated code. The naive user who tries to hcat a million-column dataframe will be fine.

Potential caveats:

Any thoughts? I'll probably play around with this a bit more and give an update if there's any particular issues or successes.

jrevels commented 9 years ago

This is an interesting idea, but I found a case that might thwart this approach for our purposes. For example, bouncing off of your code:

julia> function Base.hcat(a::TypedDict, b::TypedDict) # naive hcat
           return TypedDict(merge(a.values, b.values)...)
       end
hcat (generic function with 14 methods)

julia> a = TypedDict(f"foo"=>1, f"bar"=>"foo")
TypedDict{symbol("##ti#7555")}(Dict{Symbol,Any}(:bar=>"foo",:foo=>1))

julia> b = TypedDict(f"baz"=> 23.21)
TypedDict{symbol("##ti#7555")}(Dict{Symbol,Any}(:baz=>23.21))

julia> function f(a, b)
           d = hcat(a, b)
           return d[f"baz"]
       end
f (generic function with 1 method)

julia> @code_warntype f(a, b)
Variables:
  a::TypedDict{symbol("##ti#7555")}
  b::TypedDict{symbol("##ti#7555")}
  d::Any

Body:
  begin  # none, line 2:
      d = (top(_apply))((top(getfield))(Main,:call)::F,Main.TypedDict,(Main.merge)((top(getfield))(a::TypedDict{symbol("##ti#7555")},:values)::Dict{Symbol,Any},(top(getfield))(b::TypedDict{symbol("##ti#7555")},:values)::Dict{Symbol,Any})::Dict{Symbol,Any})::Any # none, line 3:
      return (Main.getindex)(d,$(Expr(:new, Field{:baz})))::Any
  end::Any

So, ignoring the issue of ordering (we could just make the DataFrames version of TypedDict hold an OrderedDict of Vectors), I can bork inferencing on the indexing operations simply by preceding it with an hcat. Now, I wrote that hcat really naively, so maybe there's a better way to write it that would help?

My fear is that, under this approach, any function f(::DataFrame) --> DataFrame could throw a wrench into type inferencing, unless the caller properly manipulates the internals (e.g. write generated functions to handle it). I might be being overly pessimistic, though.

MikeInnes commented 9 years ago

any function f(::DataFrame) --> DataFrame could throw a wrench into type inferencing

Only a relatively narrow class of functions, i.e. those that manipulate the internals of the type rather than going through an external interface. So your hcat/merge would have to do the right type gymnastics, but once hcat is inferable its caller doesn't have to do anything special. (I think that's the same as the other prototypes here, right?)
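A standalone toy illustrating that point (nothing here is from Mike's gist): inference composes through calls, so only the innermost function needs the type gymnastics.

```julia
# If the inner function's return type is concretely inferable, every caller
# is inferable with no extra annotations; the burden sits in one place.
merge_cols(a::Vector{Int}, b::Vector{Float64}) = (a, b)  # Tuple{Vector{Int},Vector{Float64}}

caller(a, b) = sum(merge_cols(a, b)[2])  # inferred as Float64 "for free"

caller([1, 2], [0.5, 1.5])  # → 2.0
```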

Happy to prototype stuff like hcat if it's useful.

jrevels commented 9 years ago

I'm traveling atm, so I can't test this out, but here's the general issue I was thinking about:

function f(df::DataFrame)
    df2 = g(df) # returns DataFrame
    return h(df2) # returns whatever
end

Even if g and h are type certain in a global context by way of using generated functions, the fact that the type information is "boxed" could potentially prevent the return type of f from being properly inferred. Like I said, I could be wrong, I'd have to see it in action to be sure.

The other "hide type info from the compiler" approaches may also have this problem (if it indeed exists and I'm not blatantly wrong).

jrevels commented 9 years ago

Never mind, just got a chance to test this out, and I think I proved myself wrong. I believe I understand why I was wrong, too. Apologies for derailing with non-issues.

You can pull off such crazy stuff with generated functions, it throws me for a loop sometimes...

MikeInnes commented 9 years ago

Glad to have an extra set of eyes on this either way :) I think your basic concern about generated functions proliferating everywhere is a very valid one, but it's hard to know how that will play out without experimenting a bit more, I think.

simonbyrne commented 9 years ago

So would the idea be that the type parameter symbol is hashed from the fields & types?

MikeInnes commented 9 years ago

Sort of. The symbol stores an arbitrary set of type information Symbol=>Type and any TypedDict whose key/value set matches a subset of those types can use that key for type info. That means that adding new keys to the dict, for example, will usually reuse the same key, which minimises the number of types that Julia has to specialise for.

quinnj commented 9 years ago

I've been following this thread with great interest for a while now, but just recently started to run into this problem in a very real way in my own packages. I'm still not totally decided on a solid interface, but I feel like I've learned a couple of things along the way.

On the need for "type kernel" methods, I'm starting to feel quite certain there's no amount of "hacking" that will get around the need for them. I think it will take an extra compiler optimization as mentioned in https://github.com/JuliaLang/julia/issues/5654, or some other kind of generated-types approach. The key missing optimization is intelligence during type inference: when you pull a parameterized element out of another object, inference needs the ability to further specialize code on the extracted element. The simplest case I have is a field of type Vector{Vector{T}}: each individual Vector{T} has a different T and would end up with slightly different code patterns accordingly. Currently, however, type inference can only see Vector{T} as the inferred type and no further specialization is done, despite the fact that there will be some concrete T at run time.
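The inference gap can be seen in a few lines (the Columns type here is hypothetical, just for illustration):

```julia
# The field type records only `Vector`, so inference stops there, even though
# each element has a concrete type (Vector{Int}, Vector{Float64}) at run time.
struct Columns
    data::Vector{Vector}  # abstract element type, as in quinnj's example
end

cols = Columns(Vector[[1, 2, 3], [1.0, 2.0]])

# `c.data[1]` infers only as Vector, so this call is dynamically dispatched.
col_total(c::Columns) = sum(c.data[1])

col_total(cols)  # → 6
```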

I'd appreciate any feedback on DataTable; it's by no means my "replacement" for DataFrames. On the contrary, I'd much rather see progress in DataFrames and type inference. In the mean time, if anyone wants to help me hack on it, I'd appreciate the help.

matthieugomez commented 9 years ago

What does the restriction (in terms of Int, Float64, Date, etc.) allow in terms of type stability?

matthieugomez commented 9 years ago

It seems like there are two issues with the current proposals: the syntax for type stable indexing is cumbersome, and each new dataframe generates a new type. I'm wondering if the following works: keep the current DataFrame type by default and — only when needed — create a fast iterator on dataframe rows with eachrow(df). eachrow would return something similar to the specialized dataframe type proposed above. The syntax would be something like:

function f(eachrow::EachRow)
    out = 0
    for (v1, v2) in eachrow
        out += v2 > 0 ? v1 : 0
    end
    return out
end

df = DataFrame(x = 1:2, y = [0.1, 3.0], z = 2:3)
f(eachrow(df[[:x, :y]]))

nalimilan commented 9 years ago

I don't think that would work, unfortunately. The type of the iterator can't be inferred either, since it depends on the DataFrame argument whose type doesn't provide information on the types of the columns.

matthieugomez commented 9 years ago

Yeah, sorry, I updated my post. I think this one works, right?

nalimilan commented 9 years ago

I don't understand your new post: does it need a data frame specialized on the column types or not? You seem to imply both that and its contrary.

simonbyrne commented 8 years ago

For a quick summary, see @johnmyleswhite's post: http://www.johnmyleswhite.com/notebook/2015/11/28/why-julias-dataframes-are-still-slow/

c42f commented 8 years ago

Gist here, use at your own risk, may set Julia on fire etc.

@MikeInnes - this link is now broken, but I'm rather interested in seeing the direction you were taking. Do you still have the code around somewhere?

datnamer commented 8 years ago

I'd also like to check out the gist.

MikeInnes commented 8 years ago

Sure, you can actually see the whole code at Data.jl and I put up a self-contained gist here. Data.jl is basically just a fleshed-out version where the full DataFrame object is built on top of the TypedDict.

Honestly though, when I was playing around with this I'm not sure the type inferrability of dataframe indexing made a whole lot of difference. You end up having to factor operations out into acting on columns anyway (see e.g. DecisionTrees.jl).

For me the biggest wins came from (1) replacing DataArray with Vector{Nullable}, (2) specialising columns where appropriate (removing Nullable, pooled arrays etc.) (3) writing fast column-level kernels for high-level operations. That gets you performance on par with scikit-learn without having to mess around with matrices and strange data encodings.

johnmyleswhite commented 8 years ago

We should probably document the ways in which non-concrete types harm the performance of code written for DataFrames until we have a better design that we can all agree on. Each of the problems that comes up is very different from the others.

1) Row iteration:

df = get_data_frame()

for r in eachrow(df)
    do_something_per_row(r)
end

2) Whole-table functions:

df = get_data_frame()

do_something_to_whole_table(df)

3) Column-level kernel functions:

df = get_data_frame()

do_something_to_columns(df[:a], df[:b], df[:c])

function do_something_to_columns(a, b, c)
    s = 0.0
    for i in 1:length(a)
        s += a[i] + b[i] + c[i]
    end
    return s
end

davidagold commented 8 years ago

@johnmyleswhite What is a "kernel function" in this context?

johnmyleswhite commented 8 years ago

It's possible I don't understand what other people mean, but I'm using kernel function to refer to a function f(args...) all of whose arguments are the columns of a DataFrame after extracting them and resolving any type-uncertainty. This function is where the core computational work happens; as long as it's fast, the rest of the system has acceptable performance.
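A minimal sketch of that pattern (hypothetical names; the Dict{Symbol,Any} stands in for a type-uncertain DataFrame):

```julia
# The kernel: all arguments are concrete columns, so the loop is fully typed.
function col_sum(a::AbstractVector, b::AbstractVector)
    s = zero(eltype(a)) + zero(eltype(b))
    for i in 1:length(a)
        s += a[i] + b[i]
    end
    return s
end

# Type-uncertain container standing in for a DataFrame.
store = Dict{Symbol,Any}(:a => [1, 2, 3], :b => [0.5, 0.5, 0.5])

# Extraction is type-uncertain, but dispatch into the kernel resolves the
# concrete column types, so the core work runs at full speed.
total(store) = col_sum(store[:a], store[:b])

total(store)  # → 7.5
```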

andyferris commented 8 years ago

Hi everyone,

At work we need to manage data with multiple types. I've also been working on how to do typed data containers in Julia, and have created a reasonably functional package called Tables.jl.

My approach is a little more complex (and heavy on metaprogramming) than the example given by @MikeInnes. But it also includes a first-class Row tuple-wrapper which actually makes dealing with data in row chunks fast (@johnmyleswhite - in Julia we can have our cake and eat it). Or you can extract the columns / raw data / etc as necessary.

I have seen recent additions to Julia 0.5 that fix a few speed problems with tuples, so this should become a reasonably fast approach (it should already be better than DataFrames with no care taken with regards to type stability, but I haven't done any in-depth benchmarking beyond inspecting the generated llvm/native code). Furthermore, using the new @pure meta and factoring out some of the metaprogramming (into a separate typed-dict package or something) should simplify the code significantly and allow for static compilation in the future.

I would appreciate any feedback. I was also thinking of making a PR to METADATA to make it public, but I definitely don't want to annoy or upset the JuliaStats community, so I am also seeking comments on that (i.e. whether there is something similar soon to be released from you guys, or vehement opposition to having multiple packages with some overlap in functionality, etc.).

tshort commented 8 years ago

Great, @andyferris! Please feel free to register Tables. Looks like lots of good stuff in there. The API for Tables is wordy and noisy. That may be the price for type stability. To simplify, could @pure allow tbl[:A] to be treated like tbl[Val{:A}]?

Are you planning conversion methods to DataFrames?

andyferris commented 8 years ago

@tshort Thanks, and yes, that is quite possible. At the very least, I just tried

julia> function f(x::Symbol)
         Base.@_pure_meta
         Val{x}
       end
f (generic function with 1 method)

julia> g() = f(:a)
g (generic function with 1 method)

julia> code_warntype(g,())
Variables:
  #self#::#g

Body:
  begin  # none, line 1:
      return $(QuoteNode(Val{:a}))
  end::Type{Val{:a}}

in nightly. This seems promising to me. At the very least, this doesn't seem possible in 0.4:

julia> @inline f2(x::Symbol) = Val{x}
f2 (generic function with 1 method)

julia> g2() = f2(:a)
g2 (generic function with 1 method)

julia> code_warntype(g2,())
Variables:
  #self#::#g2

Body:
  begin  # none, line 1:
      return (top(apply_type))(Main.Val,:a)::Type{_<:Val{T}}
  end::Type{_<:Val{T}}

This change will make the interface for Tables.jl quite a bit nicer to use. Would be interesting if the constructors could be cleaned up too.

And yes methods to eat and spit out DataFrames is planned... at the very least that will provide I/O in the short-term.

tshort commented 8 years ago

Great news, @andyferris. I'd shoot for v0.5 then. That would make the "everyday" API much nicer.

nalimilan commented 8 years ago

Really interesting! I hadn't anticipated that @pure would allow tbl[:A] to be type-stable.

I have a few questions, which may well reflect my misunderstanding:

1) Why do @table, @field and @cell need to be macros?
2) Likely related: why are the type annotations in @table(A::Int64=[1,2,3], B::Float64=[2.0,4.0,6.0]) or @cell(A::Int64=1) required? I guess they can be useful if you want to force an abstract type to be used when passing a single value to @cell, but shouldn't inference be enough in other cases?
3) I don't understand the significance of/need for FieldIndex. Shouldn't this information be stored as a Tuple{} of field types in the table or row objects, instead of being exposed to the user?

(Finally, the name bikeshedding session: I think Tables.jl is too general. I also had a package called that way, and I renamed it to FreqTables.jl before registering it, as tables can mean very different things in different fields. How about DataTables.jl, which is the name you use at the top of README.md? Anyway, probably better discuss this in the METADATA.jl registration PR to keep the present thread focused.)

andyferris commented 8 years ago

Thanks @nalimilan. Yes, the @pure thing is pretty cool... I'm not certain yet if it is a "compiler hint" or follows a clear set of rules. Currently Julia's type inference system gives up and goes on holiday when, e.g., passing around large tuples and so-on. Fortunately, generated functions seem to cause it to not be lazy (for obvious reasons).

Let me address your questions

1) The macros are only for convenience. The inner constructor for the table is something like:

Table{FieldIndex{(Field{:A,Int64}(), Field{:B,Float64}())}(), Tuple{Int64,Float64}, Tuple{Array{Int64,1},Array{Float64,1}}}(([1,2,3], [2.0,4.0,6.0]))

Yuk, right? Anyway, I realize it is dreadful and some information can be computed (the element types from the array types, perhaps the fields could just be names, I don't know).

2) That was just my original design. A Field combines a name and a type, just like the fields of type and immutable objects. The programmer designs a table with specific data in mind - and data can be converted upon input, or whatever, transparently. The type contract is never broken.

As for the macros, I wanted to allow both field names and Field objects to be used, so the non-type-annotated version assumes a field already exists, like:

field_A = Field{:A,Int64}()
cell_1 = @cell(field_A = 42)
cell_2 = @cell(A::Int64 = 42) # == cell_1

I wasn't sure how to make both inputs work. Different macros? Use a symbol, like @cell(:A = 42)? I think the API could definitely be refined.

3) Great question. This was a design decision I made for a couple of reasons. First, I liked that I could write an object where some invariants are checked in the constructor (e.g. it won't allow two fields with the same name, or any called :Row). Similarly, being a distinct type means I can dispatch many things with the type, such as call with a tuple which creates a Row or Table depending on the types.

I believe this could be simplified, but some things might have to be checked later in the process. Also, working with Tuple types in 0.4 is a bit painful. I'm guessing there will be enough metaprogramming power in @pure that some of these things can be done differently (and without generated functions).

Finally, regarding the package name, I shortened the module name myself when I got frustrated with typing a longer name during development (frequent reload commands mean I use import, not using). Tables.jl, DataTables.jl and TypedTables.jl are all acceptable to me, but I plan to bring this up in the PR.

nalimilan commented 8 years ago

Sorry for the late reply. I wonder why you couldn't simplify the type to this: Table{Tuple{:A, :B}, Tuple{Array{Int64,1}, Array{Float64, 1}}}, i.e. a tuple type of column names and a tuple type of column types. This wouldn't prevent checking for invariants at all. Then Table(A=[1,2,3], B=[2.0,4.0,6.0]) could easily infer the type parameters from its arguments.

Finally, I don't think you need to store both the type of the columns and their element type: eltype should always provide this information if you call it on the column type. This should help reduce the number of parameters.
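A sketch of that design (the getcolumn helper is hypothetical; this is close to what NamedTuple-backed tables later did):

```julia
# Column names as a tuple of symbols in the type; column types come from the
# tuple of columns itself, so eltype is recoverable without extra parameters.
struct Table{names,C<:Tuple}
    columns::C
    function Table{names}(columns::C) where {names,C<:Tuple}
        length(names) == length(columns) || throw(ArgumentError("name/column count mismatch"))
        return new{names,C}(columns)
    end
end

# Keyword constructor infers both parameters from its arguments.
Table(; kw...) = Table{Tuple(keys(kw))}(Tuple(values(kw)))

# Type-stable column lookup: the index is computed at compile time from the
# names parameter, so the concrete column type is known to the caller.
@generated function getcolumn(t::Table{names}, ::Val{name}) where {names,name}
    i = findfirst(isequal(name), collect(names))
    return :(t.columns[$i])
end

t = Table(A = [1, 2, 3], B = [2.0, 4.0, 6.0])
getcolumn(t, Val(:B))  # inferred as Vector{Float64}
```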

andyferris commented 8 years ago

Yes Milan, good idea; I agree, and have thought about this lately, but I haven't got around to doing a type-design iteration yet. And yes, the invariants can be checked; it's just a matter of when. I had included both the element types and the storage types in the header because I was trying to see how far one could go without generated functions, since eltypes (plural, not singular) fails to generate good type information on Julia 0.4 if it is not generated. One could also use Julia 0.5 pure functions to fix that. But as it stands, most of the type-stability (and zero run-time overhead) relies heavily on generated functions.

The other place the element type is useful is getting the correct function signature, for instance to push a Row onto a Table. But it is superfluous - one just moves the checks into the function (a generated/pure helper function should lead to zero run-time overhead).

Do you think Tuple{:A,:B} is OK? Interestingly, it's a type that can't be instantiated, and Stefan and/or Jeff once saw I had done that before and thought it was a mistake that Julia even allows it (sorry, I can't seem to find the reference to that discussion right now).

The other two possibilities are:

Table{Tuple{Val{:A}, Val{:B}}, Tuple{Array{Int64,1}, Array{Float64, 1}}}

or

Table{(:A, :B), Tuple{Array{Int64,1}, Array{Float64, 1}}}

I was considering the latter. A field name might be carried in a Val of a symbol, and a collection of fields (currently FieldIndex) in a Val of a tuple of symbols. Or, I could make new types like Name and Names for these, which would have many methods defined (currently I can take the union of two FieldIndexes; I would feel bad about defining a bunch of methods on Val). Or, fields and indices can stay as they are, with full type information.

Your opinion?

nalimilan commented 8 years ago

I would go with the latter ((:A, :B)), as indeed the column names are not types, they are just parameters. And I don't think Val is supposed to be used that way, it's just a way to dispatch on a specific value when calling a function.

andyferris commented 8 years ago

Indeed, I agree with you @nalimilan. I expect I'll go that way when I find the time. :)

As a general comment, I don't think there is much top-down imposition, nor much community consensus, on how Val should be used. It is defined in Base as immutable Val{T}; end without comment or any methods defined for manipulating Vals; furthermore, there are AFAIK just a few comments online discussing its use (e.g. Val{T}() vs Val{T}). Certainly, Tuple{Val{x}} can at least be instantiated, as (Val{x}(),), so it is better than Val{:A,:B}. Also compare using Val-types vs other singletons in Base like LinearFast(), which are passed as values, not types.

Anyway, that is just an aside. I have some deeper questions for actual R-data.frames/data.table/dplyr and python-pandas users. I've come up with a macro @select which easily combines the notation of dplyr's select and mutate together (my Tables are immutable, so mutate doesn't even make sense in Julia). It is used like:

@select(table, col1, newname = col2, newcol::newtype = col1 -> f(col1))

where it has three possible behaviours, respectively: selecting a column, renaming a column (and including it) and creating a new column via a row-by-row calculation. The syntax is quite constrained by the semantics of Julia, but any comments or suggested improvements would be welcome!

I tried to emulate dplyr-like syntax since it seems to have the preferable syntax (compared to R's data.table, which people seem to prefer for speed not syntax). As such, I was hoping to implement @filter, @arrange etc. Suggestions and greedy desires from other data manipulators would be useful at this early stage!

datnamer commented 8 years ago

@andyferris have you seen this? https://github.com/JuliaStats/DataFramesMeta.jl

nalimilan commented 8 years ago

Sounds like a good strategy. Like @datnamer, I think the state of our reflections with regard to this is in DataFramesMeta (see also the issues in that project). It would make sense to follow SQL/LINQ/dplyr as much as possible, unless there's a good reason not to.

There has also been some discussion regarding APIs here: https://github.com/JuliaStats/DataFrames.jl/issues/369.

datnamer commented 8 years ago

This may be premature, but are there any thoughts on SQL code generation?

Edit: There is also this https://github.com/MikeInnes/Flow.jl. I wonder if that can help (query planning, delayed eval etc)

andyferris commented 8 years ago

@datnamer: quick answer regarding SQL is "no". But thank you to both of you for interesting reads (I definitely had seen DataFramesMeta quite some time ago but hadn't thought about it lately).

Regarding #369, is the point just syntax where you write code in devectorized form? I think my new @select macro should be fine for this, since it uses a hidden comprehension, which allows for pretty generic code inside. The user can declare the element type but not the container type (so it's hard to use NullableArrays or pooled data structures; I need to think more about this).
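Hand-desugared, the hidden comprehension amounts to something like this (hypothetical names, not the actual @select expansion):

```julia
# The macro's job is to pull the referenced columns out in front, then run a
# comprehension over the rows; inside, the column types are concrete, so the
# per-row expression is fully inferred.
function computed_column(col1::AbstractVector, col2::AbstractVector)
    return [c1 > 0 ? c1 + c2 : c2 for (c1, c2) in zip(col1, col2)]
end

computed_column([1, -2, 3], [0.5, 0.5, 0.5])  # → [1.5, 0.5, 3.5]
```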

nalimilan commented 8 years ago

Regarding #369, is the point just syntax where you write code in devectorized form? I think my new @select macro should be fine for this, since it uses a hidden comprehension, which allows for pretty generic code inside. The user can declare the element type but not the container type (so it's hard to use NullableArrays or pooled data structures; I need to think more about this).

Well, mostly, but it would also allow you to refer to all existing columns as variables, as well as to create new ones like you would create variables. So it would take a complete code block instead of short expressions passed as separate arguments. See the example I provided there. It's particularly useful when combining several variables using if to create a new one. Then there's also the question of propagating nulls automatically, which is a bit harder.

nalimilan commented 5 years ago

Closing as we're not going to make the DataFrame type encode column type information in 1.0.