JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

DataFrames replacement that has no type-uncertainty #744

Closed tshort closed 5 years ago

tshort commented 9 years ago

@johnmyleswhite started this interesting email thread:

https://groups.google.com/forum/#!topic/julia-dev/hS1DAUciv3M

Discussions included:

The following issues in Base may help with type certainty:

MikeInnes commented 9 years ago

So, I hit on a very similar problem to this yesterday and came up with something you lot might be interested in – although I'm not yet sure if it's actually usable or just a horrible hack. Gist here, use at your own risk, may set Julia on fire etc.

As in Simon/Jarrett's implementations, we want to hold as much type info as possible so that we can help the compiler, but the trick is to actively hide that information from the type system until we know it will be useful. Take:

d = TypedDict(f"foo"=>1, f"bar"=>"foo")
d[f"foo"]

(You can of course use symbols as with regular Dicts, but using the field string macro makes both the constructor and the indexing type-inferrable.) The type of the dict is TypedDict{I} where I is a symbol – the type information is associated with the symbol in a way that the staged functions can retrieve, but since it's opaque to the compiler we're no longer trying to do inference on Big Data type tags.
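The mechanism might be sketched like this in current Julia (hypothetical names such as TYPE_INFO and typed_get; Mike's gist differs in detail, and a mutable global registry read from a generated function is itself a hack):

```julia
# Rough sketch of the trick, not Mike's actual code: type info lives in a
# global registry keyed by a gensym'd symbol, so the compiler only ever sees
# TypedDict{I}, while generated functions can look the concrete types up.
const TYPE_INFO = Dict{Symbol,Dict{Symbol,Type}}()

struct TypedDict{I}
    values::Dict{Symbol,Any}
end

function TypedDict(pairs::Pair{Symbol}...)
    I = gensym("ti")
    TYPE_INFO[I] = Dict{Symbol,Type}(k => typeof(v) for (k, v) in pairs)
    return TypedDict{I}(Dict{Symbol,Any}(pairs...))
end

# Runs at compile time with access to I, so it can hard-code each field's type.
@generated function typed_get(d::TypedDict{I}, ::Val{k}) where {I,k}
    T = TYPE_INFO[I][k]
    return :(d.values[$(QuoteNode(k))]::$T)
end

d = TypedDict(:foo => 1, :bar => "hi")
typed_get(d, Val(:foo))  # the type assertion makes this inferrable as Int
```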

The neat part of this is that the same symbol can share type info for many different objects. Operations on TypedDicts which add or remove fields can generally return a TypedDict of the exact same type (i.e. the same symbol parameter), which means no crazy proliferation of generated code. The naive user who tries to hcat a million-column dataframe will be fine.

Potential caveats:

Any thoughts? I'll probably play around with this a bit more and give an update if there's any particular issues or successes.

jrevels commented 9 years ago

This is an interesting idea, but I found a case that might thwart this approach for our purposes. For example, bouncing off of your code:

julia> function Base.hcat(a::TypedDict, b::TypedDict) # naive hcat
           return TypedDict(merge(a.values, b.values)...)
       end
hcat (generic function with 14 methods)

julia> a = TypedDict(f"foo"=>1, f"bar"=>"foo")
TypedDict{symbol("##ti#7555")}(Dict{Symbol,Any}(:bar=>"foo",:foo=>1))

julia> b = TypedDict(f"baz"=> 23.21)
TypedDict{symbol("##ti#7555")}(Dict{Symbol,Any}(:baz=>23.21))

julia> function f(a, b)
           d = hcat(a, b)
           return d[f"baz"]
       end
f (generic function with 1 method)

julia> @code_warntype f(a, b)
Variables:
  a::TypedDict{symbol("##ti#7555")}
  b::TypedDict{symbol("##ti#7555")}
  d::Any

Body:
  begin  # none, line 2:
      d = (top(_apply))((top(getfield))(Main,:call)::F,Main.TypedDict,(Main.merge)((top(getfield))(a::TypedDict{symbol("##ti#7555")},:values)::Dict{Symbol,Any},(top(getfield))(b::TypedDict{symbol("##ti#7555")},:values)::Dict{Symbol,Any})::Dict{Symbol,Any})::Any # none, line 3:
      return (Main.getindex)(d,$(Expr(:new, Field{:baz})))::Any
  end::Any

So, ignoring the issue of ordering (we could just make the DataFrames version of TypedDict hold an OrderedDict of Vectors), I can bork inferencing on the indexing operations simply by preceding it with an hcat. Now, I wrote that hcat really naively, so maybe there's a better way to write it that would help?

My fear is that, under this approach, any function f(::DataFrame) --> DataFrame could throw a wrench into type inferencing, unless the caller properly manipulates the internals (e.g. write generated functions to handle it). I might be being overly pessimistic, though.

MikeInnes commented 9 years ago

any function f(::DataFrame) --> DataFrame could throw a wrench into type inferencing

Only a relatively narrow class of functions, i.e. those that manipulate the internals of the type rather than going through an external interface. So your hcat/merge would have to do the right type gymnastics, but once hcat is inferable its caller doesn't have to do anything special. (I think that's the same as the other prototypes here, right?)
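A standalone toy illustrating that point (nothing here is from Mike's gist): inference composes through calls, so only the innermost function needs the type gymnastics.

```julia
# If the inner function's return type is concretely inferable, every caller
# is inferable with no extra annotations; the burden sits in one place.
merge_cols(a::Vector{Int}, b::Vector{Float64}) = (a, b)  # Tuple{Vector{Int},Vector{Float64}}

caller(a, b) = sum(merge_cols(a, b)[2])  # inferred as Float64 "for free"

caller([1, 2], [0.5, 1.5])  # → 2.0
```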

Happy to prototype stuff like hcat if it's useful.

jrevels commented 9 years ago

I'm traveling atm, so I can't test this out, but here's the general issue I was thinking about:

function f(df::DataFrame)
    df2 = g(df) # returns DataFrame
    return h(df2) # returns whatever
end

Even if g and h are type certain in a global context by way of using generated functions, the fact that the type information is "boxed" could potentially prevent the return type of f from being properly inferred. Like I said, I could be wrong, I'd have to see it in action to be sure.

The other "hide type info from the compiler" approaches may also have this problem (if it indeed exists and I'm not blatantly wrong).

jrevels commented 9 years ago

Never mind, just got a chance to test this out, and I think I proved myself wrong. I believe I understand why I was wrong, too. Apologies for derailing with non-issues.

You can pull off such crazy stuff with generated functions, it throws me for a loop sometimes...

MikeInnes commented 9 years ago

Glad to have an extra set of eyes on this either way :) I think your basic concern about generated functions proliferating everywhere is a very valid one, but it's hard to know how that will play out without experimenting a bit more, I think.

simonbyrne commented 9 years ago

So would the idea be that the type parameter symbol is hashed from the fields & types?

MikeInnes commented 9 years ago

Sort of. The symbol stores an arbitrary set of type information Symbol=>Type and any TypedDict whose key/value set matches a subset of those types can use that key for type info. That means that adding new keys to the dict, for example, will usually reuse the same key, which minimises the number of types that Julia has to specialise for.

quinnj commented 9 years ago

I've been following this thread with great interest for a while now, but just recently started to run into this problem in a very real way in my own packages. I'm still not totally decided on a solid interface, but I feel like I've learned a couple of things along the way.

On the need for "type kernel" methods, I'm starting to feel quite certain there's no amount of "hacking" that will get around the need for them. I think it will take an extra compiler optimization as mentioned in https://github.com/JuliaLang/julia/issues/5654, or some other kind of generated-types approach. The key missing optimization is intelligence during type inference: when you pull a parameterized element out of another object, inference needs the ability to further specialize code on the extracted element. The simplest case I have is a field of type Vector{Vector{T}}: each individual Vector{T} has a different T and would end up with slightly different code patterns accordingly. Currently, however, type inference can only see Vector{T} as the inferred type and no further specialization is done, despite the fact that there will be some concrete T at run time.
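The inference gap can be seen in a few lines (the Columns type here is hypothetical, just for illustration):

```julia
# The field type records only `Vector`, so inference stops there, even though
# each element has a concrete type (Vector{Int}, Vector{Float64}) at run time.
struct Columns
    data::Vector{Vector}  # abstract element type, as in quinnj's example
end

cols = Columns(Vector[[1, 2, 3], [1.0, 2.0]])

# `c.data[1]` infers only as Vector, so this call is dynamically dispatched.
col_total(c::Columns) = sum(c.data[1])

col_total(cols)  # → 6
```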

I'd appreciate any feedback on DataTable; it's by no means my "replacement" for DataFrames. On the contrary, I'd much rather see progress in DataFrames and type inference. In the mean time, if anyone wants to help me hack on it, I'd appreciate the help.

matthieugomez commented 9 years ago

What does the restriction (in terms of Int, Float64, Date, etc.) allow in terms of type stability?

matthieugomez commented 9 years ago

It seems like there are two issues with the current proposals: the syntax for type stable indexing is cumbersome, and each new dataframe generates a new type. I'm wondering if the following works: keep the current DataFrame type by default and — only when needed — create a fast iterator on dataframe rows with eachrow(df). eachrow would return something similar to the specialized dataframe type proposed above. The syntax would be something like:

function f(eachrow::EachRow)
    out = 0
    for (v1, v2) in eachrow
        out += v2 > 0 ? v1 : 0
    end
    return out
end

df = DataFrame(x = 1:2, y = [0.1, 3.0], z = 2:3)
f(eachrow(df[[:x, :y]]))

nalimilan commented 9 years ago

I don't think that would work, unfortunately. The type of the iterator can't be inferred either, since it depends on the DataFrame argument whose type doesn't provide information on the types of the columns.

matthieugomez commented 9 years ago

Yeah, sorry, I updated my post. I think this one works, right?

nalimilan commented 9 years ago

I don't understand your new post: does it need a data frame specialized on the column types or not? You seem to imply both that and its contrary.

simonbyrne commented 8 years ago

For a quick summary, see @johnmyleswhite's post: http://www.johnmyleswhite.com/notebook/2015/11/28/why-julias-dataframes-are-still-slow/

c42f commented 8 years ago

Gist here, use at your own risk, may set Julia on fire etc.

@MikeInnes - this link is now broken, but I'm rather interested in seeing the direction you were taking. Do you still have the code around somewhere?

datnamer commented 8 years ago

I'd also like to check out the gist.

MikeInnes commented 8 years ago

Sure, you can actually see the whole code at Data.jl and I put up a self-contained gist here. Data.jl is basically just a fleshed-out version where the full DataFrame object is built on top of the TypedDict.

Honestly though, when I was playing around with this I'm not sure the type inferrability of dataframe indexing made a whole lot of difference. You end up having to factor operations out into acting on columns anyway (see e.g. DecisionTrees.jl).

For me the biggest wins came from (1) replacing DataArray with Vector{Nullable}, (2) specialising columns where appropriate (removing Nullable, pooled arrays etc.) (3) writing fast column-level kernels for high-level operations. That gets you performance on par with scikit-learn without having to mess around with matrices and strange data encodings.

johnmyleswhite commented 8 years ago

We should probably document the ways in which non-concrete types harm the performance of code written for DataFrames until we have a better design that we can all agree on. Each of the problems that comes up is very different from the others.

1) Row iteration:

df = get_data_frame()

for r in eachrow(df)
    do_something_per_row(r)
end

2) Whole-table functions:

df = get_data_frame()

do_something_to_whole_table(df)

3) Column-level kernel functions:

df = get_data_frame()

do_something_to_columns(df[:a], df[:b], df[:c])

function do_something_to_columns(a, b, c)
    s = 0.0
    for i in 1:length(a)
        s += a[i] + b[i] + c[i]
    end
    return s
end

davidagold commented 8 years ago

@johnmyleswhite What is a "kernel function" in this context?

johnmyleswhite commented 8 years ago

It's possible I don't understand what other people mean, but I'm using kernel function to refer to a function f(args...) all of whose arguments are the columns of a DataFrame after extracting them and resolving any type-uncertainty. This function is where the core computational work happens; as long as it's fast, the rest of the system has acceptable performance.
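A minimal sketch of that pattern (hypothetical names; the Dict{Symbol,Any} stands in for a type-uncertain DataFrame):

```julia
# The kernel: all arguments are concrete columns, so the loop is fully typed.
function col_sum(a::AbstractVector, b::AbstractVector)
    s = zero(eltype(a)) + zero(eltype(b))
    for i in 1:length(a)
        s += a[i] + b[i]
    end
    return s
end

# Type-uncertain container standing in for a DataFrame.
store = Dict{Symbol,Any}(:a => [1, 2, 3], :b => [0.5, 0.5, 0.5])

# Extraction is type-uncertain, but dispatch into the kernel resolves the
# concrete column types, so the core work runs at full speed.
total(store) = col_sum(store[:a], store[:b])

total(store)  # → 7.5
```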

andyferris commented 8 years ago

Hi everyone,

At work we need to manage data with multiple types. I've also been working on how to do typed data containers in Julia, and have created a reasonably functional package called Tables.jl.

My approach is a little more complex (and heavy on metaprogramming) than the example given by @MikeInnes. But it also includes a first-class Row tuple-wrapper which actually makes dealing with data in row chunks fast (@johnmyleswhite - in Julia we can have our cake and eat it). Or you can extract the columns / raw data / etc as necessary.

I have seen recent additions to Julia 0.5 that fix a few speed problems with tuples, so this should become a reasonably fast approach (it should already be better than DataFrames with no care taken with regards to type stability, but I haven't done any in-depth benchmarking beyond inspecting the generated llvm/native code). Furthermore, using the new @pure meta and factoring out some of the metaprogramming (into a separate typed-dict package or something) should simplify the code significantly and allow for static compilation in the future.

I would appreciate any feedback. I was also thinking of making a PR to METADATA to make it public, but I definitely don't want to annoy or upset the JuliaStats community, so I am also seeking comments on that (i.e. whether there is something similar soon to be released from you guys, or vehement opposition to having multiple packages with some overlap in functionality, etc.).

tshort commented 8 years ago

Great, @andyferris! Please feel free to register Tables. Looks like lots of good stuff in there. The API for Tables is wordy and noisy. That may be the price for type stability. To simplify, could @pure allow tbl[:A] to be treated like tbl[Val{:A}]?

Are you planning conversion methods to DataFrames?

andyferris commented 8 years ago

@tshort Thanks, and yes, that is quite possible. At the very least, I just tried

julia> function f(x::Symbol)
         Base.@_pure_meta
         Val{x}
       end
f (generic function with 1 method)

julia> g() = f(:a)
g (generic function with 1 method)

julia> code_warntype(g,())
Variables:
  #self#::#g

Body:
  begin  # none, line 1:
      return $(QuoteNode(Val{:a}))
  end::Type{Val{:a}}

in nightly. This seems promising to me. At the very least, this doesn't seem possible in 0.4:

julia> @inline f2(x::Symbol) = Val{x}
f2 (generic function with 1 method)

julia> g2() = f2(:a)
g2 (generic function with 1 method)

julia> code_warntype(g2,())
Variables:
  #self#::#g2

Body:
  begin  # none, line 1:
      return (top(apply_type))(Main.Val,:a)::Type{_<:Val{T}}
  end::Type{_<:Val{T}}

This change will make the interface for Tables.jl quite a bit nicer to use. Would be interesting if the constructors could be cleaned up too.

And yes methods to eat and spit out DataFrames is planned... at the very least that will provide I/O in the short-term.

tshort commented 8 years ago

Great news, @andyferris. I'd shoot for v0.5 then. That would make the "everyday" API much nicer.

nalimilan commented 8 years ago

Really interesting! I hadn't anticipated that @pure would allow tbl[:A] to be type-stable.

I have a few questions, which may well reflect my misunderstanding:

1) Why do @table, @field and @cell need to be macros?
2) Likely related: why are the type annotations in @table(A::Int64=[1,2,3], B::Float64=[2.0,4.0,6.0]) or @cell(A::Int64=1) required? I guess they can be useful if you want to force an abstract type to be used when passing a single value to @cell, but shouldn't inference be enough in other cases?
3) I don't understand the significance of/need for FieldIndex. Shouldn't this information be stored as a Tuple{} of field types in the table or row objects, instead of being exposed to the user?

(Finally, the name bikeshedding session: I think Tables.jl is too general. I also had a package called that way, and I renamed it to FreqTables.jl before registering it, as tables can mean very different things in different fields. How about DataTables.jl, which is the name you use at the top of README.md? Anyway, probably better discuss this in the METADATA.jl registration PR to keep the present thread focused.)

andyferris commented 8 years ago

Thanks @nalimilan. Yes, the @pure thing is pretty cool... I'm not certain yet if it is a "compiler hint" or follows a clear set of rules. Currently Julia's type inference system gives up and goes on holiday when, e.g., passing around large tuples and so-on. Fortunately, generated functions seem to cause it to not be lazy (for obvious reasons).

Let me address your questions

1) The macros are only for convenience. The inner constructor for the table is something like:

Table{FieldIndex{(Field{:A,Int64}(), Field{:B,Float64}())}(), Tuple{Int64,Float64}, Tuple{Array{Int64,1},Array{Float64,1}}}(([1,2,3], [2.0,4.0,6.0]))

Yuk, right? Anyway, I realize it is dreadful and some information can be computed (the element types from the array types, perhaps the fields could just be names, I don't know).

2) That was just my original design. A Field combines a name and a type, just like the fields of type and immutable objects. The programmer designs a table with specific data in mind - and data can be converted upon input, or whatever, transparently. The type contract is never broken.

As for the macros, I wanted to allow both field names and Field objects to be used, so the non-type-annotated version assumes a field already exists, like:

field_A = Field{:A,Int64}()
cell_1 = @cell(field_A = 42)
cell_2 = @cell(A::Int64 = 42) # == cell_1

I wasn't sure how to make both inputs work. Different macros? Use a symbol, like @cell(:A = 42)? I think the API could definitely be refined.

3) Great question. This was a design decision I made for a couple of reasons. First, I liked that I could write an object where some invariants are checked in the constructor (e.g. it won't allow two fields with the same name, or any called :Row). Similarly, being a distinct type means I can dispatch many things with the type, such as call with a tuple which creates a Row or Table depending on the types.

I believe this could be simplified, but some things might have to be checked later in the process. Also, working with Tuple types in 0.4 is a bit painful. I'm guessing there will be enough metaprogramming power in @pure that some of these things can be done differently (and without generated functions).

Finally, regarding the package name, I shortened the module name myself when I got frustrated with typing a longer name during development (frequent reload commands mean I use import, not using). Tables.jl, DataTables.jl and TypedTables.jl are all acceptable to me, but I plan to bring this up in the PR.

nalimilan commented 8 years ago

Sorry for the late reply. I wonder why you couldn't simplify the type to this: Table{Tuple{:A, :B}, Tuple{Array{Int64,1}, Array{Float64, 1}}}, i.e. a tuple type of column names and a tuple type of column types. This wouldn't prevent checking for invariants at all. Then Table(A=[1,2,3], B=[2.0,4.0,6.0]) could easily infer the type parameters from its arguments.

Finally, I don't think you need to store both the type of the columns and their element type: eltype should always provide this information if you call it on the column type. This should help reduce the number of parameters.
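A sketch of that design (the getcolumn helper is hypothetical; this is close to what NamedTuple-backed tables later did):

```julia
# Column names as a tuple of symbols in the type; column types come from the
# tuple of columns itself, so eltype is recoverable without extra parameters.
struct Table{names,C<:Tuple}
    columns::C
    function Table{names}(columns::C) where {names,C<:Tuple}
        length(names) == length(columns) || throw(ArgumentError("name/column count mismatch"))
        return new{names,C}(columns)
    end
end

# Keyword constructor infers both parameters from its arguments.
Table(; kw...) = Table{Tuple(keys(kw))}(Tuple(values(kw)))

# Type-stable column lookup: the index is computed at compile time from the
# names parameter, so the concrete column type is known to the caller.
@generated function getcolumn(t::Table{names}, ::Val{name}) where {names,name}
    i = findfirst(isequal(name), collect(names))
    return :(t.columns[$i])
end

t = Table(A = [1, 2, 3], B = [2.0, 4.0, 6.0])
getcolumn(t, Val(:B))  # inferred as Vector{Float64}
```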

andyferris commented 8 years ago

Yes Milan, good idea; I agree, and have thought about this lately, but I haven't got around to doing a type-design iteration yet. And yes, the invariants can be checked; it's just a matter of when. I had included both the element types and the storage types in the header because I was trying to see how far one could go without generated functions, since eltypes (plural, not singular) fails to generate good type information on Julia 0.4 if it is not generated. One could also use Julia 0.5 pure functions to fix that. But as it stands, most of the type-stability (and zero run-time overhead) relies heavily on generated functions.

The other place the element type is useful is getting the correct function signature, for instance to push a Row onto a Table. But it is superfluous - one just moves the checks into the function (a generated/pure helper function should lead to zero run-time overhead).

Do you think Tuple{:A,:B} is OK? Interestingly, it's a type that can't be instantiated, and Stefan and/or Jeff once saw I had done that before and thought it was a mistake that Julia even allows it (sorry, I can't seem to find the reference to that discussion right now).

The other two possibilities are:

Table{Tuple{Val{:A}, Val{:B}}, Tuple{Array{Int64,1}, Array{Float64, 1}}}

or

Table{(:A, :B), Tuple{Array{Int64,1}, Array{Float64, 1}}}

I was considering the latter. A field name might be carried in a Val of a symbol, and a collection of fields (currently FieldIndex) in a Val of a tuple of symbols. Or, I could make new types like Name and Names for these, which would have many methods defined (currently I can take the union of two FieldIndexes; I would feel bad about defining a bunch of methods on Val). Or, fields and indices can stay as they are, with full type information.

Your opinion?

nalimilan commented 8 years ago

I would go with the latter ((:A, :B)), as indeed the column names are not types, they are just parameters. And I don't think Val is supposed to be used that way, it's just a way to dispatch on a specific value when calling a function.

andyferris commented 8 years ago

Indeed, I agree with you @nalimilan. I expect I'll go that way when I find the time. :)

As a general comment, I don't think there is much top-down imposition, nor much community consensus, on how Val should be used. It is defined in Base as immutable Val{T}; end without comment or any methods defined for manipulating Vals; furthermore, there are AFAIK just a few comments online discussing its use (e.g. Val{T}() vs Val{T}). Certainly, Tuple{Val{x}} can at least be instantiated, as (Val{x}(),), so it is better than Val{:A,:B}. Also compare using Val-types vs other singletons in Base like LinearFast(), which are passed as values, not types.

Anyway, that is just an aside. I have some deeper questions for actual R-data.frames/data.table/dplyr and python-pandas users. I've come up with a macro @select which easily combines the notation of dplyr's select and mutate together (my Tables are immutable, so mutate doesn't even make sense in Julia). It is used like:

@select(table, col1, newname = col2, newcol::newtype = col1 -> f(col1))

where it has three possible behaviours, respectively: selecting a column, renaming a column (and including it) and creating a new column via a row-by-row calculation. The syntax is quite constrained by the semantics of Julia, but any comments or suggested improvements would be welcome!

I tried to emulate dplyr-like syntax since it seems to have the preferable syntax (compared to R's data.table, which people seem to prefer for speed not syntax). As such, I was hoping to implement @filter, @arrange etc. Suggestions and greedy desires from other data manipulators would be useful at this early stage!

datnamer commented 8 years ago

@andyferris have you seen this? https://github.com/JuliaStats/DataFramesMeta.jl

nalimilan commented 8 years ago

Sounds like a good strategy. Like @datnamer, I think the state of our reflections with regard to this is in DataFramesMeta (see also the issues in that project). It would make sense to follow SQL/LINQ/dplyr as much as possible, unless there's a good reason not to.

There has also been some discussion regarding APIs here: https://github.com/JuliaStats/DataFrames.jl/issues/369.

datnamer commented 8 years ago

This may be premature, but are there any thoughts on SQL code generation?

Edit: There is also this https://github.com/MikeInnes/Flow.jl. I wonder if that can help (query planning, delayed eval etc)

andyferris commented 8 years ago

@datnamer: quick answer regarding SQL is "no". But thank you to both of you for interesting reads (I definitely had seen DataFramesMeta quite some time ago but hadn't thought about it lately).

Regarding #369, is the point just syntax where you write code in devectorized form? I think my new @select macro should be fine for this, since it uses a hidden comprehension, which allows for pretty generic code inside. The user can declare the element type but not the container type (so it's hard to use NullableArrays or pooled data structures; I need to think more about this).
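Hand-desugared, the hidden comprehension amounts to something like this (hypothetical names, not the actual @select expansion):

```julia
# The macro's job is to pull the referenced columns out in front, then run a
# comprehension over the rows; inside, the column types are concrete, so the
# per-row expression is fully inferred.
function computed_column(col1::AbstractVector, col2::AbstractVector)
    return [c1 > 0 ? c1 + c2 : c2 for (c1, c2) in zip(col1, col2)]
end

computed_column([1, -2, 3], [0.5, 0.5, 0.5])  # → [1.5, 0.5, 3.5]
```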

nalimilan commented 8 years ago

Regarding #369, is the point just syntax where you write code in devectorized form? I think my new @select macro should be fine for this, since it uses a hidden comprehension, which allows for pretty generic code inside. The user can declare the element type but not the container type (so it's hard to use NullableArrays or pooled data structures; I need to think more about this).

Well, mostly, but it would also allow you to refer to all existing columns as variables, as well as to create new ones like you would create variables. So it would take a complete code block instead of short expressions passed as separate arguments. See the example I provided there. It's particularly useful when combining several variables using if to create a new one. Then there's also the question of propagating nulls automatically, which is a bit harder.

nalimilan commented 5 years ago

Closing as we're not going to make the DataFrame type encode column type information in 1.0.