JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Converting DataVectors to Vectors and DataMatrix's/DataFrame's to Matrix's #160

Closed: johnmyleswhite closed this issue 11 years ago

johnmyleswhite commented 11 years ago

We need to decide on the names of functions for converting Data* objects into standard Julia objects. I propose the following:

  • vector(dv::DataVector{T}) converts to Vector{T} and fails on NA. This makes it equivalent to failNA.
  • vector(dv::DataVector, na = true) converts to Vector{Any} and propagates NA.
  • matrix(dm::DataMatrix{T}) converts to Matrix{T} and fails on NA. This makes it equivalent to failNA.
  • matrix(dm::DataMatrix, na = true) converts to Matrix{Any} and propagates NA.
  • matrix(df::DataFrame) converts to Matrix{T}, where T is the type union of coltypes(df). It fails on NA.
  • matrix(df::DataFrame, na = true) converts to Matrix{Any} and propagates NA.

This would remove the array and arrayNA functions currently found in src/dataframe.jl. I think those names are confusing now that we support rank-n tensors under the name DataArray.

Thoughts?

HarlanH commented 11 years ago

Yes to the function names. For me, saying "na=True" (using Options?) doesn't necessarily get at the idea that you're changing the result type to Any. How about convert="Any" to generate an Any array, or convert="union" to generate a type union array? And maybe we should explicitly support conversion of NAs to other values (strings, NaN, 999) as part of these routines?

(Sent by email in reply to John Myles White's comment of Fri, Jan 11, 2013 at 9:43 AM, which is quoted in full above.)

johnmyleswhite commented 11 years ago

You're right: we can use better names for options via the Options module. I would prefer using symbols like convert = :any or convert = :union. I generally try to avoid allowing strings as ad hoc enums and feel better using symbols that way. I would suggest using types except that the union case wouldn't be as easy to handle.

I think your latter case would be better covered by extending failNA and replaceNA to work with DataArray's and DataFrame's. I think they may already work for DataArray's of arbitrary rank.

One question remains: how will these cases overlap with calling convert(Array{Any}, DataArray) and convert(Array{Any}, DataFrame)?
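As a rough illustration of the symbol-valued option discussed here, the sketch below shows a strict conversion that fails on NA next to one that widens to Any. It is written against present-day Julia and rests on several assumptions that are not part of the thread: `missing` stands in for NA, a plain `Vector{Union{T,Missing}}` stands in for `DataVector{T}`, the names `na_vector` and `convert_to` are hypothetical, and a keyword argument replaces the Options object.

```julia
# Sketch only: `missing` plays the role of NA and Vector{Union{T,Missing}}
# plays the role of DataVector{T}. `na_vector` and `convert_to` are
# hypothetical names, not the DataFrames API.
function na_vector(dv::AbstractVector; convert_to::Symbol = :strict)
    if convert_to === :strict
        # Fail on NA, otherwise narrow the element type to the non-missing type.
        any(ismissing, dv) && throw(ArgumentError("cannot convert: data contains NA"))
        return Vector{nonmissingtype(eltype(dv))}(dv)
    elseif convert_to === :any
        # Propagate NA by widening the element type to Any.
        return Vector{Any}(dv)
    end
    throw(ArgumentError("unknown convert_to option: $convert_to"))
end

clean = Union{Float64,Missing}[1.0, 2.0]
holey = Union{Float64,Missing}[1.0, missing]

na_vector(clean)                      # a Vector{Float64}
na_vector(holey; convert_to = :any)   # a Vector{Any} that still holds `missing`
# na_vector(holey) would throw an ArgumentError
```

Using a symbol keeps the set of modes closed and cheap to compare, which is the property argued for above; a `:union` mode is left out because its intended result type is not spelled out in the discussion.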

tshort commented 11 years ago

+1 on John's names.

This is another case where I wish we had real keyword arguments. Options seem heavy and clumsy just for a single option: vector(dv, @options(convert => :union)). Even if we use Options for this, I'd like to also be able to write vector(dv, :union) or vector(dv, :union_convert). In another package I'm working on that uses Options a lot, I'm struggling with the best way to use Options and multiple dispatch. I've been loosely following a rule: always provide an Options method when optional elements are involved (even one), also allow multiple dispatch to handle the cases with only one or two options, and have a version of the method where all options are specified as function arguments (this is the version that ultimately gets called).
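The layering rule described in this paragraph might look roughly like the sketch below. Everything in it is an assumption made for illustration: the name `convert_vector`, the `:any` and `:fill` modes, the `na_value` argument, the use of a `Dict` in place of an Options object, and `missing` in place of NA.

```julia
# Sketch only, with hypothetical names. A Dict stands in for an Options object
# and `missing` stands in for NA.

# 1. Core method: every option is an explicit argument. All other methods
#    forward here, so the behavior lives in one place.
function convert_vector(dv::AbstractVector, mode::Symbol, na_value)
    if mode === :any
        return Vector{Any}(dv)
    elseif mode === :fill
        return [ismissing(x) ? na_value : x for x in dv]
    end
    throw(ArgumentError("unknown mode: $mode"))
end

# 2. Dispatch shortcuts for the common cases with zero or one option.
convert_vector(dv::AbstractVector) = convert_vector(dv, :any, nothing)
convert_vector(dv::AbstractVector, mode::Symbol) = convert_vector(dv, mode, nothing)

# 3. Options-style entry point: unpack the container, then forward.
function convert_vector(dv::AbstractVector, opts::AbstractDict)
    return convert_vector(dv, get(opts, :mode, :any), get(opts, :na_value, nothing))
end

dv = Union{Int,Missing}[1, missing]
convert_vector(dv)                                        # widen to Any
convert_vector(dv, Dict(:mode => :fill, :na_value => 0))  # replace NA with 0
```

Because the shortcut and options methods only forward, adding a new option means touching the core method and the defaults, not every entry point.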

On strings vs. symbols for enums, I'm fine with symbols (or with strings), but we should plan on being consistent here. As far as consistency with option-type things, we should also try to coordinate with Gadfly. Right now, Daniel uses Dicts for option-type lists as:

p = plot(iris, {:x => "Sepal.Length", :y => "Sepal.Width"}, Geom.point)

This isn't directly aligned with the vector example, but this plot example doesn't involve enums.

I like Harlan's idea of an option (:raw_convert or something) to mean explicit conversion of NA's to something (strings, NaN, 999).

johnmyleswhite commented 11 years ago

I agree with @tshort that Options often feel heavy and clumsy. That's why I've avoided using them in my own packages, sometimes to the detriment of configurability.

I'm unsure how to align things with Gadfly since Daniel seems to have taken the option I would have chosen myself: optional arguments are just listed in a Dict.

I'm still a little unsure about the :raw_convert case. This is the only case in which you have to look at the data itself to decide what's going to happen: in the other cases, everything can be done using the column types alone.

tshort commented 11 years ago

I don't have any great ideas on how to align with Gadfly. In the package where I'm using Options a lot, @defaults is quite useful for me. That said, I think a Dict interface is more standard and a little cleaner for the user.

Changing the subject a little, did anyone else notice @dcjones has something like Sweave using Markdown and Pandoc:

https://github.com/dcjones/Gadfly.jl/blob/master/doc/overview.md

Daniel's doing great stuff.

johnmyleswhite commented 11 years ago

I just implemented a draft of this. Having code in front of me, I'm inclined to replace the earlier proposal with a slight variant:

The reason for this is that it's easier to be able to specify an output type than anything else: you want to be able to say that an all-numeric DataFrame will become a Matrix{Float64} even if the columns have types like {Int32, Int64, Float64}.
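To make the motivation concrete, here is a minimal sketch of converting mixed numeric columns to a caller-chosen element type. It is not the draft implementation mentioned above: `typed_matrix` is a hypothetical name, a vector of column vectors stands in for a DataFrame, and `missing` stands in for NA.

```julia
# Sketch only: `cols` stands in for the columns of a DataFrame and
# `typed_matrix` is a hypothetical name.
function typed_matrix(cols::AbstractVector{<:AbstractVector}, ::Type{T}) where {T}
    nrows = length(first(cols))
    all(c -> length(c) == nrows, cols) ||
        throw(DimensionMismatch("columns differ in length"))
    out = Matrix{T}(undef, nrows, length(cols))
    for (j, col) in enumerate(cols)
        # Fail on NA rather than guessing a replacement value.
        any(ismissing, col) && throw(ArgumentError("column $j contains NA"))
        out[:, j] = col   # each element is converted to T on assignment
    end
    return out
end

# Columns of type Int32, Int64 and Float64, as in the example above.
cols = AbstractVector[Int32[1, 2], Int64[3, 4], Float64[5.5, 6.5]]
m = typed_matrix(cols, Float64)   # a 2×3 Matrix{Float64}
```

Passing the target type explicitly means the result type is known from the call alone, without inspecting the column types or the data.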

tshort commented 11 years ago

What does vector(DataVector[1.0, NA], Float64) give? [1.0, NaN] or an error? It seems that both of these options might be desired in different conditions.

johnmyleswhite commented 11 years ago

An error.

I agree that the other behavior is desirable in some cases. Perhaps we should have:

vector(DataVector[1.0, NA], Float64, NaN)

as well? My only concern is that it's hard to make something like this work for DataFrames, but it's easy to make it work for DataArray's.
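The two behaviors discussed here, an error by default and an explicit replacement value on request, can be sketched as two methods of one function. The name `typed_vector` is hypothetical, `missing` stands in for NA, and a `Vector{Union{T,Missing}}` stands in for a DataVector.

```julia
# Sketch only, with hypothetical names; `missing` plays the role of NA.

# Two-argument form: convert to Vector{T} and fail if any NA is present.
function typed_vector(dv::AbstractVector, ::Type{T}) where {T}
    any(ismissing, dv) && throw(ArgumentError("cannot convert: data contains NA"))
    return Vector{T}(dv)
end

# Three-argument form: substitute `replacement` for every NA, then convert.
function typed_vector(dv::AbstractVector, ::Type{T}, replacement) where {T}
    return T[ismissing(x) ? replacement : x for x in dv]
end

dv = Union{Float64,Missing}[1.0, missing]
typed_vector(dv, Float64, NaN)   # a Vector{Float64} with NaN in place of NA
# typed_vector(dv, Float64) would throw an ArgumentError
```

Keeping the replacement as a separate method leaves the strict form as the default, so NA is never silently turned into a number.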

tshort commented 11 years ago

That seems sensible to me.

HarlanH commented 11 years ago

I'm on board with this latest proposal.

Regarding Options, as a co-author on that, I wish it was used more. :) It's a bit clunky, but I'd rather we use it now and then have a design that can be transitioned to using keyword arguments naturally whenever they arrive, versus a Matlab-like design with a bunch of 10-argument functions.

johnmyleswhite commented 11 years ago

In that case, this is closed by 7561a7c6108aa25f85acd7c5462e4a1182d46bbf

I agree entirely that we should try to use Options more. The difficulty for me is that I have trouble reasoning about how to combine Options with multiple dispatch. By the time I realize how I want something to work under Options, I've often coded up the secondary method handled by dispatch. Switching over the codebase to Options is one of my goals after I feel like we're not missing functionality.

HarlanH commented 11 years ago

Sounds fair!
