JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Add more constructors and document them #85

Closed johnmyleswhite closed 11 years ago

johnmyleswhite commented 11 years ago

How to construct DataVecs:

DataVec(values)
DataVec(type, length)
dvzeros(length)
dvones(length)

How to construct DataFrames:

DataFrame(DataVecs, Index)
DataFrame(DataVecs)
DataFrame(columns)
DataFrame(matrix)
DataFrame(type, nrow, ncol)
DataFrame(types, nrow)
DataFrame(types, names, nrow)
dfzeros(nrow, ncol)
dfones(nrow, ncol)
dfeye(nrow, ncol)
dfeye(nrow)

And some other desirable constructors that need to be implemented:

dvfalses(length)
dvtrues(length)
dvrand(length)
dvrand(length)

DataFrame(DictOfVects)
DataFrame(values_matrix, is_missing_matrix)

dfdiag(diagonal_values)
dffalses(nrow, ncol)
dftrues(nrow, ncol)
dfrand(nrow, ncol)

tshort commented 11 years ago

I think a lot of these are overkill and lead to too many functions. For example, creating a DataFrame that's an identity matrix seems pretty uncommon. Plus, it's easy enough to write DataFrame(eye(nrow,ncol)), which already works. The possible memory savings from a direct function don't seem worth it. And most of the dvrand-type functions won't even save memory over just doing DataVec(rand(len)).

I like all of the DataFrame and DataVec methods.

Also, we need to keep in mind PooledDataVecs, but again, you can normally use PooledDataVec(randi(3,5)).

johnmyleswhite commented 11 years ago

Ok. I'll nuke those. My main interest has just been making it easier to transition from normal Julia array operations to DataFrames. But you're totally right that the linear algebra constructors only make sense for DataMatrix objects and not DataFrames.

StefanKarpinski commented 11 years ago

If you follow the convention of a DataMatrix "owning" the memory from a normal matrix that is passed to it, you could even avoid copying the data at all, which would be nice. The same thing might apply to DataVecs and DataFrames.

johnmyleswhite commented 11 years ago

We should definitely avoid copying whenever possible.
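A minimal sketch of the "owning" convention in present-day Julia, for illustration only. The `OwnedMatrix` type and the `wrap` function are hypothetical names, not part of DataFrames; the point is simply that storing a reference to the caller's matrix, rather than a copy, makes construction free of any data copy.

```julia
# Hypothetical wrapper: holds a reference to the caller's matrix plus a
# missingness mask. Nothing here is the DataFrames API.
struct OwnedMatrix{T}
    data::Matrix{T}   # same memory as the matrix passed in, not a copy
    isna::BitMatrix   # true where an entry is considered missing
end

# Wrap an existing matrix without copying its contents.
wrap(m::Matrix) = OwnedMatrix(m, falses(size(m)))

m = [1.0 2.0; 3.0 4.0]
w = wrap(m)

# The wrapper and the original share storage, so a write through one
# is visible through the other.
m[1, 1] = 10.0
shared = w.data === m
```

The trade-off is the usual one for aliasing: the caller must treat the original matrix as handed over, since later writes to it change the wrapped data too.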

milktrader commented 11 years ago

Question about DataFrame(DataVecs, Index) ... once a Time type is designed and implemented is this how we would replicate zoo, xts classes from R?

I understand that the reason xts is a matrix indexed by time instead of a data.frame indexed by time (not sure about zoo) is because of speed. It might be that a DataFrame indexed by Time type will be just as fast, not sure.

johnmyleswhite commented 11 years ago

I don't know enough about the way zoo and xts work to answer that. But, in principle, creating a DataVec on top of the new Time type will be automatic. You can define a Time type right now and see how it should work:

load("DataFrames")
using DataFrames

include(path_expand("~/julia/extras/bitarray.jl"))

type Time
  seconds::Int64
end

DataVec(Array(Time, 10), bitfalses(10))

Unfortunately the definition of load is currently broken with the new module system and our DataVec constructor is doing something odd with attempts to convert times to Int64. We'll fix both soon.

In general, it should be trivial to make a DataVec contain arbitrary types: a DataVec{Complex} or a DataVec{Rational} should be effortless to create.
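For readers on current Julia, the snippet above no longer runs as written: `type` has become `struct`, `Array(T, n)` has become `Vector{T}(undef, n)`, and `bitfalses` has become `falses`. The following is a rough sketch of the same values-plus-mask idea in today's syntax. `SimpleTime` and `MaskedVec` are hypothetical names chosen for this illustration and are not the DataVec implementation.

```julia
# Hypothetical stand-in for the Time type discussed above.
struct SimpleTime
    seconds::Int64
end

# Hypothetical values-plus-mask container for any element type T.
struct MaskedVec{T}
    values::Vector{T}
    isna::BitVector
    function MaskedVec{T}(values::Vector{T}, isna::BitVector) where {T}
        length(values) == length(isna) ||
            throw(ArgumentError("values and mask must have the same length"))
        new{T}(values, isna)
    end
end

# Outer constructor so that T is inferred from the values.
MaskedVec(values::Vector{T}, isna::BitVector) where {T} = MaskedVec{T}(values, isna)

Base.length(v::MaskedVec) = length(v.values)

# Indexing returns `missing` wherever the mask is set.
Base.getindex(v::MaskedVec, i::Integer) = v.isna[i] ? missing : v.values[i]

# Ten time stamps, one minute apart, none missing at first.
times = MaskedVec([SimpleTime(s) for s in 0:60:540], falses(10))

# Mark the third entry as missing by flipping its mask bit.
times.isna[3] = true
```

Because the container is parametric, the same definition would hold a `Vector{Complex{Float64}}` or a `Vector{Rational{Int}}` without any extra code.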

milktrader commented 11 years ago

Okay, that's the way I was envisioning a quick hack. If it works, Time can always be updated to something more sophisticated.

HarlanH commented 11 years ago

Excellent question. In general, I've been thinking of DFs as being in-memory columnar relational database tables. At some point we'll need to build indexes. But the current design does not have row-labels, intentionally. Instead, I've argued for arbitrarily indexed columns, which tends to avoid the inclination to do things like re-order an entire table to get O(n log n) performance. And I think they'll eventually be cleaner to implement when we want to do memory-mapped or distributed DFs. So, to look up something in a Julia DF, you have a time column, and reference that to get a row number, then look up the data in another column. When we get indexes, which we don't currently have, that process should be O(1) fast.
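The lookup described here (time column to row number, then row number to another column) can be sketched in current Julia with plain vectors and a dictionary. This is only an illustration under assumptions: the name `row_of` is hypothetical, the dictionary stands in for whatever index structure DataFrames might eventually use, and the six dates and prices are the Apple rows quoted elsewhere in this thread.

```julia
using Dates

# Two columns stored separately, as in a columnar table.
date = [Date(2012, 11, 15), Date(2012, 11, 16), Date(2012, 11, 19),
        Date(2012, 11, 20), Date(2012, 11, 21), Date(2012, 11, 23)]
adj = [525.62, 527.68, 565.73, 560.91, 561.70, 571.50]

# Hypothetical column index: maps each date to its row number.
# Building it takes one pass over the column.
row_of = Dict(d => i for (i, d) in enumerate(date))

# Step 1: use the indexed time column to find the row number.
row = row_of[Date(2012, 11, 23)]

# Step 2: read the value at that row from another column.
price = adj[row]
```

A hash-based index like this gives expected constant-time lookups without reordering the table, at the cost of extra memory for the index and the need to rebuild or update it when rows change.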

Note that this approach is definitely distinct from how Pandas works or R's time-series stuff works.

Also note that if we decide an xts/zoo type of representation is sufficiently distinct from DataFrame's relational view that we need another type, we can certainly do so. Call it DataSeries, maybe, and have it live next to DataFrame and DataMatrix and DataStream...

On Fri, Nov 30, 2012 at 4:22 PM, milktrader notifications@github.com wrote:

Question about DataFrame(DataVecs, Index) ... once a Time type is designed and implemented is this how we would replicate zoo, xts classes from R?

I understand that the reason xts is a matrix indexed by time instead of a data.frame indexed by time (not sure about zoo) is because of speed. It might be that a DataFrame indexed by Time type will be just as fast, not sure.


johnmyleswhite commented 11 years ago

Having recently read a bunch of stuff about Pandas' Series type, I'm still unsure whether they're more than NamedVectors with well-crafted functions for doing time series computations.

milktrader commented 11 years ago

So the idea is to create a new Type and call it DataSeries and keep it under the DataFrames package? That sounds very reasonable. To give you an idea of what I'm getting at, here is a Julia DataFrame with some market data on Apple Stock:

julia> tail(C)
DataFrame  (6,3)
           Adj Simple_Return   sma50
[1,]    525.62     -0.020973 632.173
[2,]    527.68    0.00391918 629.383
[3,]    565.73     0.0721081 627.234
[4,]    560.91   -0.00851997 624.905
[5,]     561.7    0.00140843 622.944
[6,]     571.5      0.017447 621.222

The same time series in R, in xts/zoo format looks like this:

> tail(A)
              Adj Simple_Return    sma50
2012-11-15 525.62  -0.020973029 632.1730
2012-11-16 527.68   0.003919181 629.3830
2012-11-19 565.73   0.072108096 627.2336
2012-11-20 560.91  -0.008519965 624.9048
2012-11-21 561.70   0.001408426 622.9442
2012-11-23 571.50   0.017447036 621.2224

So basically (not completely though), we are simply indexing out each row with a Time type value. Here is some more info about R object A:

> class(A)
[1] "xts" "zoo"

Okay, we said that already. Now, if we take the last value from column 1, we get:

> last(A[,1])
             Adj.
2012-11-23 571.5

And that value without the index is:

> as.numeric(last(A[,1]))
[1] 571.5

The class of the index of A:

> class(index(A))
[1] "Date"

If we cast it as a data.frame, we still have the indexing of rows in place and it looks good, but all is not the same:

> B = as.data.frame(A)
> tail(B)
              Adj Simple_Return    sma50
2012-11-15 525.62  -0.020973029 632.1730
2012-11-16 527.68   0.003919181 629.3830
2012-11-19 565.73   0.072108096 627.2336
2012-11-20 560.91  -0.008519965 624.9048
2012-11-21 561.70   0.001408426 622.9442
2012-11-23 571.50   0.017447036 621.2224

> B[nrow(B),1]
[1] 571.5

> class(index(B))
[1] "integer"

We can also recast it to a zoo or xts object:

> C = as.zoo(B)

> class(index(C))
[1] "Date"

> D = as.xts(B)

> class(index(D))
[1] "POSIXct" "POSIXt" 

xts was born from zoo so there is no need for redundancy that I can see. Once the super awesome base Time type is implemented, DataSeries can simply be under the DataFrame type as a matrix indexed by Time.

StefanKarpinski commented 11 years ago

I have to wonder if there's not some generalization to be had here: is a DataSeries just a DataFrame with a particular column providing its ordering? I know there's kind of special stuff that can be done with time, but is this actually an instance of a more general thing? Can't we order data by other attributes in the same way?
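As a rough illustration of the generalization being asked about, ordering a table by any one of its columns can be sketched in current Julia with a named tuple of vectors. The column names `key` and `value` and the data are made up for this example, and nothing here is the DataFrames API.

```julia
# A tiny table as a named tuple of equal-length columns (illustrative data).
cols = (key = [3, 1, 2], value = ["c", "a", "b"])

# The ordering column yields a permutation of row numbers ...
order = sortperm(cols.key)

# ... which is then applied to every column, keeping rows intact.
sorted = map(c -> c[order], cols)
```

Nothing in this sketch depends on the ordering column holding times; any column with a defined sort order would serve.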

HarlanH commented 11 years ago

I guess the question is whether we want to build data structures that are/will be focused on order, or data structures that are/will be focused on relational operations à la split-apply-combine and join. The Pandas DFs are definitely optimized for the former. I think there's some benefit to focusing on the latter, and particularly to doing so with an eye towards implementations that support very large memory-mapped or distributed data. Ordering becomes something you want to avoid dealing with when the data sets are enormous, I think...

Which is why having DataSeries be a specialization of DataMatrix (or maybe the same thing as DataMatrix) makes sense to me.

On Sat, Dec 1, 2012 at 10:51 AM, Stefan Karpinski notifications@github.com wrote:

I have to wonder if there's not some generalization to be had here: is a DataSeries just a DataFrame with a particular column providing its ordering? I know there's kind of special stuff that can be done with time, but is this actually an instance of a more general thing? Can't we order data by other attributes in the same way?


johnmyleswhite commented 11 years ago

There is definitely a generalization: check out the Series type in Pandas, which just labels the entries of a vector, then guarantees that all operations on them are aligned regardless of the prima facie order of the entries. The Series arithmetic ops look something like:

series1 = {:a => 1, :b => 3}
series2 = {:c => 3, :a => 1}

@assert series1 + series2 .== {:a => 2, :b => NA, :c => NA}
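The pseudocode above uses the dictionary literal syntax and the NA value of the time. A minimal sketch of the same label-aligned addition in current Julia follows, assuming `missing` plays the role of NA. The helper name `aligned_add` is hypothetical, and this is not how Pandas or DataFrames implement alignment.

```julia
# Hypothetical helper: adds two label => value dictionaries entry by entry,
# matching on labels rather than on position. A label present in only one
# input produces `missing` in the result.
function aligned_add(s1::AbstractDict, s2::AbstractDict)
    labels = union(Set(keys(s1)), Set(keys(s2)))
    return Dict(k => (haskey(s1, k) && haskey(s2, k) ? s1[k] + s2[k] : missing)
                for k in labels)
end

series1 = Dict(:a => 1, :b => 3)
series2 = Dict(:c => 3, :a => 1)

# :a appears in both inputs, so its values are added;
# :b and :c each appear in only one, so they come out as missing.
total = aligned_add(series1, series2)
```

Note that `missing == missing` evaluates to `missing` in Julia, so checking a result like this calls for `ismissing` or `isequal` rather than `==`.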

johnmyleswhite commented 11 years ago

I have a pretty strong preference that DataMatrix eventually just be an alias for a DataArray with two dimensions. I think the DataSeries will eventually be worth building, but should probably evolve separately. To me (at least) the goals of a DataSeries just seem separate from the goals of a DataFrame.

StefanKarpinski commented 11 years ago

To me (at least) the goals of a DataSeries just seem separate from the goals of a DataFrame.

Fair enough. That's a hard judgement call, but an important one.

tshort commented 11 years ago

I think this can go either way. In R, I use both zoo objects and data.tables for time series stuff. Both approaches have their strengths. It just depends on what someone has the itch to develop.

There is some simple indexing stuff in DataFrames/src/indexing.jl (still experimental and not loaded). This is for indexing columns. So, you can play around with using a datetime column that is indexed.

johnmyleswhite commented 11 years ago

Just to be clear, I'm not even slightly opposed to making DataFrames as usable as possible for time series data. I just don't know enough about the ways in which people are using those techniques to say that we can achieve high efficiency with our current design.

milktrader commented 11 years ago

I need to do a little more research on why zoo/xts took the indexed matrix route. xts did because zoo did, and zoo did, I'm pretty sure, because of performance issues. Those same issues may not arise with Julia's DataFrame. @tshort I'll take a look at indexing.jl as a start, thanks.

johnmyleswhite commented 11 years ago

Closed by c4ca8c7874c02fc16d084f7632c5cf9e45b2917d.

I've now gone through and cleaned up all of the constructors. There's a list of all of them below, which will go into the manual soon.

I like all of these constructors except for the tricked out DataVec[1, 2, NA] form, which I can't figure out how to make work for DataMatrix since a call to ref() can't contain semicolons. As you've probably noticed by now, I really, really dislike asymmetries between types.

A few of these constructors aren't actually implemented: neither DataMatrix(DataVec[1, NA], DataVec[NA, 2]) nor DataMatrix(1:3, 1:3) works. I'm not sure if they're really worth having, but I would like to hear from others.

All of the constructors that are implemented as of now have tests in tests/constructors.jl.

StefanKarpinski commented 11 years ago

wow. lovely.

doobwa commented 11 years ago

Nice, John.

On Thu, Dec 13, 2012 at 1:48 PM, Stefan Karpinski notifications@github.com wrote:

wow. lovely.
