Time-based DataFrame general discussion

milktrader commented 10 years ago

About a year ago I was advocating for time type row indices to be supported versus restricting them to Int. At that time, the general consensus was that it would be better to work around this idea and offer IndexedVector as an alternative to disrupting the package. Here is some history: https://github.com/JuliaStats/DataFrames.jl/issues/187

We now have the TimeData package, which creates types based on DataFrames/DataArrays as an alternative to the IndexedVector approach. Since there is general consensus that a time-based DataFrame can play an important role in time series analysis, I thought I'd open this issue: it might be time to revisit the idea of supporting row indices of time type, or at least to discuss it.

Currently, a time-based CSV in raw DataFrame form looks like this:

julia> using DataFrames, DataArrays, TimeSeries

julia> head(readtable(Pkg.dir("TimeSeries/test/data/spx.csv")))
6x7 DataFrame
|-------|--------------|--------|--------|--------|--------|----------|-----------|
| Row # | Date         | Open   | High   | Low    | Close  | Volume   | Adj Close |
| 1     | "1971-12-31" | 102.09 | 102.09 | 102.09 | 102.09 | 14040000 | 102.09    |
| 2     | "1971-12-30" | 101.78 | 101.78 | 101.78 | 101.78 | 13810000 | 101.78    |
| 3     | "1971-12-29" | 102.21 | 102.21 | 102.21 | 102.21 | 17150000 | 102.21    |
| 4     | "1971-12-28" | 101.95 | 101.95 | 101.95 | 101.95 | 15090000 | 101.95    |
| 5     | "1971-12-27" | 100.95 | 100.95 | 100.95 | 100.95 | 11890000 | 100.95    |
| 6     | "1971-12-23" | 100.74 | 100.74 | 100.74 | 100.74 | 16000000 | 100.74    |

TimeSeries takes the "Date" column, converts the string to DateTime format (formerly Calendar), and assigns the IndexedVector designation to it. Currently, IndexedVector has been sequestered as work is being done on core DataFrames functionality. https://github.com/JuliaStats/TimeSeries.jl/issues/48
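As a rough illustration of the string-to-date conversion step described above, here is a minimal sketch. It assumes a modern Julia (1.x) with the `Dates` standard library, which postdates this thread, and it is not the TimeSeries implementation; the names `raw`, `fmt`, `dates`, and `sorted_dates` are made up for the example.

```julia
using Dates  # standard library in Julia 1.x (an assumption; the thread predates it)

# Date strings as they appear in the "Date" column above
raw = ["1971-12-31", "1971-12-30", "1971-12-29"]

# Build the format once and parse each string into a Date value
fmt = DateFormat("yyyy-mm-dd")
dates = [Date(s, fmt) for s in raw]

# The file is newest-first; a time index is usually wanted in ascending order
order = sortperm(dates)
sorted_dates = dates[order]
```

The same permutation `order` could be applied to the other columns so that the rows stay aligned with the sorted dates.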

There is the temptation after looking at the above DataFrame to make rowindices::AbstractVector{Date{ISOCalendar}} versus the current rowindices::AbstractVector{Int}, but this places the restriction (I think) that rows must be unique and ordered.

Clearly there would be a lot more code than that involved, and it would run counter to the principles of DataFrames. I don't see DataFrames supporting that natively. IndexedVector time column is a solution. The TimeData package is another solution.

This is simply a summary of where we are. We probably need some use cases to demonstrate how DataFrames work with time series analysis, and what each approach offers. It might be the case that what we have is good enough, and that most of the load for time series analysis should be shouldered by the still-in-prototype TimeArray type.

nalimilan commented 10 years ago

> There is the temptation after looking at the above DataFrame to make rowindices::AbstractVector{Date{ISOCalendar}} versus the current rowindices::AbstractVector{Int}, but this places the restriction (I think) that rows must be unique and ordered.

Why would there be a difference between Int and ISOCalendar or DateTime?

I think you should list more precisely what you need that standard DataFrames do not support to help the discussion.

milktrader commented 10 years ago

Where I was going with changing the type of row indices was that you could do things like mydf[date(1980,1,1):days(4):date(1980,12,1)] and other similar getindex operations. I'm not proposing it gets implemented at all though. In fact, I think it would be too disruptive to the data structure.

nalimilan commented 10 years ago

This can be easily achieved if you create a new type inheriting from DataFrame with convenience indexing methods. Though in that case, I'd rather make the syntax be mydf[date(1980,1,1):days(4):date(1980,12,1),:] to avoid confusion.
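To make the idea concrete, here is a minimal sketch of such a convenience indexing method. Everything in it is an assumption for illustration: it is written for a modern Julia (1.x) with the `Dates` standard library, so `Date` and `Day` stand in for the `date` and `days` calls above; the type `DatedTable` is hypothetical; it wraps the data rather than inheriting from it; and a plain `Matrix` stands in for the DataFrame so the example has no dependencies.

```julia
using Dates

# Hypothetical wrapper: a sorted Date index stored next to the values.
# A plain Matrix stands in for the DataFrame to keep the sketch self-contained.
struct DatedTable
    idx::Vector{Date}
    vals::Matrix{Float64}
    function DatedTable(idx::Vector{Date}, vals::Matrix{Float64})
        length(idx) == size(vals, 1) ||
            error(length(idx), " idx entries, but ", size(vals, 1), " rows of data")
        issorted(idx) || error("idx must be sorted in ascending order")
        new(idx, vals)
    end
end

# Row selection by a range of dates, keeping all columns.
# Dates in the range that are absent from the index are silently skipped.
function Base.getindex(t::DatedTable, r::StepRange{Date}, ::Colon)
    keep = findall(in(r), t.idx)
    DatedTable(t.idx[keep], t.vals[keep, :])
end

# Ten consecutive days, two columns of values
idx = collect(Date(1980, 1, 1):Day(1):Date(1980, 1, 10))
t = DatedTable(idx, reshape(collect(1.0:20.0), 10, 2))

# Every fourth day, all columns
sub = t[Date(1980, 1, 1):Day(4):Date(1980, 1, 10), :]
```

Requiring the trailing `:` keeps the call visibly a row selection, which is the confusion the two-argument syntax is meant to avoid. Whether missing dates should be skipped or raise an error is a design choice this sketch does not settle.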

milktrader commented 10 years ago

Yes, I think your indexing syntax is more Julian.

Would the new type be something along the lines of what the TimeData package implements?

type Timedata{T} <: AbstractTimedata
    vals::DataFrame
    idx::Array{T, 1}

    function Timedata(vals::DataFrame, idx::Array{T, 1})
        chkIdx(idx)
        if(size(vals, 1) != length(idx))
            if (length(idx) == 0) | (size(vals, 1) == 0)
                return new(DataFrame([]), Array{T, 1}[])
            end
            error(length(idx), " idx entries, but ", size(vals, 1), " rows of data")
        end
        return new(vals, idx)
    end
end

JuliaStats / Roadmap.jl

Time-based DataFrame general discussion #10